* chore: send warning instead of error for unseen metrics and missing (metric, key)
* chore: update integration test
* chore: fix integration test
* chore: fix test
* chore: add unit test for missing key
* feat: extend error responses with new error struct
* fix: enriched error for dashboard api
* fix: merge issues
* fix: reverted dashboards changes and add for cloud integrations
* fix: delete file
* fix: add back file
* fix: added a helper
* fix: removed invalid references
* fix: generate openapi
* fix: keeping additional along with suggestion
* Revert "fix: keeping additional along with suggestion"
This reverts commit be30e2ffd2.
* fix: added suggestions per additional error
* fix: generate openapi
* fix: remove valid references
* fix: removed valid references for select and group by; only "did you mean" is kept
* fix: unit test
* fix: use binding for decoding for both ee and community
* fix: trim down suggestions methods
* fix: added renamed methods and moved stuff around
* fix: typo
* fix: removed json decoder
* fix: added empty check
* fix: retain additional
* fix: reverted restructuring of file
* feat(statsreporter): expose collected stats via GET /api/v1/stats
Extract per-org stats collection out of the analytics reporter into an
always-on Aggregator (collector fan-out + telemetry-store counts) shared
by the reporter and a new HTTP handler. The GET /api/v1/stats endpoint
returns the caller's org stats regardless of whether scheduled reporting
is enabled.
* refactor(statsreporter): collect telemetry stats via the querier
Move the trace/log/metric row-count and last-observed queries out of the
stats aggregator and into the querier, which now implements
statsreporter.StatsCollector. The aggregator becomes a pure collector
fan-out and no longer depends on telemetrystore; the querier is wired in
as one of the stats collectors.
* chore: regenerate openapi spec and frontend client
Backend docs/api/openapi.yml gains the GET /api/v1/stats (GetStats)
operation; the Orval client gains a useGetStats hook and GetStats200
type.
* chore: remove comment from querier Collect
* fix(statsreporter): use MustNewUUID for org from claims
Claims come from validated auth context, so the org UUID is guaranteed
valid; drop the dead NewUUID error branch.
* fix(flagger): use MustNewUUID for org from claims
Claims come from validated auth context, so the org UUID is guaranteed
valid; drop the dead NewUUID error branch.
* docs(contributing): note MustNewUUID for IDs from claims
* perf(querier): combine count and last-observed into one query per signal
Each signal's COUNT(*) and max(timestamp) scan the same table, so fetch
both in a single query — 3 queries instead of 6. Same emitted keys and
empty-table guard.
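
A minimal sketch of the single-query idea, assuming hypothetical table and column names (the real querier builds its statements differently):

```go
package main

import "fmt"

// statsQuery builds one statement that returns both the row count and the
// most recent timestamp, so the table is scanned once instead of twice.
// The column aliases are illustrative, not the keys the querier emits.
func statsQuery(table, tsColumn string) string {
	return fmt.Sprintf("SELECT count(*) AS total, max(%s) AS last_observed FROM %s", tsColumn, table)
}

func main() {
	// Hypothetical table names, one per signal: three queries in total.
	for _, table := range []string{"traces_table", "logs_table", "metrics_table"} {
		fmt.Println(statsQuery(table, "timestamp"))
	}
}
```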
* fix: add check for percentile aggregation for non-histogram metrics
* test: correct errors pkg in test file
* fix: catch type related errors in querier
* fix: remove comparison related tests
---------
Co-authored-by: Srikanth Chekuri <srikanth.chekuri92@gmail.com>
* fix: show warning for non-existent cost meter metrics
* chore: lint fix by removing unused list
* chore: py fmt add new line
* chore: missing newline between tests
* fix: no warnings or errors for internal metrics
* fix: pylint fix by adding new line
* fix: lint fix in test
* fix(telemetrystore): upgrade clickhouse-go to v2.44.0 to fix connection-pool slot leak
clickhouse-go v2.43.0 introduced connection-pool slot leaks triggered by context
cancellation: acquire() failed to release the pool slot when idle.Get returned a
cancellation error (ClickHouse/clickhouse-go#1759), and batch.Close() never released
the connection when closeQuery() failed on a cancelled context
(ClickHouse/clickhouse-go#1795). Both leak slots until the pool is exhausted and every
query fails with 'acquire conn timeout'. Both are fixed in v2.44.0.
v2.44.0 adds HasData() to the driver.Rows interface, which the test mock did not
implement. Swap the mock to the SigNoz fork github.com/SigNoz/clickhouse-go-mock
v0.14.0, which implements HasData() and tracks v2.44.0.
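
The class of leak described above can be sketched with a toy pool. This is not clickhouse-go's code; the `pool` type and its methods are hypothetical and only show why the error path has to hand the slot back:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// pool is a toy connection pool: slots bounds how many connections may be
// checked out at once.
type pool struct {
	slots chan struct{}
}

func newPool(size int) *pool { return &pool{slots: make(chan struct{}, size)} }

var errAcquireTimeout = errors.New("acquire conn timeout")

// acquire reserves a slot, then simulates fetching an idle connection.
// If the context is already cancelled, the slot is released before
// returning; without that release the slot stays reserved forever.
func (p *pool) acquire(ctx context.Context) error {
	select {
	case p.slots <- struct{}{}:
	default:
		return errAcquireTimeout
	}
	if err := ctx.Err(); err != nil {
		<-p.slots // release on the error path; skipping this leaks the slot
		return err
	}
	return nil
}

// release returns a slot after a successful acquire.
func (p *pool) release() { <-p.slots }

// slotsHeldAfterCancelledAcquire runs the failing path once and reports how
// many slots remain reserved afterwards.
func slotsHeldAfterCancelledAcquire() int {
	p := newPool(1)
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	_ = p.acquire(ctx)
	return len(p.slots)
}

func main() {
	fmt.Println(slotsHeldAfterCancelledAcquire()) // prints 0: no slot leaked
}
```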
* feat(telemetrystore): emit clickhouse connection-pool metrics
Register OTel observable gauges that report the clickhouse connection-pool stats
from driver.Stats() on each collection cycle:
signoz.telemetrystore.connection.{open,idle,max_open,max_idle}. Plotting open against
max_open makes pool saturation (and leaks like the one fixed in the previous commit)
directly observable in Prometheus.
* fix: order by ignored in formula query
* fix: order by ignored in formula query
* fix: added integration test
* fix: revert integration test changes
* fix: added an independent integration test
* fix: make py-fmt
* fix: removed comment
---------
Co-authored-by: Srikanth Chekuri <srikanth.chekuri92@gmail.com>
Co-authored-by: Pandey <vibhupandey28@gmail.com>
* fix(deps): upgrade dependencies to resolve high/critical security alerts
Upgrade pgx/v5 (v5.8.0→v5.9.2), prometheus (v0.310.0→v0.311.3),
gosaml2 (v0.9.0→v0.11.0), goxmldsig (v1.2.0→v1.6.0), and
urllib3 (2.6.3→2.7.0) to fix all open high/critical Dependabot alerts.
Adapt parser.ParseExpr calls to use the new Parser interface introduced
in prometheus v0.311.x.
* refactor: reuse a single PromQL parser instance instead of creating per call
Add Parser() to the prometheus.Prometheus interface so a single
parser.Parser is created at startup and shared across all consumers.
For the legacy v2 querier and PromQLFilterExtractor (which don't have
access to the Prometheus interface), store a parser instance on the
struct, created once during construction.
* refactor: centralize PromQL parser creation via prometheus.NewParser()
Add pkg/prometheus/parser.go with a Parser type alias and NewParser()
factory function, mirroring the existing Engine/NewEngine pattern.
All consumers now create parsers through this single entry point
instead of calling parser.NewParser(parser.Options{}) directly.
* refactor: move authtypes to coretypes
* refactor: migrate downstream consumers to coretypes Kind/Type/Relation
Wire all consumers of the typeable infrastructure through coretypes:
- Replace authtypes.Name/Type/Relation references with coretypes equivalents
- Switch Typeable singletons to constructor calls (authtypes.NewTypeableUser
etc.), with the embedded coretypes.Typeable populated so Kind/Type/Prefix/
Scope dispatch correctly through the embed
- Update dashboardtypes meta-resource declarations to use authtypes
constructors so they expose Tuples (authz callers need it)
- Rename Resource.Name field accesses to Resource.Kind to match the field
rename in authtypes.Resource
- Fix typeable_metaresource.go calling the plural NewTypeableMetaResources
helper — should be the singular NewTypeableMetaResource
go build ./... and go vet ./... clean (parser-generated unreachable-code
warnings are pre-existing). Authz unit tests pass.
* refactor(audittypes): unify Action with coretypes.Relation
Drop the duplicate Action enum from audittypes — the verbs (create/update/
delete) match coretypes.Relation exactly. Move PastTense onto Relation so
audit EventName derivation continues to work without a parallel hierarchy.
Also retype AuditDef.ResourceKind from string to coretypes.Kind so audit
declarations get the same regex validation that authz already enforces.
* refactor(retentiontypes): extract TTLSetting into its own package
TTLSetting is the bun model for ClickHouse TTL settings — has nothing to do
with the Organization domain it was previously co-located with in
pkg/types/organization.go. Moved to pkg/types/retentiontypes/ alongside the
ClickHouse reader that's its sole consumer.
No schema change; the bun table tag (table:ttl_setting) is unchanged.
* chore(openapi): regenerate spec for coretypes.Relation and Resource.Kind
* chore(frontend): regenerate API client and migrate Resource.name → Resource.kind
Regenerated TypeScript API types after the AuthtypesResource field rename
and the new CoretypesRelation enum. Updated:
- frontend/scripts/generate-permissions-type.cjs to read `r.kind` from the
/api/v1/authz/resources response and emit `kind:` in the static
permissions.type.ts file.
- frontend/src/hooks/useAuthZ/{permissions.type,types,utils,useAuthZ}.tsx:
Resource.name → Resource.kind throughout.
- frontend/src/container/RolesSettings/{utils.tsx,__tests__/utils.test.ts}:
same field migration.
- frontend/src/components/createGuardedRoute/createGuardedRoute.test.tsx:
same.
- useAuthZ/utils.ts: cast string relations to CoretypesRelationDTO at the
AuthtypesTransactionDTO boundary now that relation is an enum, not a raw
string.
yarn generate:api passes (orval generation + lint + typecheck).
* refactor: migrate downstream consumers to Resource/Verb rename
* chore(openapi): regenerate spec for Resource/Verb rename
* feat(coretypes): add ListResources accessor with stable sort
* feat(cmd): add 'generate authz' subcommand for permissions type
* refactor(authz): drop runtime authz/resources endpoint
* refactor(frontend): consume static permissions.type.ts directly
* chore(frontend): regenerate Orval client without authz/resources
* ci: move authz schema check from jsci to goci
* refactor(coretypes): move Selector/Object/Transaction from authtypes
* feat(coretypes): add managed role names and permission policy
* feat(coretypes): add Registry assembling resources, types, and managed-role transactions
* refactor(authz): wire *coretypes.Registry; drop RegisterTypeable
* refactor(cmd): wire coretypes.NewRegistry into server bootstraps
* chore: regenerate openapi spec for authtypes -> coretypes type moves
* chore(frontend): regenerate API client for Authtypes -> Coretypes type moves
* refactor(coretypes): rename GettableResource to ResourceRef
* refactor(authz): collapse Registry around static data; bridge once at construction
* refactor(coretypes): tighten Registry, restore anonymous public-dashboard grant
Drops passthrough fields from coretypes.Registry; adds an O(1) lookup map
for NewResourceFromTypeAndKind; replaces stringly-typed Type compares with
Type.Equals; removes the now-redundant getUniqueTypes helper. Restores the
signoz-anonymous read grant on metaresource/public-dashboard that was
silently dropped, and removes the invalid signoz-admin/VerbCreate/TypeUser
entry that panicked at startup.
* chore: regenerate openapi spec for coretypes -> authtypes type moves
* chore(frontend): regenerate API client for Coretypes -> Authtypes type moves
* fix(authz): disambiguate kind→type by relation, preserve multi-part selectors
permissions.type.ts now lists the same kind (dashboard, role,
public-dashboard) under both metaresource and metaresources, so the prior
kind→type map silently overwrote one with the other. Resolve the type
using the requesting relation's allowed types, and slice the selector at
the first colon so multi-part selectors (e.g. id:version) round-trip
correctly. Updates useAuthZ.test.tsx to use the regenerated kind field.
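
The actual change lives in the frontend TypeScript; the selector handling can be sketched in Go for illustration. The `kind:selector` object format and the `splitObject` name are assumptions made for this sketch:

```go
package main

import (
	"fmt"
	"strings"
)

// splitObject splits "kind:selector" at the first colon only, so a selector
// that itself contains colons (for example "id:version") is preserved.
func splitObject(object string) (kind, selector string, ok bool) {
	return strings.Cut(object, ":")
}

func main() {
	kind, selector, _ := splitObject("dashboard:abc123:v2")
	fmt.Println(kind, selector) // prints: dashboard abc123:v2
}
```

Splitting on every colon and taking the second element would instead truncate the selector to `abc123`.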
* refactor(authtypes): introduce Relation wrapper over coretypes.Verb
The authz layer modeled relations as raw coretypes.Verb everywhere, which
forced authz-level concepts (action, role-binding) to share a type with
schema-level enumerations. Introduce authtypes.Relation as a thin wrapper
over coretypes.Verb so the authz APIs (CheckWithTupleCreation, ListObjects,
GetObjects, PatchObjects, NewTuples, Transaction.Relation, etc.) can grow
authz-specific affordances without leaking back into coretypes.
Also reshuffles the static coretypes data into dedicated registry_*.go files
(types, kinds, verbs, resources, managed roles) to keep the schema declarations
isolated from the value types they configure.
* refactor(authtypes): expose Relation.Enum() and regenerate openapi spec
Without an Enum() method on Relation the openapi generator emitted an
empty AuthtypesRelation schema (no allowed values). Forward the enum
from the embedded coretypes.Verb so the wire contract is faithful.
* refactor(ee/authz): drop always-nil error returns from managed-role tuple helpers
getManagedRoleGrantTuples and getManagedRoleTransactionTuples never
returned a non-nil error, which the linter (unparam) had flagged. Drop
the unused error return; callers no longer need the err check either.
* chore(frontend): regenerate API client for authtypes.Relation
* fix(authz): satisfy go-lint — keyed Relation literal, drop redundant Verb selector
* refactor(coretypes): sync Kinds slice with full registry_kind declarations
* feat(coretypes): register metaresource and metaresources for all new kinds
Adds 21 metaresource and 21 metaresources entries (covering notification-channel,
route-policy, apdex-setting, auth-domain, session, cloud-integration,
cloud-integration-service, ingestion-key, ingestion-limit, pipeline,
user-preference, org-preference, quick-filter, ttl-setting, rule,
planned-maintenance, saved-view, trace-funnel, factor-password, factor-api-key,
license) so the authz schema covers every resource Kind declared in
registry_kind. Regenerates the static frontend permissions.type.ts to match.
* feat(coretypes): populate ManagedRoleToTransactions from signozapiserver routes
Enumerates every (verb, resource) tuple each managed role holds, derived
from the AdminAccess/EditAccess/ViewAccess middleware on routes in
pkg/apiserver/signozapiserver and the legacy http_handler in
pkg/query-service/app. Admin gets 123 transactions, editor 53, viewer 25,
anonymous keeps the single public-dashboard read.
* feat(coretypes): add integration kind with full CRUD for viewer/editor/admin
Install/uninstall/list integration routes (legacy /api/v1/integrations) all
sit behind ViewAccess, so every authenticated role gets the full CRUD
surface on (metaresource, integration) and (metaresources, integration).
Regenerates the static frontend permissions.type.ts to match.
* feat(coretypes): add subscription kind alongside license, document LCRUD shape
License covers the in-product license resource (Activate/Refresh/GetActive).
Subscription is the billing lifecycle (checkout/portal/billing) served by
ee/query-service routes. Both are admin-only and modeled with a uniform
LCRUD shape; comments call out which verbs actually map to routes versus
which are placeholders for shape parity (e.g. cancellation flows through
Stripe's portal, not an in-process delete).
* feat(coretypes): model telemetryresource for logs, traces, metrics
Mirrors the telemetryresource type from ee/authz/openfgaschema/base.fga
into coretypes: a read-only Type with three Kinds (logs, traces, metrics)
matching telemetrytypes.Signal. Selector is wildcard-only for v1; future
work can narrow per-service or per-environment when the use case lands.
Every managed role (admin/editor/viewer) gets read on each signal,
matching the schema's role#assignee grant. Anonymous stays unchanged.
Regenerates the static frontend permissions.type.ts.
* feat(coretypes): add audit-logs and meter-metrics kinds under telemetryresource
Audit logs (signal=logs, source=audit) and meter metrics (signal=metrics,
source=meter) are sensitive source-qualified telemetry streams that don't
belong under the broad read-grant every role gets on regular logs/traces/
metrics. Modeled as distinct Kinds so they can be permissioned
independently. Admin-only read for now; widen on explicit ask (e.g. an
auditor flow that needs viewer access to audit-logs). Regenerates the
static frontend permissions.type.ts.
* feat(coretypes): add logs-field and traces-field kinds for stored field config
GET/POST /logs/fields and /api/v2/traces/fields manage stored, mutable
field metadata (indexed/promoted columns) over each signal. They're
configuration, not telemetry data, so they sit under metaresource rather
than telemetryresource. Viewer reads, editor/admin update; no
create/delete since POST overwrites. Plural prefix (logs-field /
traces-field) matches the signal naming.
* chore(frontend): regenerate permissions.type.ts to match generate authz output
* feat(authz): add attach permissions to fga model
* fix(tests): use role permissions instead of dashboards
* fix(authz): couple of issues with register flow
* fix(authz): public dashboard read should be anonymous
* fix(tests): integration test for public dashboard access
---------
Co-authored-by: vikrantgupta25 <vikrant@signoz.io>
* chore: add json enabled as feature flag for FE
* fix: still using global bool
* feat: flagger integration in flow
* fix: flagger threaded into tests
* test: removed nil checks
* fix: minor changes
* chore: rename field
* chore: remove querybuilder helper
* fix: unit tests
* fix: correct env var
* fix: lint fix
* fix: lint
* chore: replace flag
* feat(audit): add telemetry audit query infrastructure
Add pkg/telemetryaudit/ with tables, field mapper, condition builder,
and statement builder for querying audit logs from signoz_audit database.
Add SourceAudit to source enum and integrate audit key resolution
into the metadata store.
* chore: address review comments
Comment out SourceAudit from Enum() until frontend is ready.
Use actual audit table constants in metadata test helpers.
* fix(audit): align field mapper with actual audit DDL schema
Remove resources_string (not in audit table DDL).
Add event_name as intrinsic column.
Resource context resolves only through the resource JSON column.
* feat(audit): add audit field value autocomplete support
Wire distributed_tag_attributes_v2 for signoz_audit into the
metadata store. Add getAuditFieldValues() and route SignalLogs +
SourceAudit to it in GetFieldValues().
* test(audit): add statement builder tests
Cover all three request types (list, time series, scalar) with
audit-specific query patterns: materialized column filters, AND/OR
conditions, limit CTEs, and group-by expressions.
* refactor(audit): inline field key map into test file
Remove test_data.go and inline the audit field key map directly
into statement_builder_test.go with a compact helper function.
* style(audit): move column map to const.go, use sqlbuilder.As in metadata
Move logsV2Columns from field_mapper.go to const.go to colocate all
column definitions. Switch getAuditKeys() to use sb.As() instead of
raw string formatting. Fix FieldContext alignment.
* fix(audit): align table names with schema migration
Migration uses logs/distributed_logs (not logs_v2/distributed_logs_v2).
Rename LogsV2TableName to LogsTableName and LogsV2LocalTableName to
LogsLocalTableName to match the actual signoz_audit DDL.
* feat(audit): add integration test fixture for audit logs
AuditLog fixture inserts into all 5 signoz_audit tables matching
the schema migration DDL: distributed_logs (no resources_string,
has event_name), distributed_logs_resource, distributed_tag_attributes_v2,
distributed_logs_attribute_keys, distributed_logs_resource_keys.
* fix(audit): rename tag_attributes_v2 to tag_attributes
Migration uses tag_attributes/distributed_tag_attributes (no _v2
suffix). Rename constants and update all references including the
integration test fixture.
* feat(audit): wire audit statement builder into querier
Add auditStmtBuilder to querier struct and route LogAggregation
queries with source=audit to it in all three dispatch locations
(main query, live tail, shiftedQuery). Create and wire the full
audit query stack in signozquerier provider.
* test(audit): add integration tests for audit log querying
Cover the documented query patterns: list all events, filter by
principal ID, filter by outcome, filter by resource name+ID,
filter by principal type, scalar count for alerting, and
isolation test ensuring audit data doesn't leak into regular logs.
* fix(audit): revert sb.As in getAuditKeys, fix fixture column_names
Revert getAuditKeys to use raw SQL strings instead of sb.As() which
incorrectly treated string literals as column references. Add explicit
column_names to all ClickHouse insert calls in the audit fixture.
* fix(audit): remove debug assertion from integration test
* feat(audit): internalize resource filter in audit statement builder
Build the resource filter internally pointing at
signoz_audit.distributed_logs_resource. Add LogsResourceTableName
constant. Remove resourceFilterStmtBuilder from constructor params.
Update test expectations to use the audit resource table.
* fix(audit): rename resource.name to resource.kind, move to resource attributes
Align with schema change from SigNoz/signoz#10826:
- signoz.audit.resource.name renamed to signoz.audit.resource.kind
- resource.kind and resource.id moved from event attributes to OTel
Resource attributes (resource JSON column)
- Materialized columns reduced from 7 to 5 (resource.kind and
resource.id no longer materialized)
* refactor(audit): use pytest.mark.parametrize for filter integration tests
Consolidate filter test functions into a single parametrized test.
6/8 tests passing; resource kind+ID filter and scalar count need
further investigation (resource filter JSON key extraction with
dotted keys, scalar response format).
* fix(audit): add source to resource filter for correct metadata routing
Add source param to telemetryresourcefilter.New so the resource
filter's key selectors include Source when calling GetKeysMulti.
Without this, audit resource keys route to signoz_logs metadata
tables instead of signoz_audit. Fix scalar test to use table
response format (columns+data, not rows).
* refactor(audit): reuse querier fixtures in integration tests
Add source param to BuilderQuery and build_scalar_query in the
querier fixture. Replace custom _build_audit_query and
_build_audit_ts_query helpers with BuilderQuery and
build_scalar_query from the shared fixtures.
* refactor(audit): remove wrapper helpers, inline make_query_request calls
Remove _query_audit_raw and _query_audit_scalar helpers. Use
make_query_request, BuilderQuery, and build_scalar_query directly.
Compute time window at test execution time via _time_window() to
avoid stale module-level timestamps.
* refactor(audit): inline _time_window into test functions
* style(audit): use snake_case for pytest parametrize IDs
* refactor(audit): inline DEFAULT_ORDER using build_order_by
Use build_order_by from querier fixtures instead of OrderBy/
TelemetryFieldKey dataclasses. Allow BuilderQuery.order to accept
plain dicts alongside OrderBy objects.
* refactor(audit): inline all data setup, use distinct scenarios per test
Remove _insert_standard_audit_events helper. Each test now owns its
data: list_all uses alert-rule/saved-view/user resource types,
scalar_count uses multiple failures from different principals (count=2),
leak test uses a single organization event. Parametrized filter tests
keep the original 5-event dataset.
* fix(audit): remove silent empty-string guards in metadata store
Remove guards that silently returned nil/empty when audit DB params
were empty. All call sites now pass real constants, so misconfiguration
should fail loudly rather than produce silent empty results.
* style(audit): remove module docstring from integration test
* style: formatting fix in tables file
* style: formatting fix in tables file
* fix: add auditStmtBuilder nil param to querier_test.go
* fix: fix fmt
* fix: show warning for non-existent cost meter metrics
* chore: lint fix by removing unused list
* chore: py fmt add new line
* fix: missing metric check on type instead of temporality
* test: fix unit tests by mocking type data
* test: unit tests
* revert: revert changes from meter branch
* revert: revert changes from meter branch
---------
Co-authored-by: Srikanth Chekuri <srikanth.chekuri92@gmail.com>
* refactor: move resourcefilter to pkg/telemetryresourcefilter
Move pkg/querybuilder/resourcefilter to pkg/telemetryresourcefilter
to align with the existing telemetry package naming convention
(telemetrylogs, telemetrytraces, telemetrymetrics, telemetrymeter).
The resource filter is a statement builder, not a query builder utility.
* refactor: internalize resource filter construction in statement builders
Each telemetry statement builder (logs, traces) now creates its own
resource filter internally instead of receiving it as an injected
dependency. This makes it impossible to wire the wrong resource table
and simplifies the provider.
Delete telemetryresourcefilter/tables.go — each telemetry package now
owns its resource table constant (LogsResourceV2TableName in
telemetrylogs, TracesResourceV3TableName in telemetrytraces).
* refactor: create field mapper and condition builder inside resource filter New
Remove fieldMapper and conditionBuilder params from
telemetryresourcefilter.New — they are always the same
(NewFieldMapper + NewConditionBuilder) so create them internally.
* fix: warning instead of error for dormant metrics in query range API
* fix: add missing else
* fix: keep track of present aggregations
* fix: note present aggregation after type is set
* test: integration test fix and new test
* chore: lint errors
---------
Co-authored-by: Srikanth Chekuri <srikanth.chekuri92@gmail.com>
* feat(serviceaccount): integrate service account
* feat(serviceaccount): integrate service account with better types
* feat(serviceaccount): fix lint and testing changes
* feat(serviceaccount): update integration tests
* feat(serviceaccount): fix formatting
* feat(serviceaccount): fix openapi spec
* feat(serviceaccount): update txlock to immediate to avoid busy snapshot errors
* feat(serviceaccount): add restrictions for factor_api_key
* feat(serviceaccount): add restrictions for factor_api_key
* feat: enabled service account and deprecated API Keys (#10715)
* feat: enabled service account and deprecated API Keys
* feat: deprecated API Keys
* feat: service account spec updates and role management changes
* feat: updated the error component for roles management
* feat: updated test case
* feat: updated the error component and added retries
* feat: refactored code and added retry to happen 3 times total
* feat: fixed feedbacks and added test case
* feat: refactored code and removed retry
* feat: updated the test cases
---------
Co-authored-by: SagarRajput-7 <162284829+SagarRajput-7@users.noreply.github.com>
* fix(querier): return proper HTTP status for PromQL timeout errors
PromQL queries hitting the context deadline were incorrectly returning
400 Bad Request with "invalid_input" because enhancePromQLError
unconditionally wrapped all errors as TypeInvalidInput. Extract
tryEnhancePromQLExecError to properly classify timeout, cancellation,
and storage errors before falling through to parse error handling.
Also make the PromQL engine timeout configurable via prometheus.timeout
config (default 2m) instead of hardcoding it.
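
A sketch of the classification order, using only the standard library. The `statusFor` name and the specific status codes chosen here are assumptions; the commit does not state which codes tryEnhancePromQLExecError maps to:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
)

// Sample errors standing in for what a query engine might return.
var (
	errQueryTimeout = fmt.Errorf("query timed out: %w", context.DeadlineExceeded)
	errQueryCancel  = fmt.Errorf("query cancelled: %w", context.Canceled)
	errParse        = errors.New("parse error: unexpected token")
)

// statusFor checks execution failures first and only then falls through to
// the "invalid input" case. The status codes are illustrative.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, context.DeadlineExceeded):
		return http.StatusGatewayTimeout
	case errors.Is(err, context.Canceled):
		return 499 // non-standard "client closed request"
	default:
		return http.StatusBadRequest
	}
}

func main() {
	for _, err := range []error{errQueryTimeout, errQueryCancel, errParse} {
		fmt.Println(statusFor(err), err)
	}
}
```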
* chore: refactor files
* fix(prometheus): validate timeout config and fix test setups
Add validation in prometheus.Config to reject zero timeout. Update all
test files to explicitly set Timeout: 2 * time.Minute in prometheus.Config
literals to avoid immediate query timeouts.
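
Roughly, the validation could look like the sketch below. The `Config` shape here is a simplified stand-in for prometheus.Config, and the error text is invented for illustration. A zero duration matters because a context created with a zero timeout is already expired when the query starts:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Config is a simplified, hypothetical stand-in for prometheus.Config.
type Config struct {
	Timeout time.Duration
}

// Validate rejects a zero (or negative) timeout.
func (c Config) Validate() error {
	if c.Timeout <= 0 {
		return errors.New("prometheus: timeout must be greater than zero")
	}
	return nil
}

// newConfig returns a config with the 2m default mentioned above.
func newConfig() Config { return Config{Timeout: 2 * time.Minute} }

func main() {
	fmt.Println(Config{}.Validate())
	fmt.Println(newConfig().Validate())
}
```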
* feat(middleware): add panic recovery middleware with TypeFatal error type
Add a global HTTP recovery middleware that catches panics, logs them
with OTel exception semantic conventions via errors.Attr, and returns
a safe user-facing error response. Introduce TypeFatal/CodeFatal for
unrecoverable failures and WithStacktrace to attach pre-formatted
stack traces to errors. Remove redundant per-handler panic recovery
blocks in querier APIs.
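
A minimal sketch of such a middleware with net/http and log/slog. The function names and the response text are assumptions, and the real middleware uses the project's own error types rather than http.Error:

```go
package main

import (
	"fmt"
	"log/slog"
	"net/http"
	"net/http/httptest"
	"runtime/debug"
)

// recoverMiddleware catches a panic from the wrapped handler, logs it with
// exception.* attributes, and returns a generic 500 to the client.
func recoverMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if rec := recover(); rec != nil {
				slog.ErrorContext(r.Context(), "panic recovered",
					slog.Any("exception.message", rec),
					slog.String("exception.stacktrace", string(debug.Stack())),
				)
				http.Error(w, "internal error", http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, r)
	})
}

// statusOfPanickingHandler drives a handler that panics through the
// middleware and reports the status code the client would see.
func statusOfPanickingHandler() int {
	h := recoverMiddleware(http.HandlerFunc(func(http.ResponseWriter, *http.Request) {
		panic("boom")
	}))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Code
}

func main() {
	fmt.Println(statusOfPanickingHandler()) // prints 500
}
```

With one wrapper at the top of the handler chain, individual handlers no longer need their own recover blocks.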
* style(errors): keep WithStacktrace call on same line in test
* fix(middleware): replace fmt.Errorf with errors.New in recovery test
* feat(middleware): add request context to panic recovery logs
Capture request body before handler runs and include method, path, and
body in panic recovery logs using OTel semconv attributes. Improve error
message to direct users to GitHub issues or support.
* feat(instrumentation): add OTel exception semantic convention log handler
Add a loghandler.Wrapper that enriches error log records with OpenTelemetry
exception semantic convention attributes (exception.type, exception.code,
exception.message, exception.stacktrace).
- Add errors.Attr() helper for standardized error logging under "exception" key
- Add exception log handler that replaces raw error attrs with structured group
- Wire exception handler into the instrumentation SDK logger chain
- Remove LogValue() from errors.base as the handler now owns structuring
* refactor: replace "error", err with errors.Attr(err) across codebase
Migrate all slog error logging from ad-hoc "error", err key-value pairs
to the standardized errors.Attr(err) helper, enabling the exception log
handler to enrich these logs with OTel semantic convention attributes.
* refactor: enforce attr-only slog style across codebase
Change sloglint from kv-only to attr-only, requiring all slog calls to
use typed attributes (slog.String, slog.Any, etc.) instead of key-value
pairs. Convert all existing kv-style slog calls in non-excluded paths.
* refactor: tighten slog.Any to specific types and standardize error attrs
- Replace slog.Any with slog.String for string values (action, key, where_clause)
- Replace slog.Any with slog.Uint64 for uint64 values (start, end, step, etc.)
- Replace slog.Any("err", err) with errors.Attr(err) in dispatcher and segment analytics
- Replace slog.Any("error", ctx.Err()) with errors.Attr in factory registry
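
The difference between the two styles can be sketched as follows. `errAttr` is a hypothetical stand-in for the errors.Attr helper; the only thing taken from the commits above is that errors are logged under the "exception" key:

```go
package main

import (
	"errors"
	"log/slog"
	"os"
)

var errSample = errors.New("connection refused")

// errAttr is a hypothetical stand-in for errors.Attr: it wraps an error in a
// typed attribute under the "exception" key.
func errAttr(err error) slog.Attr { return slog.Any("exception", err) }

func main() {
	logger := slog.New(slog.NewTextHandler(os.Stdout, nil))

	// kv style, which an attr-only lint rule rejects:
	//   logger.Error("query failed", "error", errSample, "step", 60)

	// attr style, with specific constructors instead of slog.Any where the
	// value type is known:
	logger.Error("query failed",
		errAttr(errSample),
		slog.Uint64("step", 60),
		slog.String("action", "read"),
	)
}
```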
* fix(instrumentation): use Unwrapb message for exception.message
Use the explicit error message (m) from Unwrapb instead of
foundErr.Error(), which resolves to the inner cause's message
for wrapped errors.
* feat(errors): capture stacktrace at error creation time
Store program counters ([]uintptr) in base errors at creation time
using runtime.Callers, inspired by thanos-io/thanos/pkg/errors. The
exception log handler reads the stacktrace from the error instead of
capturing at log time, showing where the error originated.
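
A sketch of capturing program counters at creation time with runtime.Callers. The `tracedError` type and its methods are hypothetical and much simpler than the project's base error:

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// tracedError stores the call stack as raw program counters, which is cheap
// to capture; symbol resolution is deferred until the trace is rendered.
type tracedError struct {
	msg string
	pcs []uintptr
}

func newTraced(msg string) *tracedError {
	pcs := make([]uintptr, 32)
	n := runtime.Callers(2, pcs) // skip runtime.Callers and newTraced itself
	return &tracedError{msg: msg, pcs: pcs[:n]}
}

func (e *tracedError) Error() string { return e.msg }

// Stacktrace resolves the stored program counters into function/file/line.
func (e *tracedError) Stacktrace() string {
	var b strings.Builder
	frames := runtime.CallersFrames(e.pcs)
	for {
		f, more := frames.Next()
		fmt.Fprintf(&b, "%s\n\t%s:%d\n", f.Function, f.File, f.Line)
		if !more {
			break
		}
	}
	return b.String()
}

func loadConfig() *tracedError { return newTraced("config missing") }

func main() {
	err := loadConfig()
	// The trace starts at loadConfig, where the error was created, no matter
	// where it is eventually logged.
	fmt.Print(err.Stacktrace())
}
```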
* fix(instrumentation): apply default log wrappers uniformly in NewLogger
Move correlation, filtering, and exception wrappers into NewLogger so
all call sites (including CLI loggers in cmd/) get them automatically.
* refactor(instrumentation): remove variadic wrappers from NewLogger
NewLogger no longer accepts arbitrary wrappers. The core wrappers
(correlation, filtering, exception) are hardcoded, preventing callers
from accidentally duplicating behavior.
* refactor: migrate remaining "error", <var> to errors.Attr across legacy paths
Replace all remaining "error", <variable> key-value pairs with
errors.Attr(<variable>) in pkg/query-service/ and ee/query-service/
paths that were missed in the initial migration due to non-standard
variable names (res.Err, filterErr, apiErrorObj.Err, etc).
* refactor(instrumentation): use flat exception.* keys instead of nested group
Use flat keys (exception.type, exception.code, exception.message,
exception.stacktrace) instead of a nested slog.Group in the exception
log handler.
* fix: check for metric type without query range constraint
* revert: revert check for metric type without query range constraint
* chore: move temporality+type fetcher to the case where it is actually used
* fix: don't send absent metrics to query builder
* chore: better package import name
* test: unit test add mock for metadata call (which is expected in the test's scenario)
* revert: revert seeding of absent metrics
* fix: throw a not found err if metric data is missing
* test: unit test add mock for metadata call (which is expected in the test's scenario)
* revert: no need for special err handling in threshold rule
* chore: add last seen info in err message
* test: fix broken dashboard test
* test: integration test for short time range query
* chore: python lint issue
* chore: upgrade prometheus/common to latest available version
* chore: upgrade prometheus/prometheus to latest available version
* chore: easy changes first
* chore: slightly unsure changes
* fix: correct imported version of semconv in sdk.go
* test: ut fix, just matched expected and actual nothing else
* test: ut fix, just matched expected and actual nothing else
* test: ut fix, just matched expected and actual nothing else
* test: ut fix, just matched expected and actual nothing else
* test: ut fix, pass no nil prometheus registry
* chore: upgrade go version in dockerfile to 1.25
* chore: no need for our own alert store callback
* chore: 1.25 bullseye is still an rc so shifting to bookworm
* fix: parallel calls for each query in readmultiple method
* chore: remove unused var
* Sync PagerDuty frontend defaults with Alertmanager v0.31
Applied via @cursor push command
* chore: make ctx the first param
---------
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Use the new `flagger` package to power the following feature flags in the codebase:
- [x] `use_span_metrics`
- [x] `kafka_span_eval`
- [x] `interpolation_enabled`