* feat(audit): add telemetry audit query infrastructure
Add pkg/telemetryaudit/ with tables, field mapper, condition builder,
and statement builder for querying audit logs from signoz_audit database.
Add SourceAudit to source enum and integrate audit key resolution
into the metadata store.
* chore: address review comments
Comment out SourceAudit from Enum() until frontend is ready.
Use actual audit table constants in metadata test helpers.
* fix(audit): align field mapper with actual audit DDL schema
Remove resources_string (not in audit table DDL).
Add event_name as intrinsic column.
Resource context resolves only through the resource JSON column.
* feat(audit): add audit field value autocomplete support
Wire distributed_tag_attributes_v2 for signoz_audit into the
metadata store. Add getAuditFieldValues() and route SignalLogs +
SourceAudit to it in GetFieldValues().
* test(audit): add statement builder tests
Cover all three request types (list, time series, scalar) with
audit-specific query patterns: materialized column filters, AND/OR
conditions, limit CTEs, and group-by expressions.
* refactor(audit): inline field key map into test file
Remove test_data.go and inline the audit field key map directly
into statement_builder_test.go with a compact helper function.
* style(audit): move column map to const.go, use sqlbuilder.As in metadata
Move logsV2Columns from field_mapper.go to const.go to colocate all
column definitions. Switch getAuditKeys() to use sb.As() instead of
raw string formatting. Fix FieldContext alignment.
* fix(audit): align table names with schema migration
Migration uses logs/distributed_logs (not logs_v2/distributed_logs_v2).
Rename LogsV2TableName to LogsTableName and LogsV2LocalTableName to
LogsLocalTableName to match the actual signoz_audit DDL.
* feat(audit): add integration test fixture for audit logs
AuditLog fixture inserts into all 5 signoz_audit tables matching
the schema migration DDL: distributed_logs (no resources_string,
has event_name), distributed_logs_resource, distributed_tag_attributes_v2,
distributed_logs_attribute_keys, distributed_logs_resource_keys.
* fix(audit): rename tag_attributes_v2 to tag_attributes
Migration uses tag_attributes/distributed_tag_attributes (no _v2
suffix). Rename constants and update all references including the
integration test fixture.
* feat(audit): wire audit statement builder into querier
Add auditStmtBuilder to querier struct and route LogAggregation
queries with source=audit to it in all three dispatch locations
(main query, live tail, shiftedQuery). Create and wire the full
audit query stack in signozquerier provider.
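A minimal sketch of the source-based dispatch. It assumes a hypothetical `Source` string type, a `StatementBuilder` interface reduced to `Name()`, and a hypothetical `builderFor` helper. The real querier types and dispatch code differ.

```go
package main

import "errors"

// Source names the log store a query targets (hypothetical type).
type Source string

const (
	SourceDefault Source = ""
	SourceAudit   Source = "audit"
)

// StatementBuilder is a stand-in for the querier's statement builder interface.
type StatementBuilder interface {
	Name() string
}

// namedBuilder is a trivial StatementBuilder used for illustration.
type namedBuilder string

func (n namedBuilder) Name() string { return string(n) }

// querier keeps one builder per log source; field names are illustrative.
type querier struct {
	logStmtBuilder   StatementBuilder
	auditStmtBuilder StatementBuilder
}

// builderFor chooses the builder for a LogAggregation query. Sharing one
// helper keeps the main query, live tail, and shifted query paths in sync.
func (q *querier) builderFor(src Source) (StatementBuilder, error) {
	switch src {
	case SourceAudit:
		if q.auditStmtBuilder == nil {
			return nil, errors.New("audit statement builder is not configured")
		}
		return q.auditStmtBuilder, nil
	case SourceDefault:
		return q.logStmtBuilder, nil
	default:
		return nil, errors.New("unsupported log source: " + string(src))
	}
}
```

In this sketch, a nil `auditStmtBuilder` yields an explicit error rather than a nil-pointer panic.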
* test(audit): add integration tests for audit log querying
Cover the documented query patterns: list all events, filter by
principal ID, filter by outcome, filter by resource name+ID,
filter by principal type, scalar count for alerting, and
isolation test ensuring audit data doesn't leak into regular logs.
* fix(audit): revert sb.As in getAuditKeys, fix fixture column_names
Revert getAuditKeys to use raw SQL strings instead of sb.As() which
incorrectly treated string literals as column references. Add explicit
column_names to all ClickHouse insert calls in the audit fixture.
* fix(audit): remove debug assertion from integration test
* feat(audit): internalize resource filter in audit statement builder
Build the resource filter internally pointing at
signoz_audit.distributed_logs_resource. Add LogsResourceTableName
constant. Remove resourceFilterStmtBuilder from constructor params.
Update test expectations to use the audit resource table.
* fix(audit): rename resource.name to resource.kind, move to resource attributes
Align with schema change from SigNoz/signoz#10826:
- signoz.audit.resource.name renamed to signoz.audit.resource.kind
- resource.kind and resource.id moved from event attributes to OTel
Resource attributes (resource JSON column)
- Materialized columns reduced from 7 to 5 (resource.kind and
resource.id no longer materialized)
* refactor(audit): use pytest.mark.parametrize for filter integration tests
Consolidate filter test functions into a single parametrized test.
6/8 tests passing; resource kind+ID filter and scalar count need
further investigation (resource filter JSON key extraction with
dotted keys, scalar response format).
* fix(audit): add source to resource filter for correct metadata routing
Add source param to telemetryresourcefilter.New so the resource
filter's key selectors include Source when calling GetKeysMulti.
Without this, audit resource keys route to signoz_logs metadata
tables instead of signoz_audit. Fix scalar test to use table
response format (columns+data, not rows).
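The routing problem can be illustrated with a simplified selector. This sketch assumes a hypothetical `FieldKeySelector` with only `Name` and `Source` fields, and a hypothetical `metadataDatabase` helper. Only the database names `signoz_logs` and `signoz_audit` come from the text above.

```go
package main

// Source is a hypothetical stand-in for the telemetry source enum.
type Source string

const (
	SourceUnspecified Source = ""
	SourceAudit       Source = "audit"
)

// FieldKeySelector is a simplified, hypothetical version of the selector
// passed to GetKeysMulti. The fix is that Source travels with the selector.
type FieldKeySelector struct {
	Name   string
	Source Source
}

// metadataDatabase returns the database whose metadata tables should be
// consulted for a selector. If Source is dropped before this point, every
// lookup falls through to signoz_logs, which is the bug described above.
func metadataDatabase(sel FieldKeySelector) string {
	if sel.Source == SourceAudit {
		return "signoz_audit"
	}
	return "signoz_logs"
}
```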
* refactor(audit): reuse querier fixtures in integration tests
Add source param to BuilderQuery and build_scalar_query in the
querier fixture. Replace custom _build_audit_query and
_build_audit_ts_query helpers with BuilderQuery and
build_scalar_query from the shared fixtures.
* refactor(audit): remove wrapper helpers, inline make_query_request calls
Remove _query_audit_raw and _query_audit_scalar helpers. Use
make_query_request, BuilderQuery, and build_scalar_query directly.
Compute time window at test execution time via _time_window() to
avoid stale module-level timestamps.
* refactor(audit): inline _time_window into test functions
* style(audit): use snake_case for pytest parametrize IDs
* refactor(audit): inline DEFAULT_ORDER using build_order_by
Use build_order_by from querier fixtures instead of OrderBy/
TelemetryFieldKey dataclasses. Allow BuilderQuery.order to accept
plain dicts alongside OrderBy objects.
* refactor(audit): inline all data setup, use distinct scenarios per test
Remove _insert_standard_audit_events helper. Each test now owns its
data: list_all uses alert-rule/saved-view/user resource types,
scalar_count uses multiple failures from different principals (count=2),
leak test uses a single organization event. Parametrized filter tests
keep the original 5-event dataset.
* fix(audit): remove silent empty-string guards in metadata store
Remove guards that silently returned nil/empty when audit DB params
were empty. All call sites now pass real constants, so misconfiguration
should fail loudly rather than produce silent empty results.
* style(audit): remove module docstring from integration test
* style: formatting fix in tables file
* style: formatting fix in tables file
* fix: add auditStmtBuilder nil param to querier_test.go
* fix: fix fmt
* fix: show warning for non-existent cost meter metrics
* chore: lint fix by removing unused list
* chore: py fmt add new line
* fix: missing metric check on type instead of temporality
* test: fix unit tests by mocking type data
* test: unit tests
* revert: revert changes from meter branch
* revert: revert changes from meter branch
---------
Co-authored-by: Srikanth Chekuri <srikanth.chekuri92@gmail.com>
* refactor: move resourcefilter to pkg/telemetryresourcefilter
Move pkg/querybuilder/resourcefilter to pkg/telemetryresourcefilter
to align with the existing telemetry package naming convention
(telemetrylogs, telemetrytraces, telemetrymetrics, telemetrymeter).
The resource filter is a statement builder, not a query builder utility.
* refactor: internalize resource filter construction in statement builders
Each telemetry statement builder (logs, traces) now creates its own
resource filter internally instead of receiving it as an injected
dependency. This makes it impossible to wire the wrong resource table
and simplifies the provider.
Delete telemetryresourcefilter/tables.go — each telemetry package now
owns its resource table constant (LogsResourceV2TableName in
telemetrylogs, TracesResourceV3TableName in telemetrytraces).
* refactor: create field mapper and condition builder inside resource filter New
Remove fieldMapper and conditionBuilder params from
telemetryresourcefilter.New — they are always the same
(NewFieldMapper + NewConditionBuilder) so create them internally.
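A sketch of the constructor shape after these refactors, using empty stand-in types. The caller supplies only the resource table, and the field mapper and condition builder are created inside the constructor. All type and function names here are hypothetical. The table name is the one given earlier for the audit resource filter.

```go
package main

// fieldMapper and conditionBuilder are empty stand-ins for the real types.
type fieldMapper struct{}

type conditionBuilder struct{ fm *fieldMapper }

// resourceFilterBuilder is a hypothetical, trimmed resource filter statement builder.
type resourceFilterBuilder struct {
	table string
	fm    *fieldMapper
	cb    *conditionBuilder
}

// newResourceFilter takes only the table; the mapper and condition builder
// never vary, so they are built here instead of being injected.
func newResourceFilter(table string) *resourceFilterBuilder {
	fm := &fieldMapper{}
	return &resourceFilterBuilder{table: table, fm: fm, cb: &conditionBuilder{fm: fm}}
}

// auditResourceTable is the audit resource table named in this PR.
const auditResourceTable = "signoz_audit.distributed_logs_resource"

type auditStatementBuilder struct {
	resourceFilter *resourceFilterBuilder
}

// newAuditStatementBuilder creates its own resource filter, so a provider
// cannot accidentally wire it to another signal's resource table.
func newAuditStatementBuilder() *auditStatementBuilder {
	return &auditStatementBuilder{resourceFilter: newResourceFilter(auditResourceTable)}
}
```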
* fix: warning instead of error for dormant metrics in query range API
* fix: add missing else
* fix: keep track of present aggregations
* fix: note present aggregation after type is set
* test: integration test fix and new test
* chore: lint errors
---------
Co-authored-by: Srikanth Chekuri <srikanth.chekuri92@gmail.com>
* feat(serviceaccount): integrate service account
* feat(serviceaccount): integrate service account with better types
* feat(serviceaccount): fix lint and testing changes
* feat(serviceaccount): update integration tests
* feat(serviceaccount): fix formatting
* feat(serviceaccount): fix openapi spec
* feat(serviceaccount): update txlock to immediate to avoid busy snapshot errors
* feat(serviceaccount): add restrictions for factor_api_key
* feat(serviceaccount): add restrictions for factor_api_key
* feat: enabled service account and deprecated API Keys (#10715)
* feat: enabled service account and deprecated API Keys
* feat: deprecated API Keys
* feat: service account spec updates and role management changes
* feat: updated the error component for roles management
* feat: updated test case
* feat: updated the error component and added retries
* feat: refactored code and added retry to happen 3 times in total
* feat: addressed feedback and added test case
* feat: refactored code and removed retry
* feat: updated the test cases
---------
Co-authored-by: SagarRajput-7 <162284829+SagarRajput-7@users.noreply.github.com>
* fix(querier): return proper HTTP status for PromQL timeout errors
PromQL queries hitting the context deadline were incorrectly returning
400 Bad Request with "invalid_input" because enhancePromQLError
unconditionally wrapped all errors as TypeInvalidInput. Extract
tryEnhancePromQLExecError to properly classify timeout, cancellation,
and storage errors before falling through to parse error handling.
Also make the PromQL engine timeout configurable via prometheus.timeout
config (default 2m) instead of hardcoding it.
* chore: refactor files
* fix(prometheus): validate timeout config and fix test setups
Add validation in prometheus.Config to reject zero timeout. Update all
test files to explicitly set Timeout: 2 * time.Minute in prometheus.Config
literals to avoid immediate query timeouts.
* feat(middleware): add panic recovery middleware with TypeFatal error type
Add a global HTTP recovery middleware that catches panics, logs them
with OTel exception semantic conventions via errors.Attr, and returns
a safe user-facing error response. Introduce TypeFatal/CodeFatal for
unrecoverable failures and WithStacktrace to attach pre-formatted
stack traces to errors. Remove redundant per-handler panic recovery
blocks in querier APIs.
* style(errors): keep WithStacktrace call on same line in test
* fix(middleware): replace fmt.Errorf with errors.New in recovery test
* feat(middleware): add request context to panic recovery logs
Capture request body before handler runs and include method, path, and
body in panic recovery logs using OTel semconv attributes. Improve error
message to direct users to GitHub issues or support.
* feat(instrumentation): add OTel exception semantic convention log handler
Add a loghandler.Wrapper that enriches error log records with OpenTelemetry
exception semantic convention attributes (exception.type, exception.code,
exception.message, exception.stacktrace).
- Add errors.Attr() helper for standardized error logging under "exception" key
- Add exception log handler that replaces raw error attrs with structured group
- Wire exception handler into the instrumentation SDK logger chain
- Remove LogValue() from errors.base as the handler now owns structuring
* refactor: replace "error", err with errors.Attr(err) across codebase
Migrate all slog error logging from ad-hoc "error", err key-value pairs
to the standardized errors.Attr(err) helper, enabling the exception log
handler to enrich these logs with OTel semantic convention attributes.
* refactor: enforce attr-only slog style across codebase
Change sloglint from kv-only to attr-only, requiring all slog calls to
use typed attributes (slog.String, slog.Any, etc.) instead of key-value
pairs. Convert all existing kv-style slog calls in non-excluded paths.
* refactor: tighten slog.Any to specific types and standardize error attrs
- Replace slog.Any with slog.String for string values (action, key, where_clause)
- Replace slog.Any with slog.Uint64 for uint64 values (start, end, step, etc.)
- Replace slog.Any("err", err) with errors.Attr(err) in dispatcher and segment analytics
- Replace slog.Any("error", ctx.Err()) with errors.Attr in factory registry
* fix(instrumentation): use Unwrapb message for exception.message
Use the explicit error message (m) from Unwrapb instead of
foundErr.Error(), which resolves to the inner cause's message
for wrapped errors.
* feat(errors): capture stacktrace at error creation time
Store program counters ([]uintptr) in base errors at creation time
using runtime.Callers, inspired by thanos-io/thanos/pkg/errors. The
exception log handler reads the stacktrace from the error instead of
capturing at log time, showing where the error originated.
* fix(instrumentation): apply default log wrappers uniformly in NewLogger
Move correlation, filtering, and exception wrappers into NewLogger so
all call sites (including CLI loggers in cmd/) get them automatically.
* refactor(instrumentation): remove variadic wrappers from NewLogger
NewLogger no longer accepts arbitrary wrappers. The core wrappers
(correlation, filtering, exception) are hardcoded, preventing callers
from accidentally duplicating behavior.
* refactor: migrate remaining "error", <var> to errors.Attr across legacy paths
Replace all remaining "error", <variable> key-value pairs with
errors.Attr(<variable>) in pkg/query-service/ and ee/query-service/
paths that were missed in the initial migration due to non-standard
variable names (res.Err, filterErr, apiErrorObj.Err, etc).
* refactor(instrumentation): use flat exception.* keys instead of nested group
Use flat keys (exception.type, exception.code, exception.message,
exception.stacktrace) instead of a nested slog.Group in the exception
log handler.
* fix: check for metric type without query range constraint
* revert: revert check for metric type without query range constraint
* chore: move temporality+type fetcher to the case where it is actually used
* fix: don't send absent metrics to query builder
* chore: better package import name
* test: add a mock for the metadata call in the unit test (the call is expected in this test's scenario)
* revert: revert seeding of absent metrics
* fix: throw a not found err if metric data is missing
* test: add a mock for the metadata call in the unit test (the call is expected in this test's scenario)
* revert: no need for special err handling in threshold rule
* chore: add last seen info in err message
* test: fix broken dashboard test
* test: integration test for short time range query
* chore: python lint issue
* chore: upgrade prometheus/common to latest available version
* chore: upgrade prometheus/prometheus to latest available version
* chore: easy changes first
* chore: slightly unsure changes
* fix: correct imported version of semconv in sdk.go
* test: ut fix, only aligned expected and actual values, nothing else
* test: ut fix, only aligned expected and actual values, nothing else
* test: ut fix, only aligned expected and actual values, nothing else
* test: ut fix, only aligned expected and actual values, nothing else
* test: ut fix, pass a non-nil prometheus registry
* chore: upgrade go version in dockerfile to 1.25
* chore: no need for our own alert store callback
* chore: the Go 1.25 bullseye image is still an RC, so switch to bookworm
* fix: parallel calls for each query in readmultiple method
* chore: remove unused var
* Sync PagerDuty frontend defaults with Alertmanager v0.31
Applied via @cursor push command
* chore: make ctx the first param
---------
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Use the new `flagger` package to power the following feature flags in the codebase:
- [x] `use_span_metrics`
- [x] `kafka_span_eval`
- [x] `interpolation_enabled`
This PR fulfills the requirements of #9069 by:
- Adding a golangci-lint directive (forbidigo) to disallow all fmt.Errorf usages.
- Replacing existing fmt.Errorf instances with structured errors from github.com/SigNoz/signoz/pkg/errors for consistent error classification and lint compliance.
- Verified lint and build integrity.
* feat(access-control): embed openfga in signoz
* feat(authz): rename access control to authz
* feat(authz): fix codeowners and go mod tidy
* feat(authz): fix lint
* feat(authz): update go version and move convertor to instrumentation
* feat(authz): some more lint issues
* feat(authz): some more lint issues
* feat(authz): some more lint issues
* feat(authz): fix more lint issues
* feat(authz): make logger converter interface
* feat(telemetry/meter): added base setup for telemetry meter signal
* feat(telemetry/meter): added metadata setup for meter
* feat(telemetry/meter): fix stmnt builder tests
* feat(telemetry/meter): test query range API fixes
* feat(telemetry/meter): improve error messages
* feat(telemetrymeter): step interval improvements
* feat(telemetrymeter): metadata changes and aggregate attribute changes
* feat(telemetrymeter): metadata changes and aggregate attribute changes
* feat(telemetrymeter): deprecate the signal and use aggregation instead
* feat(telemetrymeter): deprecate the signal and use aggregation instead
* feat(telemetrymeter): deprecate the signal and use aggregation instead
* feat(telemetrymeter): cleanup the types
* feat(telemetrymeter): introduce source for query
* feat(telemetrymeter): better naming for source in metadata
* feat(telemetrymeter): added quick filters for meter explorer
* feat(telemetrymeter): incorporate the new changes to stmnt builder
* feat(telemetrymeter): add the statement builder for the ranged cache queries
* feat(telemetrymeter): use meter aggregate keys
* feat(telemetrymeter): use meter aggregate keys
* feat(telemetrymeter): remove meter from complete bools
* feat(telemetrymeter): remove meter from complete bools
* feat(telemetrymeter): update the quick filters to use meter
## 📄 Summary
To reliably migrate alerts and dashboards, we need access to the telemetrystore to fetch some metadata, and during the migration I need to log details so that issues can be fixed later.
Key changes:
- Modified the migration to include telemetrystore and a logging provider (open to using a standard logger instead)
- To avoid the previous issues with imported dashboards failing during migration, I've ensured that imported JSON files are automatically transformed when migration is active
- Implemented detailed logic to handle dashboard migration cleanly and prevent unnecessary errors
- Separated the core migration logic from SQL migration code, as users from the dot metrics migration requested shareable code snippets for local migrations. This modular approach allows others to easily reuse the migration functionality.
Known limitation: I haven't registered the migration in this PR yet and will not merge it yet, so please review with that in mind.
For requestType `Trace`, we don't care about the timestamp in the RawRow.
- Handle zero timestamp values in the rawData response
- Simplify RawRow from `map[string]*any` to `map[string]any`, eliminating unnecessary pointer indirection.
## 📄 Summary
- Fix the order by for the time series result
- Add the statement builder for trace queries (it was supposed to be replaced by new development, but that never happened, so we continue to use the old table)
- Removed `pkg/types/telemetrytypes/virtualfield.go`, not used currently anywhere but causing circular import. Will re-introduce later.