Compare commits

...

5 Commits

| Author | SHA1 | Message | Date |
|---|---|---|---|
| Nikhil Soni | c295ef386d | chore(agent): merge and compact traces skill into single reference doc (combines trace-detail-architecture.md and TRACES_MODULE.md into one concise traces-module.md: 196 lines, down from 1119 combined; Co-Authored-By: Claude Opus 4.6) | 2026-02-18 10:56:20 +05:30 |
| Nikhil Soni | bf0394cc28 | chore(agent): add clickhouse-query skill, project settings, and update existing skills (Co-Authored-By: Claude Sonnet 4.5) | 2026-02-17 18:27:12 +05:30 |
| Nikhil Soni | fa08ca2fac | chore(agent): add skill to code review | 2026-02-17 14:08:58 +05:30 |
| Nikhil Soni | 08c53fe7e8 | docs: add few modules implemtation details (generated by claude code) | 2026-01-27 22:33:49 +05:30 |
| Nikhil Soni | c1fac00d2e | feat: add claude.md and github commands | 2026-01-27 22:33:12 +05:30 |
11 changed files with 2185 additions and 0 deletions

.claude/CLAUDE.md

@@ -0,0 +1,136 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
SigNoz is an open-source observability platform (APM, logs, metrics, traces) built on OpenTelemetry and ClickHouse. It provides a unified solution for monitoring applications with features including distributed tracing, log management, metrics dashboards, and alerting.
## Build and Development Commands
### Development Environment Setup
```bash
make devenv-up # Start ClickHouse and OTel Collector for local dev
make devenv-clickhouse # Start only ClickHouse
make devenv-signoz-otel-collector # Start only OTel Collector
make devenv-clickhouse-clean # Clean ClickHouse data
```
### Backend (Go)
```bash
make go-run-community # Run community backend server
make go-run-enterprise # Run enterprise backend server
make go-test # Run all Go unit tests
go test -race ./pkg/... # Run tests for specific package
go test -race ./pkg/querier/... # Example: run querier tests
```
### Integration Tests (Python)
```bash
cd tests/integration
uv sync # Install dependencies
make py-test-setup # Start test environment (keep running with --reuse)
make py-test # Run all integration tests
make py-test-teardown # Stop test environment
# Run specific test
uv run pytest --basetemp=./tmp/ -vv --reuse src/<suite>/<file>.py::test_name
```
### Code Quality
```bash
# Go linting (golangci-lint)
golangci-lint run
# Python formatting/linting
make py-fmt # Format with black
make py-lint # Run isort, autoflake, pylint
```
### OpenAPI Generation
```bash
go run cmd/enterprise/*.go generate openapi
```
## Architecture Overview
### Backend Structure
The Go backend follows a **provider pattern** for dependency injection:
- **`pkg/signoz/`** - IoC container that wires all providers together
- **`pkg/modules/`** - Business logic modules (user, organization, dashboard, etc.)
- **`pkg/<provider>/`** - Provider implementations following consistent structure:
- `<name>.go` - Interface definition
- `config.go` - Configuration (implements `factory.Config`)
- `<implname><name>/provider.go` - Implementation
- `<name>test/` - Mock implementations for testing
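For illustration only, a minimal sketch of a hypothetical `cache` provider following this layout (the real parts live in the separate files/packages listed above, and the exact `factory.Config` contract is not shown):
```go
// Hypothetical "cache" provider; in the real layout each part lives in
// its own file/package as listed above, inlined here for brevity.
package cache

import "context"

// cache.go - interface definition consumed by modules.
type Cache interface {
	Get(ctx context.Context, key string) ([]byte, bool)
	Set(ctx context.Context, key string, value []byte)
}

// config.go - configuration (would implement factory.Config).
type Config struct {
	TTLSeconds int `mapstructure:"ttl_seconds"`
}

// memorycache/provider.go - one concrete implementation.
type memoryCache struct{ data map[string][]byte }

func New(_ Config) Cache { return &memoryCache{data: map[string][]byte{}} }

func (m *memoryCache) Get(_ context.Context, key string) ([]byte, bool) {
	v, ok := m.data[key]
	return v, ok
}

func (m *memoryCache) Set(_ context.Context, key string, value []byte) {
	m.data[key] = value
}
```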
### Key Packages
- **`pkg/querier/`** - Query engine for telemetry data (logs, traces, metrics)
- **`pkg/telemetrystore/`** - ClickHouse telemetry storage interface
- **`pkg/sqlstore/`** - Relational database (SQLite/PostgreSQL) for metadata
- **`pkg/apiserver/`** - HTTP API server with OpenAPI integration
- **`pkg/alertmanager/`** - Alert management
- **`pkg/authn/`, `pkg/authz/`** - Authentication and authorization
- **`pkg/flagger/`** - Feature flags (OpenFeature-based)
- **`pkg/errors/`** - Structured error handling
### Enterprise vs Community
- **`cmd/community/`** - Community edition entry point
- **`cmd/enterprise/`** - Enterprise edition entry point
- **`ee/`** - Enterprise-only features
## Code Conventions
### Error Handling
Use the custom `pkg/errors` package instead of standard library:
```go
errors.New(typ, code, message) // Instead of errors.New()
errors.Newf(typ, code, message, args...) // Instead of fmt.Errorf()
errors.Wrapf(err, typ, code, msg, args...) // Wrap with context
```
Define domain-specific error codes:
```go
var CodeThingNotFound = errors.MustNewCode("thing_not_found")
```
### HTTP Handlers
Handlers are thin adapters in modules that:
1. Extract auth context from request
2. Decode request body using `binding` package
3. Call module functions
4. Return responses using `render` package
Register routes in `pkg/apiserver/signozapiserver/` with `handler.New()` and `OpenAPIDef`.
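A minimal sketch of that shape (the handler, module, `binding`, and `render` call signatures below are illustrative assumptions, not the exact SigNoz APIs):
```go
// Hypothetical thin handler; helper signatures are assumptions.
func (h *Handler) CreateThing(rw http.ResponseWriter, r *http.Request) {
	ctx := r.Context()

	// 1. Extract auth context from the request.
	claims, err := authtypes.ClaimsFromContext(ctx) // assumed helper
	if err != nil {
		render.Error(rw, err)
		return
	}

	// 2. Decode the request body via the binding package.
	var req ThingRequest
	if err := binding.JSON.BindBody(r.Body, &req); err != nil { // assumed signature
		render.Error(rw, err)
		return
	}

	// 3. Call the module function; business logic stays in the module.
	thing, err := h.module.Create(ctx, claims.OrgID, req)
	if err != nil {
		render.Error(rw, err)
		return
	}

	// 4. Return the response via the render package.
	render.Success(rw, http.StatusCreated, thing)
}
```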
### SQL/Database
- Use Bun ORM via `sqlstore.BunDBCtx(ctx)`
- Star schema with `organizations` as central entity
- All tables have `id`, `created_at`, `updated_at`, `org_id` columns
- Write idempotent migrations in `pkg/sqlmigration/`
- No `ON DELETE CASCADE` - handle deletes in application logic
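For example, a Bun model for a hypothetical metadata table following these conventions might look like:
```go
package types

import (
	"time"

	"github.com/uptrace/bun"
)

// Thing is a hypothetical metadata entity; every table carries
// id, created_at, updated_at, and org_id.
type Thing struct {
	bun.BaseModel `bun:"table:things"`

	ID        string    `bun:"id,pk"`
	CreatedAt time.Time `bun:"created_at,notnull"`
	UpdatedAt time.Time `bun:"updated_at,notnull"`
	OrgID     string    `bun:"org_id,notnull"` // links to organizations in the star schema
	Name      string    `bun:"name,notnull"`
}
```
Reads and writes then go through `sqlstore.BunDBCtx(ctx)` as noted above.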
### REST Endpoints
- Use plural resource names: `/v1/organizations`, `/v1/users`
- Use `me` for current user/org: `/v1/organizations/me/users`
- Follow RESTful conventions for CRUD operations
### Linting Rules (from .golangci.yml)
- Don't use `errors` package - use `pkg/errors`
- Don't use `zap` logger - use `slog`
- Don't use `fmt.Errorf` or `fmt.Print*`
## Testing
### Unit Tests
- Run with race detector: `go test -race ./...`
- Provider mocks are in `<provider>test/` packages
### Integration Tests
- Located in `tests/integration/`
- Use pytest with testcontainers
- Files prefixed with numbers for execution order (e.g., `01_database.py`)
- Always use `--reuse` flag during development
- Fixtures in `tests/integration/fixtures/`

.claude/settings.json

@@ -0,0 +1,15 @@
{
"permissions": {
"allow": [
"Read",
"Glob",
"Grep",
"Bash(git *)",
"Bash(make *)",
"Bash(cd *)",
"Bash(ls *)",
"Bash(go run *)",
"Bash(yarn run *)"
]
}
}


@@ -0,0 +1,21 @@
---
description: Write optimised ClickHouse queries for SigNoz dashboards (traces, errors, logs)
user_invocable: true
---
# Writing ClickHouse Queries for SigNoz Dashboards
Read [clickhouse-traces-reference.md](./clickhouse-traces-reference.md) for full schema and query reference before writing any query. It covers:
- All table schemas (`distributed_signoz_index_v3`, `distributed_traces_v3_resource`, `distributed_signoz_error_index_v2`, etc.)
- The mandatory resource filter CTE pattern and timestamp bucketing
- Attribute access syntax (standard, indexed, resource)
- Dashboard panel query templates (timeseries, value, table)
- Real-world query examples (span counts, error rates, latency, event extraction)
## Workflow
1. **Understand the ask**: What metric/data does the user want? (e.g., error rate, latency, span count)
2. **Pick the panel type**: Timeseries (time-series chart), Value (single number), or Table (rows).
3. **Build the query** following the mandatory patterns from the reference doc.
4. **Validate** the query uses all required optimizations (resource CTE, ts_bucket_start, indexed columns).


@@ -0,0 +1,460 @@
# ClickHouse Traces Query Reference for SigNoz
Source: https://signoz.io/docs/userguide/writing-clickhouse-traces-query/
All tables live in the `signoz_traces` database.
---
## Table Schemas
### distributed_signoz_index_v3 (Primary Spans Table)
The main table for querying span data. 30+ columns following OpenTelemetry conventions.
```sql
(
`ts_bucket_start` UInt64 CODEC(DoubleDelta, LZ4),
`resource_fingerprint` String CODEC(ZSTD(1)),
`timestamp` DateTime64(9) CODEC(DoubleDelta, LZ4),
`trace_id` FixedString(32) CODEC(ZSTD(1)),
`span_id` String CODEC(ZSTD(1)),
`trace_state` String CODEC(ZSTD(1)),
`parent_span_id` String CODEC(ZSTD(1)),
`flags` UInt32 CODEC(T64, ZSTD(1)),
`name` LowCardinality(String) CODEC(ZSTD(1)),
`kind` Int8 CODEC(T64, ZSTD(1)),
`kind_string` String CODEC(ZSTD(1)),
`duration_nano` UInt64 CODEC(T64, ZSTD(1)),
`status_code` Int16 CODEC(T64, ZSTD(1)),
`status_message` String CODEC(ZSTD(1)),
`status_code_string` String CODEC(ZSTD(1)),
`attributes_string` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
`attributes_number` Map(LowCardinality(String), Float64) CODEC(ZSTD(1)),
`attributes_bool` Map(LowCardinality(String), Bool) CODEC(ZSTD(1)),
`resources_string` Map(LowCardinality(String), String) CODEC(ZSTD(1)), -- deprecated
`resource` JSON(max_dynamic_paths = 100) CODEC(ZSTD(1)),
`events` Array(String) CODEC(ZSTD(2)),
`links` String CODEC(ZSTD(1)),
`response_status_code` LowCardinality(String) CODEC(ZSTD(1)),
`external_http_url` LowCardinality(String) CODEC(ZSTD(1)),
`http_url` LowCardinality(String) CODEC(ZSTD(1)),
`external_http_method` LowCardinality(String) CODEC(ZSTD(1)),
`http_method` LowCardinality(String) CODEC(ZSTD(1)),
`http_host` LowCardinality(String) CODEC(ZSTD(1)),
`db_name` LowCardinality(String) CODEC(ZSTD(1)),
`db_operation` LowCardinality(String) CODEC(ZSTD(1)),
`has_error` Bool CODEC(T64, ZSTD(1)),
`is_remote` LowCardinality(String) CODEC(ZSTD(1)),
-- Pre-indexed "selected" columns (use these instead of map access when available):
`resource_string_service$$name` LowCardinality(String) DEFAULT resources_string['service.name'] CODEC(ZSTD(1)),
`attribute_string_http$$route` LowCardinality(String) DEFAULT attributes_string['http.route'] CODEC(ZSTD(1)),
`attribute_string_messaging$$system` LowCardinality(String) DEFAULT attributes_string['messaging.system'] CODEC(ZSTD(1)),
`attribute_string_messaging$$operation` LowCardinality(String) DEFAULT attributes_string['messaging.operation'] CODEC(ZSTD(1)),
`attribute_string_db$$system` LowCardinality(String) DEFAULT attributes_string['db.system'] CODEC(ZSTD(1)),
`attribute_string_rpc$$system` LowCardinality(String) DEFAULT attributes_string['rpc.system'] CODEC(ZSTD(1)),
`attribute_string_rpc$$service` LowCardinality(String) DEFAULT attributes_string['rpc.service'] CODEC(ZSTD(1)),
`attribute_string_rpc$$method` LowCardinality(String) DEFAULT attributes_string['rpc.method'] CODEC(ZSTD(1)),
`attribute_string_peer$$service` LowCardinality(String) DEFAULT attributes_string['peer.service'] CODEC(ZSTD(1))
)
ORDER BY (ts_bucket_start, resource_fingerprint, has_error, name, timestamp)
```
### distributed_traces_v3_resource (Resource Lookup Table)
Used in the resource filter CTE pattern for efficient filtering by resource attributes.
```sql
(
`labels` String CODEC(ZSTD(5)),
`fingerprint` String CODEC(ZSTD(1)),
`seen_at_ts_bucket_start` Int64 CODEC(Delta(8), ZSTD(1))
)
```
### distributed_signoz_error_index_v2 (Error Events)
```sql
(
`timestamp` DateTime64(9) CODEC(DoubleDelta, LZ4),
`errorID` FixedString(32) CODEC(ZSTD(1)),
`groupID` FixedString(32) CODEC(ZSTD(1)),
`traceID` FixedString(32) CODEC(ZSTD(1)),
`spanID` String CODEC(ZSTD(1)),
`serviceName` LowCardinality(String) CODEC(ZSTD(1)),
`exceptionType` LowCardinality(String) CODEC(ZSTD(1)),
`exceptionMessage` String CODEC(ZSTD(1)),
`exceptionStacktrace` String CODEC(ZSTD(1)),
`exceptionEscaped` Bool CODEC(T64, ZSTD(1)),
`resourceTagsMap` Map(LowCardinality(String), String) CODEC(ZSTD(1)),
INDEX idx_error_id errorID TYPE bloom_filter GRANULARITY 4,
INDEX idx_resourceTagsMapKeys mapKeys(resourceTagsMap) TYPE bloom_filter(0.01) GRANULARITY 64,
INDEX idx_resourceTagsMapValues mapValues(resourceTagsMap) TYPE bloom_filter(0.01) GRANULARITY 64
)
```
### distributed_top_level_operations
```sql
(
`name` LowCardinality(String) CODEC(ZSTD(1)),
`serviceName` LowCardinality(String) CODEC(ZSTD(1))
)
```
### distributed_span_attributes_keys
```sql
(
`tagKey` LowCardinality(String) CODEC(ZSTD(1)),
`tagType` Enum8('tag' = 1, 'resource' = 2) CODEC(ZSTD(1)),
`dataType` Enum8('string' = 1, 'bool' = 2, 'float64' = 3) CODEC(ZSTD(1)),
`isColumn` Bool CODEC(ZSTD(1))
)
```
### distributed_span_attributes
```sql
(
`timestamp` DateTime CODEC(DoubleDelta, ZSTD(1)),
`tagKey` LowCardinality(String) CODEC(ZSTD(1)),
`tagType` Enum8('tag' = 1, 'resource' = 2) CODEC(ZSTD(1)),
`dataType` Enum8('string' = 1, 'bool' = 2, 'float64' = 3) CODEC(ZSTD(1)),
`stringTagValue` String CODEC(ZSTD(1)),
`float64TagValue` Nullable(Float64) CODEC(ZSTD(1)),
`isColumn` Bool CODEC(ZSTD(1))
)
```
---
## Mandatory Optimization Patterns
### 1. Resource Filter CTE
**Always** use a CTE to pre-filter resource fingerprints when filtering by resource attributes (service.name, environment, etc.). This is the single most impactful optimization.
```sql
WITH __resource_filter AS (
SELECT fingerprint
FROM signoz_traces.distributed_traces_v3_resource
WHERE (simpleJSONExtractString(labels, 'service.name') = 'myservice')
AND seen_at_ts_bucket_start BETWEEN $start_timestamp - 1800 AND $end_timestamp
)
SELECT ...
FROM signoz_traces.distributed_signoz_index_v3
WHERE resource_fingerprint GLOBAL IN __resource_filter
AND ...
```
- Multiple resource filters: chain with AND in the CTE WHERE clause.
- Use `simpleJSONExtractString(labels, '<key>')` to extract resource attribute values.
### 2. Timestamp Bucketing
**Always** include `ts_bucket_start` filter alongside `timestamp` filter. Data is bucketed in 30-minute (1800-second) intervals.
```sql
WHERE timestamp BETWEEN {{.start_datetime}} AND {{.end_datetime}}
AND ts_bucket_start BETWEEN $start_timestamp - 1800 AND $end_timestamp
```
The `- 1800` on the start ensures spans at bucket boundaries are not missed.
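To make the arithmetic concrete, a small sketch (plain Unix-second math, nothing SigNoz-specific):
```go
// bucketBounds widens the lower bound by one bucket width so that spans
// whose timestamp falls just after startTS, but whose 30-minute bucket
// started before it, are still scanned.
const bucketWidthSeconds = 1800

func bucketBounds(startTS, endTS uint64) (lo, hi uint64) {
	return startTS - bucketWidthSeconds, endTS
}

// Example: a window starting at 12:10 scans buckets from 11:40 onwards,
// so the 12:00 bucket containing 12:10 spans is included.
```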
### 3. Use Indexed Columns Over Map Access
When a pre-indexed ("selected") column exists, use it instead of map access:
| Instead of | Use |
|---|---|
| `attributes_string['http.route']` | `attribute_string_http$$route` |
| `attributes_string['db.system']` | `attribute_string_db$$system` |
| `attributes_string['rpc.method']` | `attribute_string_rpc$$method` |
| `attributes_string['peer.service']` | `attribute_string_peer$$service` |
| `resources_string['service.name']` | `resource_string_service$$name` |
The naming convention: replace `.` with `$$` in the attribute name and prefix with `attribute_string_`, `attribute_number_`, or `attribute_bool_`.
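A small sketch of that convention (the helper is illustrative, not part of the codebase):
```go
import "strings"

// indexedColumnName maps an attribute name plus data type ("string",
// "number", "bool") to its materialized column, e.g.
//
//	indexedColumnName("http.route", "string") // "attribute_string_http$$route"
//
// Resource attributes take the "resource_" prefix instead of "attribute_".
func indexedColumnName(attrName, dataType string) string {
	return "attribute_" + dataType + "_" + strings.ReplaceAll(attrName, ".", "$$")
}
```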
### 4. Use Pre-extracted Columns
These top-level columns are faster than map access:
- `http_method`, `http_url`, `http_host`
- `db_name`, `db_operation`
- `has_error`, `duration_nano`, `name`, `kind`
- `response_status_code`
---
## Attribute Access Syntax
### Standard (non-indexed) attributes
```sql
attributes_string['http.status_code']
attributes_number['response_time']
attributes_bool['is_error']
```
### Selected (indexed) attributes — direct column names
```sql
attribute_string_http$$route -- for http.route
attribute_number_response$$time -- for response.time
attribute_bool_is$$error -- for is.error
```
### Resource attributes in SELECT / GROUP BY
```sql
resource.service.name::String
resource.environment::String
```
### Resource attributes in WHERE (via CTE)
```sql
simpleJSONExtractString(labels, 'service.name') = 'myservice'
```
### Checking attribute existence
```sql
mapContains(attributes_string, 'http.method')
```
---
## Dashboard Panel Query Templates
### Timeseries Panel
Aggregates data over time intervals for chart visualization.
```sql
WITH __resource_filter AS (
SELECT fingerprint
FROM signoz_traces.distributed_traces_v3_resource
WHERE (simpleJSONExtractString(labels, 'service.name') = '{{service}}')
AND seen_at_ts_bucket_start BETWEEN $start_timestamp - 1800 AND $end_timestamp
)
SELECT
toStartOfInterval(timestamp, INTERVAL 1 MINUTE) AS ts,
toFloat64(count()) AS value
FROM signoz_traces.distributed_signoz_index_v3
WHERE
resource_fingerprint GLOBAL IN __resource_filter AND
timestamp BETWEEN {{.start_datetime}} AND {{.end_datetime}} AND
ts_bucket_start BETWEEN $start_timestamp - 1800 AND $end_timestamp
GROUP BY ts
ORDER BY ts ASC;
```
### Value Panel
Returns a single aggregated number. Wrap the timeseries query and reduce with `avg()`, `sum()`, `min()`, `max()`, or `any()`.
```sql
WITH __resource_filter AS (
SELECT fingerprint
FROM signoz_traces.distributed_traces_v3_resource
WHERE (simpleJSONExtractString(labels, 'service.name') = '{{service}}')
AND seen_at_ts_bucket_start BETWEEN $start_timestamp - 1800 AND $end_timestamp
)
SELECT
avg(value) as value,
any(ts) as ts
FROM (
SELECT
toStartOfInterval(timestamp, INTERVAL 1 MINUTE) AS ts,
toFloat64(count()) AS value
FROM signoz_traces.distributed_signoz_index_v3
WHERE
resource_fingerprint GLOBAL IN __resource_filter AND
timestamp BETWEEN {{.start_datetime}} AND {{.end_datetime}} AND
ts_bucket_start BETWEEN $start_timestamp - 1800 AND $end_timestamp
GROUP BY ts
ORDER BY ts ASC
)
```
### Table Panel
Rows grouped by dimensions. Use `now() as ts` instead of a time interval column.
```sql
WITH __resource_filter AS (
SELECT fingerprint
FROM signoz_traces.distributed_traces_v3_resource
WHERE seen_at_ts_bucket_start BETWEEN $start_timestamp - 1800 AND $end_timestamp
)
SELECT
now() as ts,
resource.service.name::String as `service.name`,
toFloat64(count()) AS value
FROM signoz_traces.distributed_signoz_index_v3
WHERE
resource_fingerprint GLOBAL IN __resource_filter AND
timestamp BETWEEN {{.start_datetime}} AND {{.end_datetime}} AND
ts_bucket_start BETWEEN $start_timestamp - 1800 AND $end_timestamp AND
`service.name` IS NOT NULL
GROUP BY `service.name`, ts
ORDER BY value DESC;
```
---
## Query Examples
### Timeseries — Error spans per service per minute
Shows `has_error` filtering, resource attribute in SELECT, and multi-series grouping.
```sql
WITH __resource_filter AS (
SELECT fingerprint
FROM signoz_traces.distributed_traces_v3_resource
WHERE seen_at_ts_bucket_start BETWEEN $start_timestamp - 1800 AND $end_timestamp
)
SELECT
toStartOfInterval(timestamp, INTERVAL 1 MINUTE) AS ts,
resource.service.name::String as `service.name`,
toFloat64(count()) AS value
FROM signoz_traces.distributed_signoz_index_v3
WHERE
resource_fingerprint GLOBAL IN __resource_filter AND
timestamp BETWEEN {{.start_datetime}} AND {{.end_datetime}} AND
has_error = true AND
`service.name` IS NOT NULL AND
ts_bucket_start BETWEEN $start_timestamp - 1800 AND $end_timestamp
GROUP BY `service.name`, ts
ORDER BY ts ASC;
```
### Value — Average duration of GET requests
Shows the value-panel wrapping pattern (`avg(value)` / `any(ts)`) with a service resource filter.
```sql
WITH __resource_filter AS (
SELECT fingerprint
FROM signoz_traces.distributed_traces_v3_resource
WHERE (simpleJSONExtractString(labels, 'service.name') = 'api-service')
AND seen_at_ts_bucket_start BETWEEN $start_timestamp - 1800 AND $end_timestamp
)
SELECT
avg(value) as value,
any(ts) as ts FROM (
SELECT
toStartOfInterval(timestamp, INTERVAL 1 MINUTE) AS ts,
toFloat64(avg(duration_nano)) AS value
FROM signoz_traces.distributed_signoz_index_v3
WHERE
resource_fingerprint GLOBAL IN __resource_filter AND
timestamp BETWEEN {{.start_datetime}} AND {{.end_datetime}} AND
ts_bucket_start BETWEEN $start_timestamp - 1800 AND $end_timestamp AND
http_method = 'GET'
GROUP BY ts
ORDER BY ts ASC
)
```
### Table — Average duration by HTTP method
Shows `now() as ts` pattern, pre-extracted column usage, and non-null filtering.
```sql
WITH __resource_filter AS (
SELECT fingerprint
FROM signoz_traces.distributed_traces_v3_resource
WHERE (simpleJSONExtractString(labels, 'service.name') = 'api-gateway')
AND seen_at_ts_bucket_start BETWEEN $start_timestamp - 1800 AND $end_timestamp
)
SELECT
now() as ts,
http_method,
toFloat64(avg(duration_nano)) AS avg_duration_nano
FROM signoz_traces.distributed_signoz_index_v3
WHERE
resource_fingerprint GLOBAL IN __resource_filter AND
timestamp BETWEEN {{.start_datetime}} AND {{.end_datetime}} AND
ts_bucket_start BETWEEN $start_timestamp - 1800 AND $end_timestamp AND
http_method IS NOT NULL AND http_method != ''
GROUP BY http_method, ts
ORDER BY avg_duration_nano DESC;
```
### Advanced — Extract values from span events
Shows `arrayFilter`/`arrayMap` pattern for querying the `events` JSON array.
```sql
WITH arrayFilter(x -> JSONExtractString(x, 'name')='Getting customer', events) AS filteredEvents
SELECT toStartOfInterval(timestamp, INTERVAL 1 MINUTE) AS interval,
toFloat64(count()) AS count,
arrayJoin(arrayMap(x -> JSONExtractString(JSONExtractString(x, 'attributeMap'), 'customer_id'), filteredEvents)) AS resultArray
FROM signoz_traces.distributed_signoz_index_v3
WHERE not empty(filteredEvents)
AND timestamp > toUnixTimestamp(now() - INTERVAL 30 MINUTE)
AND ts_bucket_start >= toUInt64(toUnixTimestamp(now() - toIntervalMinute(30))) - 1800
GROUP BY (resultArray, interval) ORDER BY (resultArray, interval) ASC;
```
### Advanced — Average latency between two specific spans
Shows cross-span latency calculation using `minIf()` and indexed service columns.
```sql
SELECT
interval,
round(avg(time_diff), 2) AS result
FROM
(
SELECT
interval,
traceID,
if(startTime1 != 0, if(startTime2 != 0, (toUnixTimestamp64Nano(startTime2) - toUnixTimestamp64Nano(startTime1)) / 1000000, nan), nan) AS time_diff
FROM
(
SELECT
toStartOfInterval(timestamp, toIntervalMinute(1)) AS interval,
traceID,
minIf(timestamp, if(resource_string_service$$name='driver', if(name = '/driver.DriverService/FindNearest', if((resources_string['component']) = 'gRPC', true, false), false), false)) AS startTime1,
minIf(timestamp, if(resource_string_service$$name='route', if(name = 'HTTP GET /route', true, false), false)) AS startTime2
FROM signoz_traces.distributed_signoz_index_v3
WHERE (timestamp BETWEEN {{.start_datetime}} AND {{.end_datetime}})
AND (ts_bucket_start BETWEEN {{.start_timestamp}} - 1800 AND {{.end_timestamp}})
AND (resource_string_service$$name IN ('driver', 'route'))
GROUP BY (interval, traceID)
ORDER BY (interval, traceID) ASC
)
)
WHERE isNaN(time_diff) = 0
GROUP BY interval
ORDER BY interval ASC;
```
---
## SigNoz Dashboard Variables
These template variables are automatically replaced by SigNoz when the query runs:
| Variable | Description |
|---|---|
| `{{.start_datetime}}` | Start of selected time range (DateTime64) |
| `{{.end_datetime}}` | End of selected time range (DateTime64) |
| `$start_timestamp` | Start as Unix timestamp (seconds) |
| `$end_timestamp` | End as Unix timestamp (seconds) |
---
## Query Optimization Checklist
Before finalizing any query, verify:
- [ ] **Resource filter CTE** is present when filtering by resource attributes (service.name, environment, etc.)
- [ ] **`ts_bucket_start`** filter is included alongside `timestamp` filter, with `- 1800` on start
- [ ] **`GLOBAL IN`** is used (not just `IN`) for the resource fingerprint subquery
- [ ] **Indexed columns** are used over map access where available (e.g., `http_method` over `attributes_string['http.method']`)
- [ ] **Pre-extracted columns** are used where available (`has_error`, `duration_nano`, `http_method`, `db_name`, etc.)
- [ ] **`seen_at_ts_bucket_start`** filter is included in the resource CTE
- [ ] Aggregation results are cast with `toFloat64()` for dashboard compatibility
- [ ] For timeseries: results are ordered by time column ASC
- [ ] For table panels: `now() as ts` is used instead of time intervals
- [ ] For value panels: outer query uses `avg(value)` / `any(ts)` pattern


@@ -0,0 +1,37 @@
---
name: commit
description: Create a conventional commit with staged changes
allowed-tools: Bash(git:*)
---
# Create Conventional Commit
Commit staged changes using conventional commit format: `type(scope): description`
## Types
- `feat:` - New feature
- `fix:` - Bug fix
- `chore:` - Maintenance/refactor/tooling
- `test:` - Tests only
- `docs:` - Documentation
## Process
1. Review staged changes: `git diff --cached`
2. Determine type, optional scope, and description (imperative, <70 chars)
3. Commit using HEREDOC:
```bash
git commit -m "$(cat <<'EOF'
type(scope): description
EOF
)"
```
4. Verify: `git log -1`
## Notes
- Description: imperative mood, lowercase, no period
- Body: explain WHY, not WHAT (code shows what). Keep it concise.
- Do not include a Co-Authored-By: Claude line in the commit message; ownership and accountability should remain with the human contributor.
- Do not stage files automatically unless asked to.


@@ -0,0 +1,22 @@
---
description: How to start SigNoz frontend and backend dev servers
---
# Dev Server Setup
Full guide: [development.md](../../docs/contributing/development.md)
## Start Order
1. **Infra**: Ensure the ClickHouse container is running: `docker ps | grep clickhouse`
2. **Backend**: `make go-run-community` (serves at `localhost:8080`)
3. **Frontend**: `cd frontend && yarn install && yarn dev` (serves at `localhost:3301`)
- Requires `frontend/.env` with `FRONTEND_API_ENDPOINT=http://localhost:8080`
- For git worktrees, create `frontend/.env` with `cp frontend/example.env frontend/.env`.
## Verify
- ClickHouse: `curl http://localhost:8123/ping` → "Ok."
- OTel Collector: `curl http://localhost:13133`
- Backend: `curl http://localhost:8080/api/v1/health` → `{"status":"ok"}`
- Frontend: `http://localhost:3301`


@@ -0,0 +1,55 @@
---
name: raise-pr
description: Create a pull request with auto-filled template. Pass 'commit' to commit staged changes first.
allowed-tools: Bash(gh:*, git:*), Read
argument-hint: [commit?]
---
# Raise Pull Request
Create a PR with auto-filled template from commits after origin/main.
## Arguments
- No argument: Create PR with existing commits
- `commit`: Commit staged changes first, then create PR
## Process
1. **If `$ARGUMENTS` is "commit"**: Review staged changes and commit with descriptive message
- Check for staged changes: `git diff --cached --stat`
- If changes exist:
- Review the changes: `git diff --cached`
- Use the commit skill to make the commit, i.e. follow conventional commit practices
- Commit command: `git commit -m "message"`
2. **Analyze commits since origin/main**:
- `git log origin/main..HEAD --pretty=format:"%s%n%b"` - get commit messages
- `git diff origin/main...HEAD --stat` - see changes
3. **Read template**: `.github/pull_request_template.md`
4. **Generate PR**:
- **Title**: Short (<70 chars), from commit messages or main change
- **Body**: Fill template sections based on commits/changes:
- Summary (why/what/approach) - end with "Closes #<issue_number>" if issue number is available from branch name (git branch --show-current)
- Change Type checkboxes
- Bug Context (if applicable)
- Testing Strategy
- Risk Assessment
- Changelog (if user-facing)
- Checklist
5. **Create PR**:
```bash
git push -u origin $(git branch --show-current)
gh pr create --base main --title "..." --body "..."
gh pr view
```
## Notes
- Analyze ALL commit messages from origin/main to HEAD
- Fill template sections based on code analysis
- Leave template sections as they are if you can't determine the content
- Don't stage changes yourself; only commit or push what the user has already staged


@@ -0,0 +1,254 @@
---
name: review
description: Review code changes for bugs, performance issues, and SigNoz convention compliance
allowed-tools: Bash(git:*, gh:*), Read, Glob, Grep
---
# Review Command
Perform a thorough code review following SigNoz's coding conventions and contributing guidelines, and check for any potential bugs introduced.
## Usage
Invoke this command to review code changes, files, or pull requests with actionable and concise feedback.
## Process
1. **Determine scope**:
- Ask user what to review if not specified:
- Specific files or directories
- Current git diff (staged or unstaged)
- Specific PR number or commit range
- All changes since origin/main
2. **Gather context**:
```bash
# For current changes
git diff --cached # Staged changes
git diff # Unstaged changes
# For commit range
git diff origin/main...HEAD # All changes since main
# For last commit only
git diff HEAD~1..HEAD
# For specific PR
gh pr view <number> --json files,additions,deletions
gh pr diff <number>
```
3. **Read all relevant files thoroughly**:
- Use Read tool for modified files
- Understand the context and purpose of changes
- Check surrounding code for context
4. **Review against SigNoz guidelines**:
- **Frontend**: Check [Frontend Guidelines](../../frontend/CONTRIBUTIONS.md)
- **Backend/Architecture**: Check [CLAUDE.md](../CLAUDE.md) for provider pattern, error handling, SQL, REST, and linting conventions
- **General**: Check [Contributing Guidelines](../../CONTRIBUTING.md)
- **Commits**: Verify [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/)
5. **Verify feature intent**:
- Read the PR description, commit message, or linked issue to understand *what* the change claims to do
- Trace the code path end-to-end to confirm the change actually achieves its stated goal
- Check that the happy path works as described
- Identify any scenarios where the feature silently does nothing or produces wrong results
6. **Review for bug introduction**:
- **Regressions**: Does the change break existing behavior? Check callers of modified functions/interfaces
- **Edge cases**: Empty inputs, nil/undefined values, boundary conditions, concurrent access
- **Error paths**: Are all error cases handled? Can errors be swallowed silently?
- **State management**: Are state transitions correct? Can state become inconsistent?
- **Race conditions**: Shared mutable state, async operations, missing locks or guards
- **Type mismatches**: Unsafe casts, implicit conversions, `any` usage hiding real types
7. **Review for performance implications**:
- **Backend**: N+1 queries, missing indexes, unbounded result sets, large allocations in hot paths, unnecessary DB round-trips
- **Frontend**: Unnecessary re-renders from inline objects/functions as props, missing memoization on expensive computations, large bundle imports that should be lazy-loaded, unthrottled event handlers
- **General**: O(n²) or worse algorithms on potentially large datasets, unnecessary network calls, missing pagination or limits
8. **Provide actionable, concise feedback** in structured format
## Review Checklist
For coding conventions and style, refer to the linked guideline docs. This checklist focuses on **review-specific concerns** that guidelines alone don't catch.
### Correctness & Intent
- [ ] Change achieves what the PR/commit/issue describes
- [ ] Happy path works end-to-end
- [ ] Edge cases handled (empty, nil, boundary, concurrent)
- [ ] Error paths don't swallow failures silently
- [ ] No regressions to existing callers of modified code
### Security
- [ ] No exposed secrets, API keys, credentials
- [ ] No sensitive data in logs
- [ ] Input validation at system boundaries
- [ ] Authentication/authorization checked for new endpoints
- [ ] No SQL injection or XSS risks
### Performance
- [ ] No N+1 queries or unbounded result sets
- [ ] No unnecessary re-renders (inline objects/functions as props, missing memoization)
- [ ] No large imports that should be lazy-loaded
- [ ] No O(n²) on potentially large datasets
- [ ] Pagination/limits present where needed
### Testing
- [ ] New functionality has tests
- [ ] Edge cases and error paths tested
- [ ] Tests are deterministic (no flakiness)
### Git/Commits
- [ ] Commit messages follow `type(scope): description` ([Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/))
- [ ] Commits are atomic and logical
## Output Format
Provide feedback in this structured format:
```markdown
## Code Review
**Scope**: [What was reviewed]
**Overall**: [1-2 sentence summary and general sentiment]
---
### 🚨 Critical Issues (Must Fix)
1. **[Category]** `file:line`
**Problem**: [What's wrong]
**Why**: [Why it matters]
**Fix**: [Specific solution]
```[language]
// Example fix if helpful
```
### ⚠️ Suggestions (Should Consider)
1. **[Category]** `file:line`
**Issue**: [What could be improved]
**Suggestion**: [Concrete improvement]
### ✅ Positive Highlights
- [Good practice observed]
- [Well-implemented feature]
---
**References**:
- [Relevant guideline links]
```
## Review Categories
Use these categories for issues:
- **Bug / Regression**: Logic errors, edge cases, race conditions, broken existing behavior
- **Feature Gap**: Change doesn't fully achieve its stated intent
- **Security Risk**: Authentication, authorization, data exposure, injection
- **Performance Issue**: Inefficient queries, unnecessary re-renders, memory leaks, unbounded data
- **Convention Violation**: Style, patterns, architectural guidelines (link to relevant guideline doc)
- **Code Quality**: Complexity, duplication, naming, type safety
- **Testing**: Missing tests, inadequate coverage, flaky tests
## Example Review
```markdown
## Code Review
**Scope**: Changes in `frontend/src/pages/TraceDetail/` (3 files, 245 additions)
**Overall**: Good implementation of pagination feature. Found 2 critical issues and 3 suggestions.
---
### 🚨 Critical Issues (Must Fix)
1. **Security Risk** `TraceList.tsx:45`
**Problem**: API token exposed in client-side code
**Why**: Security vulnerability - tokens should never be in frontend
**Fix**: Move authentication to backend, use session-based auth
2. **Performance Issue** `TraceList.tsx:89`
**Problem**: Inline function passed as prop causes unnecessary re-renders
**Why**: Violates frontend guideline, degrades performance with large lists
**Fix**:
```typescript
const handleTraceClick = useCallback((traceId: string) => {
navigate(`/trace/${traceId}`);
}, [navigate]);
```
### ⚠️ Suggestions (Should Consider)
1. **Code Quality** `TraceList.tsx:120-180`
**Issue**: Function exceeds 40-line guideline
**Suggestion**: Extract into smaller functions:
- `filterTracesByTimeRange()`
- `aggregateMetrics()`
- `renderChartData()`
2. **Type Safety** `types.ts:23`
**Issue**: Using `any` for trace attributes
**Suggestion**: Define proper interface for TraceAttributes
3. **Convention** `TraceList.tsx:12`
**Issue**: File imports not organized
**Suggestion**: Let simple-import-sort auto-organize (will happen on save)
### ✅ Positive Highlights
- Excellent use of virtualization for large trace lists
- Good error boundary implementation
- Well-structured component hierarchy
- Comprehensive unit tests included
---
**References**:
- [Frontend Guidelines](../../frontend/CONTRIBUTIONS.md)
- [useCallback best practices](https://kentcdodds.com/blog/usememo-and-usecallback)
```
## Tone Guidelines
- **Be respectful**: Focus on code, not the person
- **Be specific**: Always reference exact file:line locations
- **Be concise**: Get to the point, avoid verbosity
- **Be actionable**: Every comment should have a clear resolution path
- **Be balanced**: Acknowledge good work alongside issues
- **Be educational**: Explain why something is an issue, link to guidelines
## Priority Levels
1. **Critical (🚨)**: Security, bugs, data corruption, crashes
2. **Important (⚠️)**: Performance, maintainability, convention violations
3. **Nice to have (💡)**: Style preferences, micro-optimizations
## Important Notes
- **Reference specific guidelines** from docs when applicable
- **Provide code examples** for fixes when helpful
- **Ask questions** if code intent is unclear
- **Link to external resources** for educational value
- **Distinguish** must-fix from should-consider
- **Be concise** - reviewers value their time
## Critical Rules
- **NEVER** be vague - always specify file and line number
- **NEVER** just point out problems - suggest solutions
- **NEVER** review without reading the actual code
- **ALWAYS** check against SigNoz's specific guidelines
- **ALWAYS** provide rationale for each comment
- **ALWAYS** be constructive and respectful
## Reference Documents
- [Frontend Guidelines](../../frontend/CONTRIBUTIONS.md) - React, TypeScript, styling
- [Contributing Guidelines](../../CONTRIBUTING.md) - Workflow, commit conventions
- [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) - Commit format
- [CLAUDE.md](../CLAUDE.md) - Project architecture and conventions


@@ -0,0 +1,14 @@
---
description: Architecture context for the traces module (query building, waterfall, flamegraph)
---
# Traces Module
Read [traces-module.md](./traces-module.md) for full context before working on this module. It covers:
- Storage schema (`signoz_index_v3`, `trace_summary`) and gotchas
- API endpoints (Query Range V5, waterfall, flamegraph, funnels)
- Query building system (statement builder, field mapper, trace operators)
- Backend processing pipelines and caching
- Frontend component map, state flow, and API hooks
- Key file index for backend and frontend


@@ -0,0 +1,191 @@
# SigNoz Traces Module — Developer Guide
## Overview
```
App → OTel SDK → OTLP Receiver → [signozspanmetrics, batch] →
ClickHouse Exporter → signoz_traces DB → Query Service (Go) → Frontend (React)
```
**Query Service layers**: HTTP Handlers (`http_handler.go`) → Querier (`querier.go`, orchestration/cache) → Statement Builders (`pkg/telemetrytraces/`) → ClickHouse
---
## Storage Schema
All tables in `signoz_traces` database. Schema DDL: `signoz-otel-collector/cmd/signozschemamigrator/schema_migrator/traces_migrations.go`.
### `distributed_signoz_index_v3` — Primary span storage
- **Engine**: MergeTree (plain — **no deduplication**, use `DISTINCT ON (span_id)`)
- **Key columns**: `ts_bucket_start` (UInt64), `timestamp` (DateTime64(9)), `trace_id` (FixedString(32)), `span_id`, `duration_nano`, `has_error`, `name`, `resource_string_service$$name`, `attributes_string`, `events`, `links`
- **ORDER BY**: `(ts_bucket_start, resource_fingerprint, has_error, name, timestamp)`
- **Partition**: `toDate(timestamp)`
### `distributed_trace_summary` — Pre-aggregated trace metadata
- **Engine**: AggregatingMergeTree. Columns: `trace_id`, `start` (min), `end` (max), `num_spans` (sum)
- **Populated by** `trace_summary_mv` — materialized view on `signoz_index_v3` that triggers per-batch, inserting partial aggregates. ClickHouse merges them asynchronously.
- **CRITICAL**: Always query with `GROUP BY trace_id` (never raw `SELECT *`)
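A sketch of the safe access pattern, with the SQL embedded in Go; it assumes the min/max/sum column semantics listed above (note `end` is quoted since it collides with a keyword):
```go
// traceSummaryQuery always aggregates with GROUP BY trace_id so that
// ClickHouse merges the partial aggregate rows written per batch by
// trace_summary_mv.
const traceSummaryQuery = `
SELECT
    trace_id,
    min(start)     AS start,
    max("end")     AS "end",
    sum(num_spans) AS num_spans
FROM signoz_traces.distributed_trace_summary
WHERE trace_id = ?
GROUP BY trace_id`
```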
### Other tables
`distributed_tag_attributes_v2` (attribute keys for autocomplete), `distributed_span_attributes_keys` (which attributes exist)
---
## API Endpoints
### 1. Query Range V5 — `POST /api/v5/query_range`
Primary query endpoint for traces (also logs/metrics). Supports query builder queries, trace operators, aggregations, filters, group by. See [QUERY_RANGE_API.md](../../docs/modules/QUERY_RANGE_API.md).
Key files: `pkg/telemetrytraces/statement_builder.go`, `trace_operator_statement_builder.go`, `pkg/querier/trace_operator_query.go`
### 2. Waterfall — `POST /api/v2/traces/waterfall/{traceId}`
Handler: `http_handler.go:1748` → Reader: `clickhouseReader/reader.go:873`
**Request**: `{ "selectedSpanId", "isSelectedSpanIDUnCollapsed", "uncollapsedSpans[]" }`
**Response**: `{ startTimestampMillis, endTimestampMillis, totalSpansCount, totalErrorSpansCount, rootServiceName, rootServiceEntryPoint, serviceNameToTotalDurationMap, spans[], hasMissingSpans, uncollapsedSpans[] }`
**Pipeline**:
1. Query `trace_summary` for time range → query `signoz_index_v3` with `DISTINCT ON (span_id)` and `ts_bucket_start >= start - 1800`
2. Build span tree: map spanID→Span, link parent via CHILD_OF refs, create Missing Span nodes for absent parents
3. Cache (key: `getWaterfallSpansForTraceWithMetadata-{traceID}`, TTL: 5 min, skipped if trace end within flux interval of 2 min from now)
4. `GetSelectedSpans` (`tracedetail/waterfall.go:159`): find path to selectedSpanID, DFS into uncollapsed nodes, compute SubTreeNodeCount, return sliding window of **500 spans** (40% before, 60% after selected)
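A sketch of the sliding-window arithmetic from step 4 (the real `GetSelectedSpans` additionally tracks uncollapsed subtrees and collapse state):
```go
// windowBounds returns the slice of the flattened span list to serve:
// a 500-span window with ~40% before and ~60% after the selected span,
// clamped to the list edges.
func windowBounds(selectedIdx, total int) (lo, hi int) {
	const window = 500
	lo = selectedIdx - window*40/100 // 200 spans before the selected span
	if lo < 0 {
		lo = 0
	}
	hi = lo + window
	if hi > total {
		hi = total
	}
	return lo, hi
}
```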
### 3. Flamegraph — `POST /api/v2/traces/flamegraph/{traceId}`
Handler: `http_handler.go:1781` → Reader: `reader.go:1091`
**Request**: `{ "selectedSpanId" }` **Response**: `{ startTimestampMillis, endTimestampMillis, durationNano, spans[][] }`
Same DB query as waterfall, but uses **BFS** (not DFS) to organize by level. Returns `[][]*FlamegraphSpan` (lighter model, no tagMap). Level sampling when > 100 spans/level: top 5 by latency + 50 timestamp buckets (2 each). Window: **50 levels**.
### 4. Other APIs
- **Trace Fields**: `GET/POST /api/v2/traces/fields` (handlers at `http_handler.go:4912-4921`)
- **Trace Funnels**: CRUD at `/api/v1/trace-funnels/*`, analytics at `/{funnel_id}/analytics/*` (`pkg/modules/tracefunnel/`)
---
## Query Building System
### Query Structure
```go
QueryBuilderQuery[TraceAggregation]{
Signal: SignalTraces,
Filter: &Filter{Expression: "service.name = 'api' AND duration_nano > 1000000"},
Aggregations: []TraceAggregation{{Expression: "count()", Alias: "total"}},
GroupBy: []GroupByKey{{TelemetryFieldKey: {Name: "service.name"}}},
}
```
### SQL Generation (`statement_builder.go`)
1. **Field resolution** via `field_mapper.go` — maps intrinsic (`trace_id`, `duration_nano`), calculated (`http_method`, `has_error`), and attribute fields (`attributes_string[...]`) to CH columns. Example: `"service.name"` → `"resource_string_service$$name"`
2. **Time optimization** — if `trace_id` in filter, queries `trace_summary` first to narrow range
3. **Filter building** via `condition_builder.go` — supports `=`, `!=`, `IN`, `LIKE`, `ILIKE`, `EXISTS`, `CONTAINS`, comparisons
4. **Build SQL** by request type: `buildListQuery()`, `buildTimeSeriesQuery()`, `buildScalarQuery()`, `buildTraceQuery()`
### Trace Operators (`trace_operator_statement_builder.go`)
Combines multiple trace queries with set operations. Parses expression (e.g., `"A AND B"`) → builds CTE per query via `trace_operator_cte_builder.go` → combines with INTERSECT (AND), UNION (OR), EXCEPT (NOT).
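A sketch of the operator-to-set-operation mapping (CTE construction and expression nesting are omitted; names are illustrative):
```go
import "fmt"

// setOperation maps a parsed trace operator to the ClickHouse set
// operation used to combine two per-query CTEs on trace_id. ClickHouse
// requires UNION to spell out ALL/DISTINCT explicitly.
func setOperation(op string) string {
	switch op {
	case "AND":
		return "INTERSECT"
	case "OR":
		return "UNION DISTINCT"
	case "NOT":
		return "EXCEPT"
	}
	return ""
}

// combine renders the combined trace_id set for a binary expression,
// e.g. combine("__A", "__B", "AND").
func combine(leftCTE, rightCTE, op string) string {
	return fmt.Sprintf("SELECT trace_id FROM %s %s SELECT trace_id FROM %s",
		leftCTE, setOperation(op), rightCTE)
}
```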
---
## Frontend (Trace Detail)
### State Flow
```
TraceDetailsV2 (pages/TraceDetailV2/TraceDetailV2.tsx)
├── uncollapsedNodes, interestedSpanId, selectedSpan
├── useGetTraceV2 → waterfall API
├── TraceMetadata (totalSpans, errors, duration)
├── TraceFlamegraph (separate API via useGetTraceFlamegraph)
└── TraceWaterfall → Success → TableV3 (virtualized)
```
### Components
| Component | File |
|-----------|------|
| TraceDetailsV2 | `pages/TraceDetailV2/TraceDetailV2.tsx` |
| TraceMetadata | `container/TraceMetadata/TraceMetadata.tsx` |
| TraceWaterfall | `container/TraceWaterfall/TraceWaterfall.tsx` |
| Success (waterfall table) | `container/TraceWaterfall/.../Success/Success.tsx` |
| Filters | `container/TraceWaterfall/.../Filters/Filters.tsx` |
| TraceFlamegraph | `container/PaginatedTraceFlamegraph/PaginatedTraceFlamegraph.tsx` |
| SpanDetailsDrawer | `container/SpanDetailsDrawer/SpanDetailsDrawer.tsx` |
### API Hooks
| Hook | API |
|------|-----|
| `useGetTraceV2` (`hooks/trace/useGetTraceV2.tsx`) | POST waterfall |
| `useGetTraceFlamegraph` (`hooks/trace/useGetTraceFlamegraph.tsx`) | POST flamegraph |
Adapter: `api/trace/getTraceV2.tsx`. Types: `types/api/trace/getTraceV2.ts`.
---
## Known Gotchas
1. **trace_summary**: Always `GROUP BY trace_id` — raw reads return partial unmerged rows
2. **signoz_index_v3 dedup**: Plain MergeTree. Waterfall uses `DISTINCT ON (span_id)`. Flamegraph relies on map-key dedup (keeps last-seen)
3. **Flux interval**: Traces ending within 2 min of now bypass cache → fresh DB query every interaction
4. **SubTreeNodeCount**: Self-inclusive (root count = total tree nodes)
5. **Waterfall pagination**: Max 500 spans per response (sliding window). Frontend virtual-scrolls and re-fetches at edges
---
## Extending the Module
- **New calculated field**: Define in `telemetrytraces/const.go` → map in `field_mapper.go` → optionally update `condition_builder.go`
- **New API endpoint**: Handler in `http_handler.go` → register route → implement in ClickHouseReader or Querier
- **New aggregation**: Update `querybuilder/agg_expr_rewriter.go`
- **New trace operator**: Update `grammar/TraceOperatorGrammar.g4` + `trace_operator_cte_builder.go`
---
## Key File Index
### Backend
| File | Purpose |
|------|---------|
| `pkg/telemetrytraces/statement_builder.go` | Trace SQL generation |
| `pkg/telemetrytraces/field_mapper.go` | Field → CH column mapping |
| `pkg/telemetrytraces/condition_builder.go` | WHERE clause building |
| `pkg/telemetrytraces/trace_operator_statement_builder.go` | Trace operator SQL |
| `pkg/telemetrytraces/trace_operator_cte_builder.go` | Trace operator CTEs |
| `pkg/querier/trace_operator_query.go` | Trace operator execution |
| `pkg/query-service/app/http_handler.go:1748` | Waterfall handler |
| `pkg/query-service/app/http_handler.go:1781` | Flamegraph handler |
| `pkg/query-service/app/clickhouseReader/reader.go:831` | GetSpansForTrace |
| `pkg/query-service/app/clickhouseReader/reader.go:873` | Waterfall logic |
| `pkg/query-service/app/clickhouseReader/reader.go:1091` | Flamegraph logic |
| `pkg/query-service/app/traces/tracedetail/waterfall.go` | DFS traversal, span selection |
| `pkg/query-service/app/traces/tracedetail/flamegraph.go` | BFS traversal, level sampling |
| `pkg/query-service/model/response.go:279` | Span model (waterfall) |
| `pkg/query-service/model/response.go:305` | FlamegraphSpan model |
| `pkg/query-service/model/trace.go` | SpanItemV2, TraceSummary |
| `pkg/query-service/model/cacheable.go` | Cache structures |
### Frontend
| File | Purpose |
|------|---------|
| `pages/TraceDetailV2/TraceDetailV2.tsx` | Page container |
| `container/TraceWaterfall/.../Success/Success.tsx` | Waterfall table |
| `container/PaginatedTraceFlamegraph/PaginatedTraceFlamegraph.tsx` | Flamegraph |
| `hooks/trace/useGetTraceV2.tsx` | Waterfall API hook |
| `hooks/trace/useGetTraceFlamegraph.tsx` | Flamegraph API hook |
| `api/trace/getTraceV2.tsx` | API adapter |
| `types/api/trace/getTraceV2.ts` | TypeScript types |
### Schema DDL
| File | Purpose |
|------|---------|
| `signozschemamigrator/.../traces_migrations.go:10-134` | signoz_index_v3 |
| `signozschemamigrator/.../traces_migrations.go:271-348` | trace_summary + MV |


@@ -0,0 +1,980 @@
# Query Range API (V5) - Developer Guide
This document provides a comprehensive guide to the Query Range API (V5), which is the primary query endpoint for traces, logs, and metrics in SigNoz. It covers architecture, request/response models, code flows, and implementation details.
## Table of Contents
1. [Overview](#overview)
2. [API Endpoint](#api-endpoint)
3. [Request/Response Models](#requestresponse-models)
4. [Query Types](#query-types)
5. [Request Types](#request-types)
6. [Code Flow](#code-flow)
7. [Key Components](#key-components)
8. [Query Execution](#query-execution)
9. [Caching](#caching)
10. [Result Processing](#result-processing)
11. [Performance Considerations](#performance-considerations)
12. [Extending the API](#extending-the-api)
---
## Overview
The Query Range API (V5) is the unified query endpoint for all telemetry signals (traces, logs, metrics) in SigNoz. It provides:
- **Unified Interface**: Single endpoint for all signal types
- **Query Builder**: Visual query builder support
- **Multiple Query Types**: Builder queries, PromQL, ClickHouse SQL, Formulas, Trace Operators
- **Flexible Response Types**: Time series, scalar, raw data, trace-specific
- **Advanced Features**: Aggregations, filters, group by, ordering, pagination
- **Caching**: Intelligent caching for performance
### Key Technologies
- **Backend**: Go (Golang)
- **Storage**: ClickHouse (columnar database)
- **Query Language**: Custom query builder + PromQL + ClickHouse SQL
- **Protocol**: HTTP/REST API
---
## API Endpoint
### Endpoint Details
**URL**: `POST /api/v5/query_range`
**Handler**: `QuerierAPI.QueryRange` → `querier.QueryRange`
**Location**:
- Handler: `pkg/querier/querier.go:122`
- Route Registration: `pkg/query-service/app/http_handler.go:480`
**Authentication**: Requires ViewAccess permission
**Content-Type**: `application/json`
### Request Flow
```
HTTP Request (POST /api/v5/query_range)
    ↓
HTTP Handler (QuerierAPI.QueryRange)
    ↓
Querier.QueryRange (pkg/querier/querier.go)
    ↓
Query Execution (Statement Builders → ClickHouse)
    ↓
Result Processing & Merging
    ↓
HTTP Response (QueryRangeResponse)
```
---
## Request/Response Models
### Request Model
**Location**: `pkg/types/querybuildertypes/querybuildertypesv5/req.go`
```go
type QueryRangeRequest struct {
Start uint64 // Start timestamp (milliseconds)
End uint64 // End timestamp (milliseconds)
RequestType RequestType // Response type (TimeSeries, Scalar, Raw, Trace)
Variables map[string]VariableItem // Template variables
CompositeQuery CompositeQuery // Container for queries
NoCache bool // Skip cache flag
}
```
### Composite Query
```go
type CompositeQuery struct {
Queries []QueryEnvelope // Array of queries to execute
}
```
### Query Envelope
```go
type QueryEnvelope struct {
Type QueryType // Query type (Builder, PromQL, ClickHouseSQL, Formula, TraceOperator)
Spec any // Query specification (type-specific)
}
```
### Response Model
**Location**: `pkg/types/querybuildertypes/querybuildertypesv5/req.go`
```go
type QueryRangeResponse struct {
Type RequestType // Response type
Data QueryData // Query results
Meta ExecStats // Execution statistics
Warning *QueryWarnData // Warnings (if any)
QBEvent *QBEvent // Query builder event metadata
}
type QueryData struct {
Results []any // Array of result objects (type depends on RequestType)
}
type ExecStats struct {
RowsScanned uint64 // Total rows scanned
BytesScanned uint64 // Total bytes scanned
DurationMS uint64 // Query duration in milliseconds
StepIntervals map[string]uint64 // Step intervals per query
}
```
---
## Query Types
The API supports multiple query types, each with its own specification format.
### 1. Builder Query (`QueryTypeBuilder`)
Visual query builder queries. Supports traces, logs, and metrics.
**Spec Type**: `QueryBuilderQuery[T]` where T is:
- `TraceAggregation` for traces
- `LogAggregation` for logs
- `MetricAggregation` for metrics
**Example**:
```go
QueryBuilderQuery[TraceAggregation] {
Name: "query_name",
Signal: SignalTraces,
Filter: &Filter {
Expression: "service.name = 'api' AND duration_nano > 1000000",
},
Aggregations: []TraceAggregation {
{Expression: "count()", Alias: "total"},
{Expression: "avg(duration_nano)", Alias: "avg_duration"},
},
GroupBy: []GroupByKey {...},
Order: []OrderBy {...},
Limit: 100,
}
```
**Key Files**:
- Traces: `pkg/telemetrytraces/statement_builder.go`
- Logs: `pkg/telemetrylogs/statement_builder.go`
- Metrics: `pkg/telemetrymetrics/statement_builder.go`
### 2. PromQL Query (`QueryTypePromQL`)
Prometheus Query Language queries for metrics.
**Spec Type**: `PromQuery`
**Example**:
```go
PromQuery {
Query: "rate(http_requests_total[5m])",
Step: Step{Duration: time.Minute},
}
```
**Key Files**: `pkg/querier/promql_query.go`
### 3. ClickHouse SQL Query (`QueryTypeClickHouseSQL`)
Direct ClickHouse SQL queries.
**Spec Type**: `ClickHouseQuery`
**Example**:
```go
ClickHouseQuery {
Query: "SELECT count() FROM signoz_traces.distributed_signoz_index_v3 WHERE ...",
}
```
**Key Files**: `pkg/querier/ch_sql_query.go`
### 4. Formula Query (`QueryTypeFormula`)
Mathematical formulas combining other queries.
**Spec Type**: `QueryBuilderFormula`
**Example**:
```go
QueryBuilderFormula {
Expression: "A / B * 100", // A and B are query names
}
```
**Key Files**: `pkg/querier/formula_query.go`
### 5. Trace Operator Query (`QueryTypeTraceOperator`)
Set operations on trace queries (AND, OR, NOT).
**Spec Type**: `QueryBuilderTraceOperator`
**Example**:
```go
QueryBuilderTraceOperator {
Expression: "A AND B", // A and B are query names
Filter: &Filter {...},
}
```
**Key Files**:
- `pkg/telemetrytraces/trace_operator_statement_builder.go`
- `pkg/querier/trace_operator_query.go`
---
## Request Types
The `RequestType` determines the format of the response data.
### 1. `RequestTypeTimeSeries`
Returns time series data for charts.
**Response Format**: `TimeSeriesData`
```go
type TimeSeriesData struct {
QueryName string
Aggregations []AggregationBucket
}
type AggregationBucket struct {
Index int
Series []TimeSeries
Alias string
Meta AggregationMeta
}
type TimeSeries struct {
Labels map[string]string
Values []TimeSeriesValue
}
type TimeSeriesValue struct {
Timestamp int64
Value float64
}
```
**Use Case**: Line charts, bar charts, area charts
### 2. `RequestTypeScalar`
Returns a single scalar value.
**Response Format**: `ScalarData`
```go
type ScalarData struct {
QueryName string
Data []ScalarValue
}
type ScalarValue struct {
Timestamp int64
Value float64
}
```
**Use Case**: Single value displays, stat panels
### 3. `RequestTypeRaw`
Returns raw data rows.
**Response Format**: `RawData`
```go
type RawData struct {
QueryName string
Columns []string
Rows []RawDataRow
}
type RawDataRow struct {
Timestamp time.Time
Data map[string]any
}
```
**Use Case**: Tables, logs viewer, trace lists
### 4. `RequestTypeTrace`
Returns trace-specific data structure.
**Response Format**: Trace-specific format (see traces documentation)
**Use Case**: Trace-specific visualizations
---
## Code Flow
### Complete Request Flow
```
1. HTTP Request
POST /api/v5/query_range
Body: QueryRangeRequest JSON
2. HTTP Handler
QuerierAPI.QueryRange (pkg/querier/querier.go)
- Validates request
- Extracts organization ID from auth context
3. Querier.QueryRange (pkg/querier/querier.go:122)
- Validates QueryRangeRequest
- Processes each query in CompositeQuery.Queries
- Identifies dependencies (e.g., trace operators, formulas)
- Calculates step intervals
- Fetches metric temporality if needed
4. Query Creation
For each QueryEnvelope:
a. Builder Query:
- newBuilderQuery() creates builderQuery instance
- Selects appropriate statement builder based on signal:
* Traces → traceStmtBuilder
* Logs → logStmtBuilder
* Metrics → metricStmtBuilder or meterStmtBuilder
b. PromQL Query:
- newPromqlQuery() creates promqlQuery instance
- Uses Prometheus engine
c. ClickHouse SQL Query:
- newchSQLQuery() creates chSQLQuery instance
- Direct SQL execution
d. Formula Query:
- newFormulaQuery() creates formulaQuery instance
- References other queries by name
e. Trace Operator Query:
- newTraceOperatorQuery() creates traceOperatorQuery instance
- Uses traceOperatorStmtBuilder
5. Statement Building (for Builder queries)
StatementBuilder.Build()
- Resolves field keys from metadata store
- Builds SQL based on request type:
* RequestTypeRaw → buildListQuery()
* RequestTypeTimeSeries → buildTimeSeriesQuery()
* RequestTypeScalar → buildScalarQuery()
* RequestTypeTrace → buildTraceQuery()
- Returns SQL statement with arguments
6. Query Execution
Query.Execute()
- Executes SQL/query against ClickHouse or Prometheus
- Processes results into response format
- Returns Result with data and statistics
7. Caching (if applicable)
- Checks bucket cache for time series queries
- Executes queries for missing time ranges
- Merges cached and fresh results
8. Result Processing
querier.run()
- Executes all queries (with dependency resolution)
- Collects results and warnings
- Merges results from multiple queries
9. Post-Processing
postProcessResults()
- Applies formulas if present
- Handles variable substitution
- Formats results for response
10. HTTP Response
- Returns QueryRangeResponse with results
- Includes execution statistics
- Includes warnings if any
```
### Key Decision Points
1. **Query Type Selection**: Based on `QueryEnvelope.Type`
2. **Signal Selection**: For builder queries, based on `Signal` field
3. **Request Type Handling**: Different SQL generation for different request types
4. **Caching Strategy**: Only for time series queries with valid fingerprints
5. **Dependency Resolution**: Trace operators and formulas resolve dependencies first
---
## Key Components
### 1. Querier
**Location**: `pkg/querier/querier.go`
**Purpose**: Orchestrates query execution, caching, and result merging
**Key Methods**:
- `QueryRange()`: Main entry point for query execution
- `run()`: Executes queries and merges results
- `executeWithCache()`: Handles caching logic
- `mergeResults()`: Merges cached and fresh results
- `postProcessResults()`: Applies formulas and variable substitution
**Key Features**:
- Query orchestration across multiple query types
- Intelligent caching with bucket-based strategy
- Result merging from multiple queries
- Formula evaluation
- Time range optimization
- Step interval calculation and validation
### 2. Statement Builder Interface
**Location**: `pkg/types/querybuildertypes/querybuildertypesv5/`
**Purpose**: Converts query builder specifications into executable queries
**Interface**:
```go
type StatementBuilder[T any] interface {
Build(
ctx context.Context,
start uint64,
end uint64,
requestType RequestType,
query QueryBuilderQuery[T],
variables map[string]VariableItem,
) (*Statement, error)
}
```
**Implementations**:
- `traceQueryStatementBuilder` - Traces (`pkg/telemetrytraces/statement_builder.go`)
- `logQueryStatementBuilder` - Logs (`pkg/telemetrylogs/statement_builder.go`)
- `metricQueryStatementBuilder` - Metrics (`pkg/telemetrymetrics/statement_builder.go`)
**Key Features**:
- Field resolution via metadata store
- SQL generation for different request types
- Filter, aggregation, group by, ordering support
- Time range optimization
### 3. Query Interface
**Location**: `pkg/types/querybuildertypes/querybuildertypesv5/`
**Purpose**: Represents an executable query
**Interface**:
```go
type Query interface {
Execute(ctx context.Context) (*Result, error)
Fingerprint() string // For caching
Window() (uint64, uint64) // Time range
}
```
**Implementations**:
- `builderQuery[T]` - Builder queries (`pkg/querier/builder_query.go`)
- `promqlQuery` - PromQL queries (`pkg/querier/promql_query.go`)
- `chSQLQuery` - ClickHouse SQL queries (`pkg/querier/ch_sql_query.go`)
- `formulaQuery` - Formula queries (`pkg/querier/formula_query.go`)
- `traceOperatorQuery` - Trace operator queries (`pkg/querier/trace_operator_query.go`)
### 4. Telemetry Store
**Location**: `pkg/telemetrystore/`
**Purpose**: Abstraction layer for ClickHouse database access
**Key Methods**:
- `Query()`: Execute SQL query
- `QueryRow()`: Execute query returning single row
- `Select()`: Execute query returning multiple rows
**Implementation**: `clickhouseTelemetryStore` (`pkg/telemetrystore/clickhousetelemetrystore/`)
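As a rough sketch of the shape this abstraction takes, assuming it surfaces the clickhouse-go driver types (the signatures below are an approximation, not the verbatim interface):
```go
import (
	"context"

	"github.com/ClickHouse/clickhouse-go/v2/lib/driver"
)

// Approximate surface of the telemetry store described above.
type TelemetryStore interface {
	Query(ctx context.Context, query string, args ...any) (driver.Rows, error)
	QueryRow(ctx context.Context, query string, args ...any) driver.Row
	Select(ctx context.Context, dest any, query string, args ...any) error
}
```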
### 5. Metadata Store
**Location**: `pkg/types/telemetrytypes/`
**Purpose**: Provides metadata about available fields, keys, and attributes
**Key Methods**:
- `GetKeysMulti()`: Get field keys for multiple selectors
- `FetchTemporalityMulti()`: Get metric temporality information
**Implementation**: `telemetryMetadataStore` (`pkg/telemetrymetadata/`)
### 6. Bucket Cache
**Location**: `pkg/querier/`
**Purpose**: Caches query results by time buckets for performance
**Key Methods**:
- `GetMissRanges()`: Get time ranges not in cache
- `Put()`: Store query result in cache
**Features**:
- Bucket-based caching (aligned to step intervals)
- Automatic cache invalidation
- Parallel query execution for missing ranges
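To make the bucket strategy concrete, here is a small self-contained sketch of miss-range computation: given sorted, non-overlapping cached ranges, return the sub-ranges that still need fresh queries. This illustrates the idea, not the actual code in `pkg/querier`:
```go
package main

import "fmt"

type timeRange struct{ fromMS, toMS uint64 }

// missRanges returns the portions of [start, end) not covered by the
// cached ranges (assumed sorted and non-overlapping).
func missRanges(cached []timeRange, start, end uint64) []timeRange {
	var misses []timeRange
	cursor := start
	for _, c := range cached {
		if c.toMS <= cursor || c.fromMS >= end {
			continue // entirely outside the remaining window
		}
		if c.fromMS > cursor {
			misses = append(misses, timeRange{cursor, c.fromMS})
		}
		if c.toMS > cursor {
			cursor = c.toMS
		}
	}
	if cursor < end {
		misses = append(misses, timeRange{cursor, end})
	}
	return misses
}

func main() {
	cached := []timeRange{{1000, 2000}, {3000, 4000}}
	fmt.Println(missRanges(cached, 500, 4500))
	// Output: [{500 1000} {2000 3000} {4000 4500}]
}
```
Each miss range can then be queried in parallel and the results merged back with the cached buckets.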
---
## Query Execution
### Builder Query Execution
**Location**: `pkg/querier/builder_query.go`
**Process**:
1. Statement builder generates SQL
2. SQL executed against ClickHouse via TelemetryStore
3. Results processed based on RequestType:
- TimeSeries: Grouped by time buckets and labels
- Scalar: Single value extraction
- Raw: Row-by-row processing
4. Statistics collected (rows scanned, bytes scanned, duration)
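As an illustration of the time-series branch in step 3, a minimal sketch that folds flat (timestamp, labels, value) rows into label-keyed series; the real processing is richer (typed label values, statistics), so treat this as the shape of the idea:
```go
package main

import "fmt"

type row struct {
	tsMS   uint64
	labels string // canonical label string, e.g. "service.name=api"
	value  float64
}

type point struct {
	tsMS  uint64
	value float64
}

// groupSeries buckets rows into one series per distinct label set.
func groupSeries(rows []row) map[string][]point {
	series := make(map[string][]point)
	for _, r := range rows {
		series[r.labels] = append(series[r.labels], point{r.tsMS, r.value})
	}
	return series
}

func main() {
	rows := []row{
		{1000, "service.name=api", 4},
		{2000, "service.name=api", 6},
		{1000, "service.name=web", 1},
	}
	fmt.Println(groupSeries(rows)) // two series keyed by label set
}
```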
### PromQL Query Execution
**Location**: `pkg/querier/promql_query.go`
**Process**:
1. Query parsed by Prometheus engine
2. Executed against Prometheus-compatible data
3. Results converted to QueryRangeResponse format
### ClickHouse SQL Query Execution
**Location**: `pkg/querier/ch_sql_query.go`
**Process**:
1. Variables substituted into the SQL query
2. SQL executed directly against ClickHouse
3. Results processed based on RequestType
### Formula Query Execution
**Location**: `pkg/querier/formula_query.go`
**Process**:
1. Referenced queries executed first
2. Formula expression evaluated using govaluate
3. Results computed from query results
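A self-contained sketch of step 2: parse the expression once with govaluate, then evaluate it with the referenced query results bound as parameters. The real implementation aligns series and labels across queries before evaluating point by point:
```go
package main

import (
	"fmt"

	"github.com/Knetic/govaluate"
)

func main() {
	// Formula "C" referencing queries A and B, as in the patterns below.
	expr, err := govaluate.NewEvaluableExpression("A / B * 100")
	if err != nil {
		panic(err)
	}
	// One timestamp-aligned pair of values from the referenced results.
	result, err := expr.Evaluate(map[string]interface{}{
		"A": 25.0,  // e.g. error count
		"B": 500.0, // e.g. total count
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(result) // 5
}
```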
### Trace Operator Query Execution
**Location**: `pkg/querier/trace_operator_query.go`
**Process**:
1. Expression parsed to find dependencies
2. Referenced queries executed
3. Set operations applied (INTERSECT, UNION, EXCEPT)
4. Results combined
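Conceptually, step 3 reduces to set operations over the trace-ID sets produced by the referenced queries. A self-contained sketch of INTERSECT (UNION and EXCEPT follow the same pattern):
```go
package main

import "fmt"

// intersect returns trace IDs present in both result sets; a toy model
// of the trace operator's INTERSECT step (no dedup, for brevity).
func intersect(a, b []string) []string {
	inA := make(map[string]struct{}, len(a))
	for _, id := range a {
		inA[id] = struct{}{}
	}
	var out []string
	for _, id := range b {
		if _, ok := inA[id]; ok {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	a := []string{"trace-1", "trace-2", "trace-3"}
	b := []string{"trace-2", "trace-3", "trace-4"}
	fmt.Println(intersect(a, b)) // [trace-2 trace-3]
}
```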
---
## Caching
### Caching Strategy
**Location**: `pkg/querier/querier.go:642`
**When Caching Applies**:
- Time series queries only
- Queries with valid fingerprints
- `NoCache` flag not set
**How It Works**:
1. Query fingerprint generated (covers the query's structure: signal, step, aggregations, filters, group by; see Cache Key Generation below)
2. Cache checked for existing results
3. Missing time ranges identified
4. Queries executed only for missing ranges (parallel execution)
5. Fresh results merged with cached results
6. Merged result stored in cache
### Cache Key Generation
**Location**: `pkg/querier/builder_query.go:52`
The fingerprint includes:
- Signal type
- Source type
- Step interval
- Aggregations
- Filters
- Group by fields
The time range is excluded from the fingerprint; it feeds into the cache key's bucket lookup instead, so the same query over a different window can reuse previously cached buckets.
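A sketch of how such a fingerprint might be assembled; the field list, separator, and hash below are illustrative, not the exact scheme at `builder_query.go:52`:
```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// fingerprint joins the cache-relevant parts of a query (everything
// except the time range) and hashes them into a stable key.
func fingerprint(signal, source, step string, aggs, filters, groupBy []string) string {
	parts := []string{
		"signal=" + signal,
		"source=" + source,
		"step=" + step,
		"aggs=" + strings.Join(aggs, ","),
		"filters=" + strings.Join(filters, ","),
		"groupby=" + strings.Join(groupBy, ","),
	}
	h := fnv.New64a()
	h.Write([]byte(strings.Join(parts, "&")))
	return fmt.Sprintf("%x", h.Sum64())
}

func main() {
	fmt.Println(fingerprint(
		"traces", "spans", "60s",
		[]string{"count()"},
		[]string{"service.name = 'api'"},
		[]string{"service.name"},
	))
}
```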
### Cache Benefits
- **Performance**: Avoids re-executing identical queries
- **Efficiency**: Only queries missing time ranges
- **Parallelism**: Multiple missing ranges queried in parallel
---
## Result Processing
### Result Merging
**Location**: `pkg/querier/querier.go:795`
**Process**:
1. Results from multiple queries collected
2. For time series: Series merged by labels
3. For raw data: Rows combined
4. Statistics aggregated (rows scanned, bytes scanned, duration)
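A compact sketch of the label-keyed merge for time series, with fresh values winning on timestamp collisions; the real `mergeResults` also aggregates statistics and handles non-series data:
```go
package main

import "fmt"

// series maps timestamp (ms) -> value for one label set.
type series map[uint64]float64

// mergeSeries overlays fresh points onto cached ones per label key.
func mergeSeries(cached, fresh map[string]series) map[string]series {
	out := make(map[string]series)
	for labels, s := range cached {
		out[labels] = series{}
		for ts, v := range s {
			out[labels][ts] = v
		}
	}
	for labels, s := range fresh {
		if _, ok := out[labels]; !ok {
			out[labels] = series{}
		}
		for ts, v := range s {
			out[labels][ts] = v // fresh overwrites cached
		}
	}
	return out
}

func main() {
	cached := map[string]series{"service.name=api": {1000: 4, 2000: 6}}
	fresh := map[string]series{"service.name=api": {2000: 7, 3000: 9}}
	fmt.Println(mergeSeries(cached, fresh))
	// map[service.name=api:map[1000:4 2000:7 3000:9]]
}
```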
### Formula Evaluation
**Location**: `pkg/querier/formula_query.go`
**Process**:
1. Formula expression parsed
2. Referenced query results retrieved
3. Expression evaluated using govaluate library
4. Result computed and formatted
### Variable Substitution
**Location**: `pkg/querier/querier.go`
**Process**:
1. Variables extracted from request
2. Variable values substituted in queries
3. Applied to filters, aggregations, and other query parts
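A toy, string-level illustration of the substitution step for a filter expression; the real implementation handles typing, quoting, and multi-value variables:
```go
package main

import (
	"fmt"
	"strings"
)

// substitute replaces $name tokens with pre-rendered variable values.
// Real code must escape per type; this sketch assumes safe inputs.
func substitute(expr string, vars map[string]string) string {
	for name, value := range vars {
		expr = strings.ReplaceAll(expr, "$"+name, value)
	}
	return expr
}

func main() {
	expr := "service.name = $service AND duration_nano > $threshold"
	vars := map[string]string{"service": "'api'", "threshold": "1000000"}
	fmt.Println(substitute(expr, vars))
	// service.name = 'api' AND duration_nano > 1000000
}
```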
---
## Performance Considerations
### Query Optimization
1. **Time Range Optimization**:
- For trace queries with `trace_id` filter, query `trace_summary` first to narrow time range
- Use appropriate time ranges to limit data scanned
2. **Step Interval Calculation**:
   - Automatic step interval calculation based on the time range (see the sketch after this list)
   - Minimum step interval enforcement
   - Warnings for suboptimal intervals
3. **Index Usage**:
- Queries use time bucket columns (`ts_bucket_start`) for efficient filtering
- Proper filter placement for index utilization
4. **Limit Enforcement**:
- Raw data queries should include limits
- Pagination support via offset/cursor
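For item 2, a hedged sketch of a typical step heuristic: divide the window by a target point budget, then clamp to a minimum. The querier's actual rules may differ:
```go
package main

import (
	"fmt"
	"time"
)

// recommendStep aims for at most maxPoints samples per series and
// never returns a step below minStep.
func recommendStep(window time.Duration, maxPoints int, minStep time.Duration) time.Duration {
	step := window / time.Duration(maxPoints)
	if step < minStep {
		step = minStep
	}
	return step.Round(time.Second) // align buckets to whole seconds
}

func main() {
	fmt.Println(recommendStep(6*time.Hour, 300, time.Minute)) // 1m12s
}
```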
### Best Practices
1. **Use Query Builder**: Prefer query builder over raw SQL for better optimization
2. **Limit Time Ranges**: Always specify reasonable time ranges
3. **Use Aggregations**: For large datasets, use aggregations instead of raw data
4. **Cache Awareness**: Be mindful of cache TTLs when testing
5. **Parallel Queries**: Multiple independent queries execute in parallel
6. **Step Intervals**: Let system calculate optimal step intervals
### Monitoring
Execution statistics are included in the response:
- `RowsScanned`: Total rows scanned
- `BytesScanned`: Total bytes scanned
- `DurationMS`: Query execution time
- `StepIntervals`: Step intervals per query
---
## Extending the API
### Adding a New Query Type
1. **Define Query Type** (`pkg/types/querybuildertypes/querybuildertypesv5/query.go`):
```go
const (
QueryTypeMyNewType QueryType = "my_new_type"
)
```
2. **Define Query Spec**:
```go
type MyNewQuerySpec struct {
Name string
// ... your fields
}
```
3. **Update QueryEnvelope Unmarshaling** (`pkg/types/querybuildertypes/querybuildertypesv5/query.go`):
```go
case QueryTypeMyNewType:
var spec MyNewQuerySpec
if err := UnmarshalJSONWithContext(shadow.Spec, &spec, "my new query spec"); err != nil {
return wrapUnmarshalError(err, "invalid my new query spec: %v", err)
}
q.Spec = spec
```
4. **Implement Query Interface** (`pkg/querier/my_new_query.go`):
```go
type myNewQuery struct {
spec MyNewQuerySpec
// ... other fields
}
func (q *myNewQuery) Execute(ctx context.Context) (*qbtypes.Result, error) {
// Implementation
}
func (q *myNewQuery) Fingerprint() string {
// Generate fingerprint for caching
}
func (q *myNewQuery) Window() (uint64, uint64) {
// Return time range
}
```
5. **Update Querier** (`pkg/querier/querier.go`):
```go
case QueryTypeMyNewType:
myQuery, ok := query.Spec.(MyNewQuerySpec)
if !ok {
return nil, errors.NewInvalidInputf(...)
}
queries[myQuery.Name] = newMyNewQuery(myQuery, ...)
```
### Adding a New Request Type
1. **Define Request Type** (`pkg/types/querybuildertypes/querybuildertypesv5/req.go`):
```go
const (
RequestTypeMyNewType RequestType = "my_new_type"
)
```
2. **Update Statement Builders**: Add handling in the `Build()` method (see the fragment after this list)
3. **Update Query Execution**: Add result processing for new type
4. **Update Response Models**: Add response data structure
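For step 2, the change typically lands as a new branch in each statement builder's `Build` dispatch; a hedged fragment (the new helper name is hypothetical):
```go
// Inside a statement builder's Build(); illustrative only.
switch requestType {
case RequestTypeRaw:
	return b.buildListQuery(ctx, query, start, end, variables)
// ... existing request types
case RequestTypeMyNewType:
	return b.buildMyNewTypeQuery(ctx, query, start, end, variables) // hypothetical helper
}
```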
### Adding a New Aggregation Function
1. **Update Aggregation Rewriter** (`pkg/querybuilder/agg_expr_rewriter.go`):
```go
func (r *aggExprRewriter) RewriteAggregation(expr string) (string, error) {
if strings.HasPrefix(expr, "my_function(") {
// Parse arguments
// Return ClickHouse SQL expression
return "myClickHouseFunction(...)", nil
}
// ... existing functions
}
```
2. **Update Documentation**: Document the new function
---
## Common Patterns
### Pattern 1: Simple Time Series Query
```go
req := qbtypes.QueryRangeRequest{
Start: startMs,
End: endMs,
RequestType: qbtypes.RequestTypeTimeSeries,
CompositeQuery: qbtypes.CompositeQuery{
Queries: []qbtypes.QueryEnvelope{
{
Type: qbtypes.QueryTypeBuilder,
Spec: qbtypes.QueryBuilderQuery[qbtypes.MetricAggregation]{
Name: "A",
Signal: telemetrytypes.SignalMetrics,
Aggregations: []qbtypes.MetricAggregation{
{Expression: "sum(rate)", Alias: "total"},
},
StepInterval: qbtypes.Step{Duration: time.Minute},
},
},
},
},
}
```
### Pattern 2: Query with Filter and Group By
```go
req := qbtypes.QueryRangeRequest{
Start: startMs,
End: endMs,
RequestType: qbtypes.RequestTypeTimeSeries,
CompositeQuery: qbtypes.CompositeQuery{
Queries: []qbtypes.QueryEnvelope{
{
Type: qbtypes.QueryTypeBuilder,
Spec: qbtypes.QueryBuilderQuery[qbtypes.TraceAggregation]{
Name: "A",
Signal: telemetrytypes.SignalTraces,
Filter: &qbtypes.Filter{
Expression: "service.name = 'api' AND duration_nano > 1000000",
},
Aggregations: []qbtypes.TraceAggregation{
{Expression: "count()", Alias: "total"},
},
GroupBy: []qbtypes.GroupByKey{
{TelemetryFieldKey: telemetrytypes.TelemetryFieldKey{
Name: "service.name",
FieldContext: telemetrytypes.FieldContextResource,
}},
},
},
},
},
},
}
```
### Pattern 3: Formula Query
```go
req := qbtypes.QueryRangeRequest{
Start: startMs,
End: endMs,
RequestType: qbtypes.RequestTypeTimeSeries,
CompositeQuery: qbtypes.CompositeQuery{
Queries: []qbtypes.QueryEnvelope{
{
Type: qbtypes.QueryTypeBuilder,
Spec: qbtypes.QueryBuilderQuery[qbtypes.MetricAggregation]{
Name: "A",
// ... query A definition
},
},
{
Type: qbtypes.QueryTypeBuilder,
Spec: qbtypes.QueryBuilderQuery[qbtypes.MetricAggregation]{
Name: "B",
// ... query B definition
},
},
{
Type: qbtypes.QueryTypeFormula,
Spec: qbtypes.QueryBuilderFormula{
Name: "C",
Expression: "A / B * 100",
},
},
},
},
}
```
---
## Testing
### Unit Tests
- `pkg/querier/querier_test.go` - Querier tests
- `pkg/querier/builder_query_test.go` - Builder query tests
- `pkg/querier/formula_query_test.go` - Formula query tests
### Integration Tests
- `tests/integration/` - End-to-end API tests
### Running Tests
```bash
# Run all querier tests
go test ./pkg/querier/...
# Run with verbose output
go test -v ./pkg/querier/...
# Run specific test
go test -v ./pkg/querier/ -run TestQueryRange
```
---
## Debugging
### Enable Debug Logging
```go
// In querier.go
q.logger.DebugContext(ctx, "Executing query",
"query", queryName,
"start", start,
"end", end)
```
### Common Issues
1. **Query Not Found**: Check query name matches in CompositeQuery
2. **SQL Errors**: Check generated SQL in logs, verify ClickHouse syntax
3. **Performance**: Check execution statistics, optimize time ranges
4. **Cache Issues**: Set `NoCache: true` to bypass cache
5. **Formula Errors**: Check formula expression syntax and referenced query names
---
## References
### Key Files
- `pkg/querier/querier.go` - Main query orchestration
- `pkg/querier/builder_query.go` - Builder query execution
- `pkg/types/querybuildertypes/querybuildertypesv5/` - Request/response models
- `pkg/telemetrystore/` - ClickHouse interface
- `pkg/telemetrymetadata/` - Metadata store
### Signal-Specific Documentation
- [Traces Module](./TRACES_MODULE.md) - Trace-specific details
- Logs module documentation (when available)
- Metrics module documentation (when available)
### Related Documentation
- [ClickHouse Documentation](https://clickhouse.com/docs)
- [PromQL Documentation](https://prometheus.io/docs/prometheus/latest/querying/basics/)
---
## Contributing
When contributing to the Query Range API:
1. **Follow Existing Patterns**: Match the style of existing query types
2. **Add Tests**: Include unit tests for new functionality
3. **Update Documentation**: Update this doc for significant changes
4. **Consider Performance**: Optimize queries and use caching appropriately
5. **Handle Errors**: Provide meaningful error messages
For questions or help, reach out to the maintainers or open an issue.