Compare commits

...

2 Commits

| Author | SHA1 | Message | Date |
|--------|------|---------|------|
| Nikhil Soni | 08c53fe7e8 | docs: add few modules implemtation details (Generated by claude code) | 2026-01-27 22:33:49 +05:30 |
| Nikhil Soni | c1fac00d2e | feat: add claude.md and github commands | 2026-01-27 22:33:12 +05:30 |
7 changed files with 2516 additions and 0 deletions

.claude/CLAUDE.md (new file)

@@ -0,0 +1,136 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
SigNoz is an open-source observability platform (APM, logs, metrics, traces) built on OpenTelemetry and ClickHouse. It provides a unified solution for monitoring applications with features including distributed tracing, log management, metrics dashboards, and alerting.
## Build and Development Commands
### Development Environment Setup
```bash
make devenv-up # Start ClickHouse and OTel Collector for local dev
make devenv-clickhouse # Start only ClickHouse
make devenv-signoz-otel-collector # Start only OTel Collector
make devenv-clickhouse-clean # Clean ClickHouse data
```
### Backend (Go)
```bash
make go-run-community # Run community backend server
make go-run-enterprise # Run enterprise backend server
make go-test # Run all Go unit tests
go test -race ./pkg/... # Run tests for specific package
go test -race ./pkg/querier/... # Example: run querier tests
```
### Integration Tests (Python)
```bash
cd tests/integration
uv sync # Install dependencies
make py-test-setup # Start test environment (keep running with --reuse)
make py-test # Run all integration tests
make py-test-teardown # Stop test environment
# Run specific test
uv run pytest --basetemp=./tmp/ -vv --reuse src/<suite>/<file>.py::test_name
```
### Code Quality
```bash
# Go linting (golangci-lint)
golangci-lint run
# Python formatting/linting
make py-fmt # Format with black
make py-lint # Run isort, autoflake, pylint
```
### OpenAPI Generation
```bash
go run cmd/enterprise/*.go generate openapi
```
## Architecture Overview
### Backend Structure
The Go backend follows a **provider pattern** for dependency injection:
- **`pkg/signoz/`** - IoC container that wires all providers together
- **`pkg/modules/`** - Business logic modules (user, organization, dashboard, etc.)
- **`pkg/<provider>/`** - Provider implementations following consistent structure:
- `<name>.go` - Interface definition
- `config.go` - Configuration (implements `factory.Config`)
- `<implname><name>/provider.go` - Implementation
- `<name>test/` - Mock implementations for testing
### Key Packages
- **`pkg/querier/`** - Query engine for telemetry data (logs, traces, metrics)
- **`pkg/telemetrystore/`** - ClickHouse telemetry storage interface
- **`pkg/sqlstore/`** - Relational database (SQLite/PostgreSQL) for metadata
- **`pkg/apiserver/`** - HTTP API server with OpenAPI integration
- **`pkg/alertmanager/`** - Alert management
- **`pkg/authn/`, `pkg/authz/`** - Authentication and authorization
- **`pkg/flagger/`** - Feature flags (OpenFeature-based)
- **`pkg/errors/`** - Structured error handling
### Enterprise vs Community
- **`cmd/community/`** - Community edition entry point
- **`cmd/enterprise/`** - Enterprise edition entry point
- **`ee/`** - Enterprise-only features
## Code Conventions
### Error Handling
Use the custom `pkg/errors` package instead of standard library:
```go
errors.New(typ, code, message) // Instead of errors.New()
errors.Newf(typ, code, message, args...) // Instead of fmt.Errorf()
errors.Wrapf(err, typ, code, msg) // Wrap with context
```
Define domain-specific error codes:
```go
var CodeThingNotFound = errors.MustNewCode("thing_not_found")
```
### HTTP Handlers
Handlers are thin adapters in modules that:
1. Extract auth context from request
2. Decode request body using `binding` package
3. Call module functions
4. Return responses using `render` package
Register routes in `pkg/apiserver/signozapiserver/` with `handler.New()` and `OpenAPIDef`.
### SQL/Database
- Use Bun ORM via `sqlstore.BunDBCtx(ctx)`
- Star schema with `organizations` as central entity
- All tables have `id`, `created_at`, `updated_at`, `org_id` columns
- Write idempotent migrations in `pkg/sqlmigration/`
- No `ON DELETE CASCADE` constraints - handle cascading deletes in application logic
### REST Endpoints
- Use plural resource names: `/v1/organizations`, `/v1/users`
- Use `me` for current user/org: `/v1/organizations/me/users`
- Follow RESTful conventions for CRUD operations
### Linting Rules (from .golangci.yml)
- Don't use `errors` package - use `pkg/errors`
- Don't use `zap` logger - use `slog`
- Don't use `fmt.Errorf` or `fmt.Print*`
## Testing
### Unit Tests
- Run with race detector: `go test -race ./...`
- Provider mocks are in `<provider>test/` packages
### Integration Tests
- Located in `tests/integration/`
- Use pytest with testcontainers
- Files prefixed with numbers for execution order (e.g., `01_database.py`)
- Always use `--reuse` flag during development
- Fixtures in `tests/integration/fixtures/`


@@ -0,0 +1,36 @@
---
name: commit
description: Create a conventional commit with staged changes
disable-model-invocation: true
allowed-tools: Bash(git:*)
---
# Create Conventional Commit
Commit staged changes using conventional commit format: `type(scope): description`
## Types
- `feat:` - New feature
- `fix:` - Bug fix
- `chore:` - Maintenance/refactor/tooling
- `test:` - Tests only
- `docs:` - Documentation
## Process
1. Review staged changes: `git diff --cached`
2. Determine type, optional scope, and description (imperative, <70 chars)
3. Commit using HEREDOC:
```bash
git commit -m "$(cat <<'EOF'
type(scope): description
EOF
)"
```
4. Verify: `git log -1`
## Notes
- Description: imperative mood, lowercase, no period
- Body: explain WHY, not WHAT (code shows what)


@@ -0,0 +1,55 @@
---
name: raise-pr
description: Create a pull request with auto-filled template. Pass 'commit' to commit staged changes first.
disable-model-invocation: true
allowed-tools: Bash(gh:*, git:*), Read
argument-hint: [commit?]
---
# Raise Pull Request
Create a PR with auto-filled template from commits after origin/main.
## Arguments
- No argument: Create PR with existing commits
- `commit`: Commit staged changes first, then create PR
## Process
1. **If `$ARGUMENTS` is "commit"**: Review staged changes and commit with descriptive message
- Check for staged changes: `git diff --cached --stat`
- If changes exist:
- Review the changes: `git diff --cached`
- Create a short and clear commit message based on the changes
- Commit command: `git commit -m "message"`
2. **Analyze commits since origin/main**:
- `git log origin/main..HEAD --pretty=format:"%s%n%b"` - get commit messages
- `git diff origin/main...HEAD --stat` - see changes
3. **Read template**: `.github/pull_request_template.md`
4. **Generate PR**:
- **Title**: Short (<70 chars), from commit messages or main change
- **Body**: Fill template sections based on commits/changes:
    - Summary (why/what/approach) - end with "Closes #<issue_number>" if an issue number can be derived from the branch name (`git branch --show-current`)
- Change Type checkboxes
- Bug Context (if applicable)
- Testing Strategy
- Risk Assessment
- Changelog (if user-facing)
- Checklist
5. **Create PR**:
```bash
git push -u origin $(git branch --show-current)
gh pr create --base main --title "..." --body "..."
gh pr view
```
## Notes
- Analyze ALL commit messages from origin/main to HEAD
- Fill template sections based on code analysis
- Leave PR template sections as-is when their content can't be determined


@@ -0,0 +1,292 @@
# External API Monitoring - Developer Guide
## Overview
External API Monitoring tracks outbound HTTP calls from your services to external APIs. It groups spans by domain (e.g., `api.example.com`) and displays metrics like endpoint count, request rate, error rate, latency, and last seen time.
**Key Requirement**: Spans must have `kind_string = 'Client'`, at least one of `http.url`/`url.full`, and at least one of `net.peer.name`/`server.address`.
---
## Architecture Flow
```
Frontend (DomainList)
→ useListOverview hook
→ POST /api/v1/third-party-apis/overview/list
→ getDomainList handler
→ BuildDomainList (7 queries)
→ QueryRange (ClickHouse)
→ Post-processing (merge semconv, filter IPs)
→ formatDataForTable
→ UI Display
```
---
## Key APIs
### 1. Domain List API
**Endpoint**: `POST /api/v1/third-party-apis/overview/list`
**Request**:
```json
{
"start": 1699123456789, // Unix timestamp (ms)
"end": 1699127056789,
"show_ip": false, // Filter IP addresses
"filter": {
"expression": "kind_string = 'Client' AND service.name = 'api'"
}
}
```
**Response**: Table with columns:
- `net.peer.name` (domain name)
- `endpoints` (count_distinct with fallback: http.url or url.full)
- `rps` (rate())
- `error_rate` (formula: error/total_span * 100)
- `p99` (p99(duration_nano))
- `lastseen` (max(timestamp))
**Handler**: `pkg/query-service/app/http_handler.go::getDomainList()`
---
### 2. Domain Info API
**Endpoint**: `POST /api/v1/third-party-apis/overview/domain`
**Request**: Same as Domain List, but includes `domain` field
**Response**: Endpoint-level metrics for a specific domain
**Handler**: `pkg/query-service/app/http_handler.go::getDomainInfo()`
---
## Query Building
### Location
`pkg/modules/thirdpartyapi/translator.go`
### BuildDomainList() - Creates 7 Sub-queries
1. **endpoints**: `count_distinct(if(http.url exists, http.url, url.full))` - Unique endpoint count (handles both semconv attributes)
2. **lastseen**: `max(timestamp)` - Last access time
3. **rps**: `rate()` - Requests per second
4. **error**: `count() WHERE has_error = true` - Error count
5. **total_span**: `count()` - Total spans (for error rate)
6. **p99**: `p99(duration_nano)` - 99th percentile latency
7. **error_rate**: Formula `(error/total_span)*100`
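The `error_rate` formula above is straightforward to sketch in Go (a minimal illustration; `errorRate` is a hypothetical helper, not the actual query-engine code, which evaluates the formula server-side):

```go
package main

import "fmt"

// errorRate mirrors the error_rate sub-query formula:
// (error count / total span count) * 100.
func errorRate(errorCount, totalSpans float64) float64 {
	if totalSpans == 0 {
		return 0 // avoid division by zero when no spans matched
	}
	return errorCount / totalSpans * 100
}

func main() {
	fmt.Println(errorRate(5, 200)) // 2.5
}
```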
### Base Filter
```go
"(http.url EXISTS OR url.full EXISTS) AND kind_string = 'Client'"
```
### GroupBy
- Groups by `server.address` + `net.peer.name` (dual semconv support)
---
## Key Files
### Frontend
| File | Purpose |
|------|---------|
| `frontend/src/container/ApiMonitoring/Explorer/Domains/DomainList.tsx` | Main list view component |
| `frontend/src/container/ApiMonitoring/Explorer/Domains/DomainDetails/DomainDetails.tsx` | Domain details drawer |
| `frontend/src/hooks/thirdPartyApis/useListOverview.ts` | Data fetching hook |
| `frontend/src/api/thirdPartyApis/listOverview.ts` | API client |
| `frontend/src/container/ApiMonitoring/utils.tsx` | Utilities (formatting, query building) |
### Backend
| File | Purpose |
|------|---------|
| `pkg/query-service/app/http_handler.go` | API handlers (`getDomainList`, `getDomainInfo`) |
| `pkg/modules/thirdpartyapi/translator.go` | Query builder & response processing |
| `pkg/types/thirdpartyapitypes/thirdpartyapi.go` | Request/response types |
---
## Data Tables
### Primary Table
- **Table**: `signoz_traces.distributed_signoz_index_v3`
- **Key Columns**:
- `kind_string` - Filter for `'Client'` spans
- `duration_nano` - For latency calculations
- `has_error` - For error rate
- `timestamp` - For last seen
- `attributes_string` - Map containing `http.url`, `net.peer.name`, etc.
- `resources_string` - Map containing `server.address`, `service.name`, etc.
### Attribute Access
```sql
-- Check existence
mapContains(attributes_string, 'http.url') = 1
-- Get value
attributes_string['http.url']
-- Materialized (if exists)
attribute_string_http$$url
```
---
## Post-Processing
### 1. MergeSemconvColumns()
- Merges `server.address` and `net.peer.name` into single column
- Location: `pkg/modules/thirdpartyapi/translator.go:117`
### 2. FilterIntermediateColumns()
- Removes intermediate formula columns from response
- Location: `pkg/modules/thirdpartyapi/translator.go:70`
### 3. FilterResponse()
- Filters out IP addresses if `show_ip = false`
- Uses `net.ParseIP()` to detect IPs
- Location: `pkg/modules/thirdpartyapi/translator.go:214`
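The IP-filtering step can be sketched as follows (an illustration of the `net.ParseIP()` check described above; `isIPDomain` is a hypothetical name, not the actual function in `translator.go`):

```go
package main

import (
	"fmt"
	"net"
)

// isIPDomain reports whether a "domain" value is actually an IP address
// (v4 or v6), the check used when show_ip = false.
func isIPDomain(domain string) bool {
	return net.ParseIP(domain) != nil
}

func main() {
	domains := []string{"api.example.com", "10.0.0.1", "2001:db8::1"}
	for _, d := range domains {
		if !isIPDomain(d) {
			fmt.Println(d) // only real domains survive the filter
		}
	}
}
```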
---
## Required Attributes
### For Domain Grouping
- `net.peer.name` OR `server.address` (required)
### For Filtering
- `http.url` OR `url.full` (required)
- `kind_string = 'Client'` (required)
### Not Required
- `http.target` - Not used in external API monitoring
### Known Bug
`buildEndpointsQuery()` uses `count_distinct(http.url)`, but the base filter also allows `url.full`. Spans that only carry `url.full` pass the filter yet don't contribute to the endpoint count.
**Fix Needed**: Update aggregation to handle both attributes:
```go
// Current (buggy)
{Expression: "count_distinct(http.url)"}
// Should be
{Expression: "count_distinct(coalesce(http.url, url.full))"}
```
---
## Frontend Data Flow
### 1. Domain List View
```
DomainList component
→ useListOverview({ start, end, show_ip, filter })
→ listOverview API call
→ formatDataForTable(response)
→ Table display
```
### 2. Domain Details View
```
User clicks domain
→ DomainDetails drawer opens
→ Multiple queries:
- DomainMetrics (overview cards)
- AllEndpoints (endpoint table)
- TopErrors (error table)
- EndPointDetails (when endpoint selected)
```
### 3. Data Formatting
- `formatDataForTable()` - Converts API response to table format
- Handles `n/a` values, converts nanoseconds to milliseconds
- Maps column names to display fields
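The nanosecond-to-millisecond conversion and `n/a` handling can be sketched like this (a Go illustration of logic that actually lives in the TypeScript `formatDataForTable()`; names and format are illustrative):

```go
package main

import "fmt"

// formatLatency renders a duration stored in nanoseconds as a
// millisecond display string, falling back to "n/a" when missing.
func formatLatency(durationNano *float64) string {
	if durationNano == nil {
		return "n/a"
	}
	return fmt.Sprintf("%.2fms", *durationNano/1e6)
}

func main() {
	p99 := 125_000_000.0 // 125ms expressed in nanoseconds
	fmt.Println(formatLatency(&p99))
	fmt.Println(formatLatency(nil))
}
```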
---
## Query Examples
### Domain List Query
```sql
SELECT
multiIf(
mapContains(attributes_string, 'server.address'),
attributes_string['server.address'],
mapContains(attributes_string, 'net.peer.name'),
attributes_string['net.peer.name'],
NULL
) AS domain,
count_distinct(attributes_string['http.url']) AS endpoints,
rate() AS rps,
p99(duration_nano) AS p99,
max(timestamp) AS lastseen
FROM signoz_traces.distributed_signoz_index_v3
WHERE
(mapContains(attributes_string, 'http.url') = 1
OR mapContains(attributes_string, 'url.full') = 1)
AND kind_string = 'Client'
AND timestamp >= ? AND timestamp < ?
GROUP BY domain
```
---
## Testing
### Key Test Files
- `frontend/src/container/ApiMonitoring/__tests__/AllEndpointsWidgetV5Migration.test.tsx`
- `frontend/src/container/ApiMonitoring/__tests__/EndpointDropdownV5Migration.test.tsx`
- `pkg/modules/thirdpartyapi/translator_test.go`
### Test Scenarios
1. Domain filtering with both semconv attributes
2. URL handling (http.url vs url.full)
3. IP address filtering
4. Error rate calculation
5. Empty state handling
---
## Common Issues
### Empty State
**Symptom**: No domains shown despite data existing
**Causes**:
1. Missing `net.peer.name` or `server.address`
2. Missing `http.url` or `url.full`
3. Spans not marked as `kind_string = 'Client'`
4. Bug: Only `url.full` present but query uses `count_distinct(http.url)`
### Performance
- Queries use `ts_bucket_start` for time partitioning
- Resource filtering uses separate `distributed_traces_v3_resource` table
- Materialized columns improve performance for common attributes
---
## Quick Start Checklist
- [ ] Understand trace table schema (`signoz_index_v3`)
- [ ] Review `BuildDomainList()` in `translator.go`
- [ ] Check `getDomainList()` handler in `http_handler.go`
- [ ] Review frontend `DomainList.tsx` component
- [ ] Understand semconv attribute mapping (legacy vs current)
- [ ] Test with spans that have required attributes
- [ ] Review post-processing functions (merge, filter)
---
## References
- **Trace Schema**: `pkg/telemetrytraces/field_mapper.go`
- **Query Builder**: `pkg/telemetrytraces/statement_builder.go`
- **API Routes**: `pkg/query-service/app/http_handler.go:2157`
- **Constants**: `pkg/modules/thirdpartyapi/translator.go:14-20`


@@ -0,0 +1,980 @@
# Query Range API (V5) - Developer Guide
This document provides a comprehensive guide to the Query Range API (V5), which is the primary query endpoint for traces, logs, and metrics in SigNoz. It covers architecture, request/response models, code flows, and implementation details.
## Table of Contents
1. [Overview](#overview)
2. [API Endpoint](#api-endpoint)
3. [Request/Response Models](#requestresponse-models)
4. [Query Types](#query-types)
5. [Request Types](#request-types)
6. [Code Flow](#code-flow)
7. [Key Components](#key-components)
8. [Query Execution](#query-execution)
9. [Caching](#caching)
10. [Result Processing](#result-processing)
11. [Performance Considerations](#performance-considerations)
12. [Extending the API](#extending-the-api)
---
## Overview
The Query Range API (V5) is the unified query endpoint for all telemetry signals (traces, logs, metrics) in SigNoz. It provides:
- **Unified Interface**: Single endpoint for all signal types
- **Query Builder**: Visual query builder support
- **Multiple Query Types**: Builder queries, PromQL, ClickHouse SQL, Formulas, Trace Operators
- **Flexible Response Types**: Time series, scalar, raw data, trace-specific
- **Advanced Features**: Aggregations, filters, group by, ordering, pagination
- **Caching**: Intelligent caching for performance
### Key Technologies
- **Backend**: Go (Golang)
- **Storage**: ClickHouse (columnar database)
- **Query Language**: Custom query builder + PromQL + ClickHouse SQL
- **Protocol**: HTTP/REST API
---
## API Endpoint
### Endpoint Details
**URL**: `POST /api/v5/query_range`
**Handler**: `QuerierAPI.QueryRange` → `querier.QueryRange`
**Location**:
- Handler: `pkg/querier/querier.go:122`
- Route Registration: `pkg/query-service/app/http_handler.go:480`
**Authentication**: Requires ViewAccess permission
**Content-Type**: `application/json`
### Request Flow
```
HTTP Request (POST /api/v5/query_range)
HTTP Handler (QuerierAPI.QueryRange)
Querier.QueryRange (pkg/querier/querier.go)
Query Execution (Statement Builders → ClickHouse)
Result Processing & Merging
HTTP Response (QueryRangeResponse)
```
---
## Request/Response Models
### Request Model
**Location**: `pkg/types/querybuildertypes/querybuildertypesv5/req.go`
```go
type QueryRangeRequest struct {
Start uint64 // Start timestamp (milliseconds)
End uint64 // End timestamp (milliseconds)
RequestType RequestType // Response type (TimeSeries, Scalar, Raw, Trace)
Variables map[string]VariableItem // Template variables
CompositeQuery CompositeQuery // Container for queries
NoCache bool // Skip cache flag
}
```
### Composite Query
```go
type CompositeQuery struct {
Queries []QueryEnvelope // Array of queries to execute
}
```
### Query Envelope
```go
type QueryEnvelope struct {
Type QueryType // Query type (Builder, PromQL, ClickHouseSQL, Formula, TraceOperator)
Spec any // Query specification (type-specific)
}
```
### Response Model
**Location**: `pkg/types/querybuildertypes/querybuildertypesv5/req.go`
```go
type QueryRangeResponse struct {
Type RequestType // Response type
Data QueryData // Query results
Meta ExecStats // Execution statistics
Warning *QueryWarnData // Warnings (if any)
QBEvent *QBEvent // Query builder event metadata
}
type QueryData struct {
Results []any // Array of result objects (type depends on RequestType)
}
type ExecStats struct {
RowsScanned uint64 // Total rows scanned
BytesScanned uint64 // Total bytes scanned
DurationMS uint64 // Query duration in milliseconds
StepIntervals map[string]uint64 // Step intervals per query
}
```
---
## Query Types
The API supports multiple query types, each with its own specification format.
### 1. Builder Query (`QueryTypeBuilder`)
Visual query builder queries. Supports traces, logs, and metrics.
**Spec Type**: `QueryBuilderQuery[T]` where T is:
- `TraceAggregation` for traces
- `LogAggregation` for logs
- `MetricAggregation` for metrics
**Example**:
```go
QueryBuilderQuery[TraceAggregation] {
Name: "query_name",
Signal: SignalTraces,
Filter: &Filter {
Expression: "service.name = 'api' AND duration_nano > 1000000",
},
Aggregations: []TraceAggregation {
{Expression: "count()", Alias: "total"},
{Expression: "avg(duration_nano)", Alias: "avg_duration"},
},
GroupBy: []GroupByKey {...},
Order: []OrderBy {...},
Limit: 100,
}
```
**Key Files**:
- Traces: `pkg/telemetrytraces/statement_builder.go`
- Logs: `pkg/telemetrylogs/statement_builder.go`
- Metrics: `pkg/telemetrymetrics/statement_builder.go`
### 2. PromQL Query (`QueryTypePromQL`)
Prometheus Query Language queries for metrics.
**Spec Type**: `PromQuery`
**Example**:
```go
PromQuery {
Query: "rate(http_requests_total[5m])",
Step: Step{Duration: time.Minute},
}
```
**Key Files**: `pkg/querier/promql_query.go`
### 3. ClickHouse SQL Query (`QueryTypeClickHouseSQL`)
Direct ClickHouse SQL queries.
**Spec Type**: `ClickHouseQuery`
**Example**:
```go
ClickHouseQuery {
Query: "SELECT count() FROM signoz_traces.distributed_signoz_index_v3 WHERE ...",
}
```
**Key Files**: `pkg/querier/ch_sql_query.go`
### 4. Formula Query (`QueryTypeFormula`)
Mathematical formulas combining other queries.
**Spec Type**: `QueryBuilderFormula`
**Example**:
```go
QueryBuilderFormula {
Expression: "A / B * 100", // A and B are query names
}
```
**Key Files**: `pkg/querier/formula_query.go`
### 5. Trace Operator Query (`QueryTypeTraceOperator`)
Set operations on trace queries (AND, OR, NOT).
**Spec Type**: `QueryBuilderTraceOperator`
**Example**:
```go
QueryBuilderTraceOperator {
Expression: "A AND B", // A and B are query names
Filter: &Filter {...},
}
```
**Key Files**:
- `pkg/telemetrytraces/trace_operator_statement_builder.go`
- `pkg/querier/trace_operator_query.go`
---
## Request Types
The `RequestType` determines the format of the response data.
### 1. `RequestTypeTimeSeries`
Returns time series data for charts.
**Response Format**: `TimeSeriesData`
```go
type TimeSeriesData struct {
QueryName string
Aggregations []AggregationBucket
}
type AggregationBucket struct {
Index int
Series []TimeSeries
Alias string
Meta AggregationMeta
}
type TimeSeries struct {
Labels map[string]string
Values []TimeSeriesValue
}
type TimeSeriesValue struct {
Timestamp int64
Value float64
}
```
**Use Case**: Line charts, bar charts, area charts
### 2. `RequestTypeScalar`
Returns a single scalar value.
**Response Format**: `ScalarData`
```go
type ScalarData struct {
QueryName string
Data []ScalarValue
}
type ScalarValue struct {
Timestamp int64
Value float64
}
```
**Use Case**: Single value displays, stat panels
### 3. `RequestTypeRaw`
Returns raw data rows.
**Response Format**: `RawData`
```go
type RawData struct {
QueryName string
Columns []string
Rows []RawDataRow
}
type RawDataRow struct {
Timestamp time.Time
Data map[string]any
}
```
**Use Case**: Tables, logs viewer, trace lists
### 4. `RequestTypeTrace`
Returns trace-specific data structure.
**Response Format**: Trace-specific format (see traces documentation)
**Use Case**: Trace-specific visualizations
---
## Code Flow
### Complete Request Flow
```
1. HTTP Request
POST /api/v5/query_range
Body: QueryRangeRequest JSON
2. HTTP Handler
QuerierAPI.QueryRange (pkg/querier/querier.go)
- Validates request
- Extracts organization ID from auth context
3. Querier.QueryRange (pkg/querier/querier.go:122)
- Validates QueryRangeRequest
- Processes each query in CompositeQuery.Queries
- Identifies dependencies (e.g., trace operators, formulas)
- Calculates step intervals
- Fetches metric temporality if needed
4. Query Creation
For each QueryEnvelope:
a. Builder Query:
- newBuilderQuery() creates builderQuery instance
- Selects appropriate statement builder based on signal:
* Traces → traceStmtBuilder
* Logs → logStmtBuilder
* Metrics → metricStmtBuilder or meterStmtBuilder
b. PromQL Query:
- newPromqlQuery() creates promqlQuery instance
- Uses Prometheus engine
c. ClickHouse SQL Query:
- newchSQLQuery() creates chSQLQuery instance
- Direct SQL execution
d. Formula Query:
- newFormulaQuery() creates formulaQuery instance
- References other queries by name
e. Trace Operator Query:
- newTraceOperatorQuery() creates traceOperatorQuery instance
- Uses traceOperatorStmtBuilder
5. Statement Building (for Builder queries)
StatementBuilder.Build()
- Resolves field keys from metadata store
- Builds SQL based on request type:
* RequestTypeRaw → buildListQuery()
* RequestTypeTimeSeries → buildTimeSeriesQuery()
* RequestTypeScalar → buildScalarQuery()
* RequestTypeTrace → buildTraceQuery()
- Returns SQL statement with arguments
6. Query Execution
Query.Execute()
- Executes SQL/query against ClickHouse or Prometheus
- Processes results into response format
- Returns Result with data and statistics
7. Caching (if applicable)
- Checks bucket cache for time series queries
- Executes queries for missing time ranges
- Merges cached and fresh results
8. Result Processing
querier.run()
- Executes all queries (with dependency resolution)
- Collects results and warnings
- Merges results from multiple queries
9. Post-Processing
postProcessResults()
- Applies formulas if present
- Handles variable substitution
- Formats results for response
10. HTTP Response
- Returns QueryRangeResponse with results
- Includes execution statistics
- Includes warnings if any
```
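Step 3's dependency identification (formulas and trace operators reference other queries by name, so dependencies must execute first) can be sketched as a depth-first ordering. This is an illustration only; `execOrder` and its shapes are hypothetical, not the querier's actual scheduler:

```go
package main

import "fmt"

// execOrder returns query names in an order where every query appears
// after all of its dependencies (post-order DFS).
func execOrder(deps map[string][]string, roots []string) []string {
	seen := map[string]bool{}
	var order []string
	var visit func(name string)
	visit = func(name string) {
		if seen[name] {
			return
		}
		seen[name] = true
		for _, d := range deps[name] {
			visit(d) // dependencies are scheduled first
		}
		order = append(order, name)
	}
	for _, r := range roots {
		visit(r)
	}
	return order
}

func main() {
	// F is a formula "A / B * 100"; A and B are builder queries.
	deps := map[string][]string{"F": {"A", "B"}}
	fmt.Println(execOrder(deps, []string{"F"}))
}
```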
### Key Decision Points
1. **Query Type Selection**: Based on `QueryEnvelope.Type`
2. **Signal Selection**: For builder queries, based on `Signal` field
3. **Request Type Handling**: Different SQL generation for different request types
4. **Caching Strategy**: Only for time series queries with valid fingerprints
5. **Dependency Resolution**: Trace operators and formulas resolve dependencies first
---
## Key Components
### 1. Querier
**Location**: `pkg/querier/querier.go`
**Purpose**: Orchestrates query execution, caching, and result merging
**Key Methods**:
- `QueryRange()`: Main entry point for query execution
- `run()`: Executes queries and merges results
- `executeWithCache()`: Handles caching logic
- `mergeResults()`: Merges cached and fresh results
- `postProcessResults()`: Applies formulas and variable substitution
**Key Features**:
- Query orchestration across multiple query types
- Intelligent caching with bucket-based strategy
- Result merging from multiple queries
- Formula evaluation
- Time range optimization
- Step interval calculation and validation
### 2. Statement Builder Interface
**Location**: `pkg/types/querybuildertypes/querybuildertypesv5/`
**Purpose**: Converts query builder specifications into executable queries
**Interface**:
```go
type StatementBuilder[T any] interface {
Build(
ctx context.Context,
start uint64,
end uint64,
requestType RequestType,
query QueryBuilderQuery[T],
variables map[string]VariableItem,
) (*Statement, error)
}
```
**Implementations**:
- `traceQueryStatementBuilder` - Traces (`pkg/telemetrytraces/statement_builder.go`)
- `logQueryStatementBuilder` - Logs (`pkg/telemetrylogs/statement_builder.go`)
- `metricQueryStatementBuilder` - Metrics (`pkg/telemetrymetrics/statement_builder.go`)
**Key Features**:
- Field resolution via metadata store
- SQL generation for different request types
- Filter, aggregation, group by, ordering support
- Time range optimization
### 3. Query Interface
**Location**: `pkg/types/querybuildertypes/querybuildertypesv5/`
**Purpose**: Represents an executable query
**Interface**:
```go
type Query interface {
Execute(ctx context.Context) (*Result, error)
Fingerprint() string // For caching
Window() (uint64, uint64) // Time range
}
```
**Implementations**:
- `builderQuery[T]` - Builder queries (`pkg/querier/builder_query.go`)
- `promqlQuery` - PromQL queries (`pkg/querier/promql_query.go`)
- `chSQLQuery` - ClickHouse SQL queries (`pkg/querier/ch_sql_query.go`)
- `formulaQuery` - Formula queries (`pkg/querier/formula_query.go`)
- `traceOperatorQuery` - Trace operator queries (`pkg/querier/trace_operator_query.go`)
### 4. Telemetry Store
**Location**: `pkg/telemetrystore/`
**Purpose**: Abstraction layer for ClickHouse database access
**Key Methods**:
- `Query()`: Execute SQL query
- `QueryRow()`: Execute query returning single row
- `Select()`: Execute query returning multiple rows
**Implementation**: `clickhouseTelemetryStore` (`pkg/telemetrystore/clickhousetelemetrystore/`)
### 5. Metadata Store
**Location**: `pkg/types/telemetrytypes/`
**Purpose**: Provides metadata about available fields, keys, and attributes
**Key Methods**:
- `GetKeysMulti()`: Get field keys for multiple selectors
- `FetchTemporalityMulti()`: Get metric temporality information
**Implementation**: `telemetryMetadataStore` (`pkg/telemetrymetadata/`)
### 6. Bucket Cache
**Location**: `pkg/querier/`
**Purpose**: Caches query results by time buckets for performance
**Key Methods**:
- `GetMissRanges()`: Get time ranges not in cache
- `Put()`: Store query result in cache
**Features**:
- Bucket-based caching (aligned to step intervals)
- Automatic cache invalidation
- Parallel query execution for missing ranges
---
## Query Execution
### Builder Query Execution
**Location**: `pkg/querier/builder_query.go`
**Process**:
1. Statement builder generates SQL
2. SQL executed against ClickHouse via TelemetryStore
3. Results processed based on RequestType:
- TimeSeries: Grouped by time buckets and labels
- Scalar: Single value extraction
- Raw: Row-by-row processing
4. Statistics collected (rows scanned, bytes scanned, duration)
### PromQL Query Execution
**Location**: `pkg/querier/promql_query.go`
**Process**:
1. Query parsed by Prometheus engine
2. Executed against Prometheus-compatible data
3. Results converted to QueryRangeResponse format
### ClickHouse SQL Query Execution
**Location**: `pkg/querier/ch_sql_query.go`
**Process**:
1. SQL query executed directly
2. Results processed based on RequestType
3. Variable substitution applied
### Formula Query Execution
**Location**: `pkg/querier/formula_query.go`
**Process**:
1. Referenced queries executed first
2. Formula expression evaluated using govaluate
3. Results computed from query results
### Trace Operator Query Execution
**Location**: `pkg/querier/trace_operator_query.go`
**Process**:
1. Expression parsed to find dependencies
2. Referenced queries executed
3. Set operations applied (INTERSECT, UNION, EXCEPT)
4. Results combined
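The set operations in step 3 can be sketched over in-memory trace-ID sets (illustrative only; in practice the engine pushes INTERSECT/UNION/EXCEPT down into SQL, and `applyOperator` is a hypothetical name):

```go
package main

import (
	"fmt"
	"sort"
)

// applyOperator combines two trace-ID sets the way "A AND B",
// "A OR B", and "A NOT B" trace-operator expressions do.
func applyOperator(op string, a, b map[string]bool) []string {
	out := map[string]bool{}
	switch op {
	case "AND": // intersection
		for id := range a {
			if b[id] {
				out[id] = true
			}
		}
	case "OR": // union
		for id := range a {
			out[id] = true
		}
		for id := range b {
			out[id] = true
		}
	case "NOT": // difference: in A but not in B
		for id := range a {
			if !b[id] {
				out[id] = true
			}
		}
	}
	ids := make([]string, 0, len(out))
	for id := range out {
		ids = append(ids, id)
	}
	sort.Strings(ids) // deterministic output
	return ids
}

func main() {
	a := map[string]bool{"t1": true, "t2": true}
	b := map[string]bool{"t2": true, "t3": true}
	fmt.Println(applyOperator("AND", a, b))
}
```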
---
## Caching
### Caching Strategy
**Location**: `pkg/querier/querier.go:642`
**When Caching Applies**:
- Time series queries only
- Queries with valid fingerprints
- `NoCache` flag not set
**How It Works**:
1. Query fingerprint generated (includes query structure, filters, time range)
2. Cache checked for existing results
3. Missing time ranges identified
4. Queries executed only for missing ranges (parallel execution)
5. Fresh results merged with cached results
6. Merged result stored in cache
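Step 3 — identifying the missing time ranges — can be sketched as a sweep over the cached intervals (a simplified illustration of what `GetMissRanges()` does; real buckets align to step intervals, and `missRanges`/`timeRange` are hypothetical names):

```go
package main

import "fmt"

type timeRange struct{ start, end uint64 } // milliseconds

// missRanges returns the sub-ranges of [start, end) not covered by
// cached ranges, which are assumed sorted and non-overlapping.
func missRanges(start, end uint64, cached []timeRange) []timeRange {
	var miss []timeRange
	cur := start
	for _, c := range cached {
		if c.start > cur {
			miss = append(miss, timeRange{cur, c.start}) // gap before this bucket
		}
		if c.end > cur {
			cur = c.end // advance past the cached bucket
		}
	}
	if cur < end {
		miss = append(miss, timeRange{cur, end}) // tail gap
	}
	return miss
}

func main() {
	cached := []timeRange{{1000, 2000}, {3000, 4000}}
	fmt.Println(missRanges(0, 5000, cached))
}
```

Only the returned gaps are executed against ClickHouse (in parallel); everything else comes from cache.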
### Cache Key Generation
**Location**: `pkg/querier/builder_query.go:52`
The fingerprint includes:
- Signal type
- Source type
- Step interval
- Aggregations
- Filters
- Group by fields
- Time range (for cache key, not fingerprint)
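A fingerprint built from those structural parts might look like the following sketch (illustrative only — the actual format and hashing live in `builder_query.go`; `fingerprint` here is a hypothetical helper):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"strings"
)

// fingerprint derives a stable key from the query's structural parts;
// two queries with identical structure share cached buckets.
func fingerprint(signal string, stepSeconds int, aggs []string, filter string, groupBy []string) string {
	key := fmt.Sprintf("%s|%d|%s|%s|%s",
		signal, stepSeconds, strings.Join(aggs, ","), filter, strings.Join(groupBy, ","))
	sum := sha256.Sum256([]byte(key))
	return fmt.Sprintf("%x", sum)[:16] // truncated hex digest
}

func main() {
	fp := fingerprint("traces", 60, []string{"count()"}, "service.name = 'api'", []string{"http.method"})
	fmt.Println(fp)
}
```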
### Cache Benefits
- **Performance**: Avoids re-executing identical queries
- **Efficiency**: Only queries missing time ranges
- **Parallelism**: Multiple missing ranges queried in parallel
---
## Result Processing
### Result Merging
**Location**: `pkg/querier/querier.go:795`
**Process**:
1. Results from multiple queries collected
2. For time series: Series merged by labels
3. For raw data: Rows combined
4. Statistics aggregated (rows scanned, bytes scanned, duration)
### Formula Evaluation
**Location**: `pkg/querier/formula_query.go`
**Process**:
1. Formula expression parsed
2. Referenced query results retrieved
3. Expression evaluated using govaluate library
4. Result computed and formatted
### Variable Substitution
**Location**: `pkg/querier/querier.go`
**Process**:
1. Variables extracted from request
2. Variable values substituted in queries
3. Applied to filters, aggregations, and other query parts
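The substitution step can be sketched as simple string replacement (a deliberately minimal illustration; the real implementation handles quoting and typing of variable values, and `substituteVars` is a hypothetical name):

```go
package main

import (
	"fmt"
	"strings"
)

// substituteVars replaces $variable references in a filter expression
// with their values from the request's Variables map.
func substituteVars(expr string, vars map[string]string) string {
	for name, val := range vars {
		expr = strings.ReplaceAll(expr, "$"+name, val)
	}
	return expr
}

func main() {
	expr := "service.name = '$service' AND duration_nano > $threshold"
	fmt.Println(substituteVars(expr, map[string]string{
		"service":   "api",
		"threshold": "1000000",
	}))
}
```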
---
## Performance Considerations
### Query Optimization
1. **Time Range Optimization**:
- For trace queries with `trace_id` filter, query `trace_summary` first to narrow time range
- Use appropriate time ranges to limit data scanned
2. **Step Interval Calculation**:
- Automatic step interval calculation based on time range
- Minimum step interval enforcement
- Warnings for suboptimal intervals
3. **Index Usage**:
- Queries use time bucket columns (`ts_bucket_start`) for efficient filtering
- Proper filter placement for index utilization
4. **Limit Enforcement**:
- Raw data queries should include limits
- Pagination support via offset/cursor
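Automatic step-interval selection (point 2 above) typically targets a bounded number of data points per series while enforcing a floor. A sketch, with illustrative thresholds that are not SigNoz's actual values:

```go
package main

import "fmt"

// suggestedStep picks a step interval (ms) that yields roughly
// maxPoints data points across the range, never below minStepMS.
func suggestedStep(startMS, endMS uint64) uint64 {
	const (
		maxPoints = 300    // target points per series (illustrative)
		minStepMS = 60_000 // 60s floor (illustrative)
	)
	step := (endMS - startMS) / maxPoints
	if step < minStepMS {
		return minStepMS
	}
	return step
}

func main() {
	// A 24h window: 86,400,000 ms / 300 points = 288,000 ms steps.
	fmt.Println(suggestedStep(0, 86_400_000))
}
```

Requests that pin a step below the computed minimum would trigger the warnings mentioned above.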
### Best Practices
1. **Use Query Builder**: Prefer query builder over raw SQL for better optimization
2. **Limit Time Ranges**: Always specify reasonable time ranges
3. **Use Aggregations**: For large datasets, use aggregations instead of raw data
4. **Cache Awareness**: Be mindful of cache TTLs when testing
5. **Parallel Queries**: Multiple independent queries execute in parallel
6. **Step Intervals**: Let system calculate optimal step intervals
### Monitoring
Execution statistics are included in the response:
- `RowsScanned`: Total rows scanned
- `BytesScanned`: Total bytes scanned
- `DurationMS`: Query execution time
- `StepIntervals`: Step intervals per query
---
## Extending the API
### Adding a New Query Type
1. **Define Query Type** (`pkg/types/querybuildertypes/querybuildertypesv5/query.go`):
```go
const (
QueryTypeMyNewType QueryType = "my_new_type"
)
```
2. **Define Query Spec**:
```go
type MyNewQuerySpec struct {
Name string
// ... your fields
}
```
3. **Update QueryEnvelope Unmarshaling** (`pkg/types/querybuildertypes/querybuildertypesv5/query.go`):
```go
case QueryTypeMyNewType:
var spec MyNewQuerySpec
if err := UnmarshalJSONWithContext(shadow.Spec, &spec, "my new query spec"); err != nil {
return wrapUnmarshalError(err, "invalid my new query spec: %v", err)
}
q.Spec = spec
```
4. **Implement Query Interface** (`pkg/querier/my_new_query.go`):
```go
type myNewQuery struct {
spec MyNewQuerySpec
// ... other fields
}
func (q *myNewQuery) Execute(ctx context.Context) (*qbtypes.Result, error) {
// Implementation
}
func (q *myNewQuery) Fingerprint() string {
// Generate fingerprint for caching
}
func (q *myNewQuery) Window() (uint64, uint64) {
// Return time range
}
```
5. **Update Querier** (`pkg/querier/querier.go`):
```go
case QueryTypeMyNewType:
myQuery, ok := query.Spec.(MyNewQuerySpec)
if !ok {
return nil, errors.NewInvalidInputf(...)
}
queries[myQuery.Name] = newMyNewQuery(myQuery, ...)
```
### Adding a New Request Type
1. **Define Request Type** (`pkg/types/querybuildertypes/querybuildertypesv5/req.go`):
```go
const (
RequestTypeMyNewType RequestType = "my_new_type"
)
```
2. **Update Statement Builders**: Add handling in `Build()` method
3. **Update Query Execution**: Add result processing for new type
4. **Update Response Models**: Add response data structure
### Adding a New Aggregation Function
1. **Update Aggregation Rewriter** (`pkg/querybuilder/agg_expr_rewriter.go`):
```go
func (r *aggExprRewriter) RewriteAggregation(expr string) (string, error) {
if strings.HasPrefix(expr, "my_function(") {
// Parse arguments
// Return ClickHouse SQL expression
return "myClickHouseFunction(...)", nil
}
    // ... existing functions
    return expr, nil
}
```
2. **Update Documentation**: Document the new function
---
## Common Patterns
### Pattern 1: Simple Time Series Query
```go
req := qbtypes.QueryRangeRequest{
Start: startMs,
End: endMs,
RequestType: qbtypes.RequestTypeTimeSeries,
CompositeQuery: qbtypes.CompositeQuery{
Queries: []qbtypes.QueryEnvelope{
{
Type: qbtypes.QueryTypeBuilder,
Spec: qbtypes.QueryBuilderQuery[qbtypes.MetricAggregation]{
Name: "A",
Signal: telemetrytypes.SignalMetrics,
Aggregations: []qbtypes.MetricAggregation{
{Expression: "sum(rate)", Alias: "total"},
},
StepInterval: qbtypes.Step{Duration: time.Minute},
},
},
},
},
}
```
### Pattern 2: Query with Filter and Group By
```go
req := qbtypes.QueryRangeRequest{
Start: startMs,
End: endMs,
RequestType: qbtypes.RequestTypeTimeSeries,
CompositeQuery: qbtypes.CompositeQuery{
Queries: []qbtypes.QueryEnvelope{
{
Type: qbtypes.QueryTypeBuilder,
Spec: qbtypes.QueryBuilderQuery[qbtypes.TraceAggregation]{
Name: "A",
Signal: telemetrytypes.SignalTraces,
Filter: &qbtypes.Filter{
Expression: "service.name = 'api' AND duration_nano > 1000000",
},
Aggregations: []qbtypes.TraceAggregation{
{Expression: "count()", Alias: "total"},
},
GroupBy: []qbtypes.GroupByKey{
{TelemetryFieldKey: telemetrytypes.TelemetryFieldKey{
Name: "service.name",
FieldContext: telemetrytypes.FieldContextResource,
}},
},
},
},
},
},
}
```
### Pattern 3: Formula Query
```go
req := qbtypes.QueryRangeRequest{
Start: startMs,
End: endMs,
RequestType: qbtypes.RequestTypeTimeSeries,
CompositeQuery: qbtypes.CompositeQuery{
Queries: []qbtypes.QueryEnvelope{
{
Type: qbtypes.QueryTypeBuilder,
Spec: qbtypes.QueryBuilderQuery[qbtypes.MetricAggregation]{
Name: "A",
// ... query A definition
},
},
{
Type: qbtypes.QueryTypeBuilder,
Spec: qbtypes.QueryBuilderQuery[qbtypes.MetricAggregation]{
Name: "B",
// ... query B definition
},
},
{
Type: qbtypes.QueryTypeFormula,
Spec: qbtypes.QueryBuilderFormula{
Name: "C",
Expression: "A / B * 100",
},
},
},
},
}
```
---
## Testing
### Unit Tests
- `pkg/querier/querier_test.go` - Querier tests
- `pkg/querier/builder_query_test.go` - Builder query tests
- `pkg/querier/formula_query_test.go` - Formula query tests
### Integration Tests
- `tests/integration/` - End-to-end API tests
### Running Tests
```bash
# Run all querier tests
go test ./pkg/querier/...
# Run with verbose output
go test -v ./pkg/querier/...
# Run specific test
go test -v ./pkg/querier/ -run TestQueryRange
```
---
## Debugging
### Enable Debug Logging
```go
// In querier.go
q.logger.DebugContext(ctx, "Executing query",
"query", queryName,
"start", start,
"end", end)
```
### Common Issues
1. **Query Not Found**: Check query name matches in CompositeQuery
2. **SQL Errors**: Check generated SQL in logs, verify ClickHouse syntax
3. **Performance**: Check execution statistics, optimize time ranges
4. **Cache Issues**: Set `NoCache: true` to bypass cache
5. **Formula Errors**: Check formula expression syntax and referenced query names
---
## References
### Key Files
- `pkg/querier/querier.go` - Main query orchestration
- `pkg/querier/builder_query.go` - Builder query execution
- `pkg/types/querybuildertypes/querybuildertypesv5/` - Request/response models
- `pkg/telemetrystore/` - ClickHouse interface
- `pkg/telemetrymetadata/` - Metadata store
### Signal-Specific Documentation
- [Traces Module](./TRACES_MODULE.md) - Trace-specific details
- Logs module documentation (when available)
- Metrics module documentation (when available)
### Related Documentation
- [ClickHouse Documentation](https://clickhouse.com/docs)
- [PromQL Documentation](https://prometheus.io/docs/prometheus/latest/querying/basics/)
---
## Contributing
When contributing to the Query Range API:
1. **Follow Existing Patterns**: Match the style of existing query types
2. **Add Tests**: Include unit tests for new functionality
3. **Update Documentation**: Update this doc for significant changes
4. **Consider Performance**: Optimize queries and use caching appropriately
5. **Handle Errors**: Provide meaningful error messages
For questions or help, reach out to the maintainers or open an issue.

# SigNoz Span Metrics Processor
The `signozspanmetricsprocessor` is an OpenTelemetry Collector processor that intercepts trace data to generate RED metrics (Rate, Errors, Duration) from spans.
**Location:** `signoz-otel-collector/processor/signozspanmetricsprocessor/`
## Trace Interception
The processor implements `consumer.Traces` interface and sits in the traces pipeline:
```go
func (p *processorImp) ConsumeTraces(ctx context.Context, traces ptrace.Traces) error {
p.lock.Lock()
p.aggregateMetrics(traces)
p.lock.Unlock()
return p.tracesConsumer.ConsumeTraces(ctx, traces) // forward unchanged
}
```
All traces flow through this method. Metrics are aggregated, then traces are forwarded unmodified to the next consumer.
## Metrics Generated
| Metric | Type | Description |
|--------|------|-------------|
| `signoz_latency` | Histogram | Span latency by service/operation/kind/status |
| `signoz_calls_total` | Counter | Call count per service/operation/kind/status |
| `signoz_db_latency_sum/count` | Counter | DB call latency (spans with `db.system` attribute) |
| `signoz_external_call_latency_sum/count` | Counter | External call latency (client spans with remote address) |
### Dimensions
All metrics include these base dimensions:
- `service.name` - from resource attributes
- `operation` - span name
- `span.kind` - SPAN_KIND_SERVER, SPAN_KIND_CLIENT, etc.
- `status.code` - STATUS_CODE_OK, STATUS_CODE_ERROR, etc.
Additional dimensions can be configured.
## Aggregation Flow
```
traces pipeline
┌─────────────────────────────────────────────────────────┐
│ ConsumeTraces() │
│ │ │
│ ▼ │
│ aggregateMetrics(traces) │
│ │ │
│ ├── for each ResourceSpan │
│ │ extract service.name │
│ │ │ │
│ │ ├── for each Span │
│ │ │ │ │
│ │ │ ▼ │
│ │ │ aggregateMetricsForSpan() │
│ │ │ ├── skip stale spans (>24h) │
│ │ │ ├── skip excluded patterns │
│ │ │ ├── calculate latency │
│ │ │ ├── build metric key │
│ │ │ ├── update histograms │
│ │ │ └── cache dimensions │
│ │ │ │
│ ▼ │
│ forward traces to next consumer │
└─────────────────────────────────────────────────────────┘
```
### Periodic Export
A background goroutine exports aggregated metrics on a ticker interval:
```go
go func() {
for {
select {
case <-p.ticker.C:
p.exportMetrics(ctx) // build and send to metrics exporter
}
}
}()
```
## Key Design Features
### 1. Time Bucketing (Delta Temporality)
For delta temporality, metric keys include a time bucket prefix:
```go
if p.config.GetAggregationTemporality() == pmetric.AggregationTemporalityDelta {
p.AddTimeToKeyBuf(span.StartTimestamp().AsTime()) // truncated to interval
}
```
- Spans are grouped by time bucket (default: 1 minute)
- After export, buckets are reset
- Memory-efficient for high-cardinality data
### 2. LRU Dimension Caching
Dimension key-value maps are cached to avoid rebuilding:
```go
if _, has := p.metricKeyToDimensions.Get(k); !has {
p.metricKeyToDimensions.Add(k, p.buildDimensionKVs(...))
}
```
- Configurable cache size (`DimensionsCacheSize`)
- Evicted keys also removed from histograms
### 3. Cardinality Protection
Prevents memory explosion from high cardinality:
```go
if len(p.serviceToOperations) > p.maxNumberOfServicesToTrack {
serviceName = "overflow_service"
}
if len(p.serviceToOperations[serviceName]) > p.maxNumberOfOperationsToTrackPerService {
spanName = "overflow_operation"
}
```
Excess services/operations are aggregated into overflow buckets.
### 4. Exemplars
Trace/span IDs attached to histogram samples for metric-to-trace correlation:
```go
histo.exemplarsData = append(histo.exemplarsData, exemplarData{
traceID: traceID,
spanID: spanID,
value: latency,
})
```
This enables "show me a trace that caused this latency spike" workflows in the UI.
## Configuration Options
| Option | Description | Default |
|--------|-------------|---------|
| `metrics_exporter` | Target exporter for generated metrics | required |
| `latency_histogram_buckets` | Custom histogram bucket boundaries | 2,4,6,8,10,50,100,200,400,800,1000,1400,2000,5000,10000,15000 ms |
| `dimensions` | Additional span/resource attributes to include | [] |
| `dimensions_cache_size` | LRU cache size for dimension maps | 1000 |
| `aggregation_temporality` | cumulative or delta | cumulative |
| `time_bucket_interval` | Bucket interval for delta temporality | 1m |
| `skip_spans_older_than` | Skip stale spans | 24h |
| `max_services_to_track` | Cardinality limit for services | - |
| `max_operations_to_track_per_service` | Cardinality limit for operations | - |
| `exclude_patterns` | Regex patterns to skip spans | [] |
## Pipeline Configuration Example
```yaml
processors:
signozspanmetrics:
metrics_exporter: clickhousemetricswrite
latency_histogram_buckets: [2ms, 4ms, 6ms, 8ms, 10ms, 50ms, 100ms, 200ms]
dimensions:
- name: http.method
- name: http.status_code
dimensions_cache_size: 10000
aggregation_temporality: delta
pipelines:
traces:
receivers: [otlp]
processors: [signozspanmetrics, batch]
exporters: [clickhousetraces]
metrics:
receivers: [otlp]
exporters: [clickhousemetricswrite]
```
The processor sits in the traces pipeline but exports to a metrics pipeline exporter.

# SigNoz Traces Module - Developer Guide
This document provides a comprehensive guide to understanding and contributing to the traces module in SigNoz. It covers architecture, APIs, code flows, and implementation details.
## Table of Contents
1. [Overview](#overview)
2. [Architecture](#architecture)
3. [Data Models](#data-models)
4. [API Endpoints](#api-endpoints)
5. [Code Flows](#code-flows)
6. [Key Components](#key-components)
7. [Query Building System](#query-building-system)
8. [Storage Schema](#storage-schema)
9. [Extending the Traces Module](#extending-the-traces-module)
---
## Overview
The traces module in SigNoz handles distributed tracing data from OpenTelemetry. It provides:
- **Ingestion**: Receives traces via OpenTelemetry Collector
- **Storage**: Stores traces in ClickHouse
- **Querying**: Supports complex queries with filters, aggregations, and trace operators
- **Visualization**: Provides waterfall and flamegraph views
- **Trace Funnels**: Advanced analytics for multi-step trace analysis
### Key Technologies
- **Backend**: Go (Golang)
- **Storage**: ClickHouse (columnar database)
- **Protocol**: OpenTelemetry Protocol (OTLP)
- **Query Language**: Custom query builder + ClickHouse SQL
---
## Architecture
### High-Level Flow
```
Application → OpenTelemetry SDK → OTLP Receiver →
  [Processors: signozspanmetrics, batch] →
  ClickHouse Traces Exporter → ClickHouse Database
                                      ↓
                               Query Service (Go)
                                      ↓
                          Frontend (React/TypeScript)
### Component Architecture
```
┌─────────────────────────────────────────────────────────┐
│ Frontend (React) │
│ - TracesExplorer │
│ - TraceDetail (Waterfall/Flamegraph) │
│ - Query Builder UI │
└────────────────────┬────────────────────────────────────┘
│ HTTP/REST API
┌────────────────────▼────────────────────────────────────┐
│ Query Service (Go) │
│ ┌──────────────────────────────────────────────────┐ │
│ │ HTTP Handlers (http_handler.go) │ │
│ │ - QueryRangeV5 (Main query endpoint) │ │
│ │ - GetWaterfallSpansForTrace │ │
│ │ - GetFlamegraphSpansForTrace │ │
│ │ - Trace Fields API │ │
│ └──────────────────────────────────────────────────┘ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Querier (querier.go) │ │
│ │ - Query orchestration │ │
│ │ - Cache management │ │
│ │ - Result merging │ │
│ └──────────────────────────────────────────────────┘ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Statement Builders │ │
│ │ - traceQueryStatementBuilder │ │
│ │ - traceOperatorStatementBuilder │ │
│ │ - Builds ClickHouse SQL from query specs │ │
│ └──────────────────────────────────────────────────┘ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ ClickHouse Reader (clickhouseReader/) │ │
│ │ - Direct trace retrieval │ │
│ │ - Waterfall/Flamegraph data processing │ │
│ └──────────────────────────────────────────────────┘ │
└────────────────────┬────────────────────────────────────┘
│ ClickHouse Protocol
┌────────────────────▼────────────────────────────────────┐
│ ClickHouse Database │
│ - signoz_traces.distributed_signoz_index_v3 │
│ - signoz_traces.distributed_trace_summary │
│ - signoz_traces.distributed_tag_attributes_v2 │
└──────────────────────────────────────────────────────────┘
```
---
## Data Models
### Core Trace Models
**Location**: `pkg/query-service/model/trace.go`
### Query Request Models
**Location**: `pkg/types/querybuildertypes/querybuildertypesv5/`
- `QueryRangeRequest`: Main query request structure
- `QueryBuilderQuery[TraceAggregation]`: Query builder specification for traces
- `QueryBuilderTraceOperator`: Trace operator query specification
- `CompositeQuery`: Container for multiple queries
---
## API Endpoints
### 1. Query Range API (V5) - Primary Query Endpoint
**Endpoint**: `POST /api/v5/query_range`
**Handler**: `QuerierAPI.QueryRange` → `querier.QueryRange`
**Purpose**: Main query endpoint for traces, logs, and metrics. Supports:
- Query builder queries
- Trace operator queries
- Aggregations, filters, group by
- Time series, scalar, and raw data requests
> **Note**: For detailed information about the Query Range API, including request/response models, query types, and common code flows, see the [Query Range API Documentation](./QUERY_RANGE_API.md).
**Trace-Specific Details**:
- Uses `traceQueryStatementBuilder` for SQL generation
- Supports trace-specific aggregations (count, avg, p99, etc. on duration_nano)
- Trace operator queries combine multiple trace queries with set operations
- Time range optimization when `trace_id` filter is present
**Key Files**:
- `pkg/telemetrytraces/statement_builder.go` - Trace SQL generation
- `pkg/telemetrytraces/trace_operator_statement_builder.go` - Trace operator SQL
- `pkg/querier/trace_operator_query.go` - Trace operator execution
### 2. Waterfall View API
**Endpoint**: `POST /api/v2/traces/waterfall/{traceId}`
**Handler**: `GetWaterfallSpansForTraceWithMetadata`
**Purpose**: Retrieves spans for waterfall visualization with metadata
**Request Parameters**:
```go
type GetWaterfallSpansForTraceWithMetadataParams struct {
SelectedSpanID string // Selected span to focus on
IsSelectedSpanIDUnCollapsed bool // Whether selected span is expanded
UncollapsedSpans []string // List of expanded span IDs
}
```
**Response**:
```go
type GetWaterfallSpansForTraceWithMetadataResponse struct {
StartTimestampMillis uint64 // Trace start time
EndTimestampMillis uint64 // Trace end time
DurationNano uint64 // Total duration
RootServiceName string // Root service
RootServiceEntryPoint string // Entry point operation
TotalSpansCount uint64 // Total spans
TotalErrorSpansCount uint64 // Error spans
ServiceNameToTotalDurationMap map[string]uint64 // Service durations
Spans []*Span // Span tree
HasMissingSpans bool // Missing spans indicator
UncollapsedSpans []string // Expanded spans
}
```
**Code Flow**:
```
Handler → ClickHouseReader.GetWaterfallSpansForTraceWithMetadata
→ Query trace_summary for time range
→ Query spans from signoz_index_v3
→ Build span tree structure
→ Apply uncollapsed/selected span logic
→ Return filtered spans (500 span limit)
```
**Key Files**:
- `pkg/query-service/app/http_handler.go:1748` - Handler
- `pkg/query-service/app/clickhouseReader/reader.go:873` - Implementation
- `pkg/query-service/app/traces/tracedetail/waterfall.go` - Tree processing
### 3. Flamegraph View API
**Endpoint**: `POST /api/v2/traces/flamegraph/{traceId}`
**Handler**: `GetFlamegraphSpansForTrace`
**Purpose**: Retrieves spans organized by level for flamegraph visualization
**Request Parameters**:
```go
type GetFlamegraphSpansForTraceParams struct {
SelectedSpanID string // Selected span ID
}
```
**Response**:
```go
type GetFlamegraphSpansForTraceResponse struct {
StartTimestampMillis uint64 // Trace start
EndTimestampMillis uint64 // Trace end
DurationNano uint64 // Total duration
Spans [][]*FlamegraphSpan // Spans organized by level
}
```
**Code Flow**:
```
Handler → ClickHouseReader.GetFlamegraphSpansForTrace
→ Query trace_summary for time range
→ Query spans from signoz_index_v3
→ Build span tree
→ BFS traversal to organize by level
→ Sample spans (50 levels, 100 spans/level max)
→ Return level-organized spans
```
**Key Files**:
- `pkg/query-service/app/http_handler.go:1781` - Handler
- `pkg/query-service/app/clickhouseReader/reader.go:1091` - Implementation
- `pkg/query-service/app/traces/tracedetail/flamegraph.go` - BFS processing
### 4. Trace Fields API
**Endpoint**:
- `GET /api/v2/traces/fields` - Get available trace fields
- `POST /api/v2/traces/fields` - Update trace field metadata
**Handler**: `traceFields`, `updateTraceField`
**Purpose**: Manage trace field metadata for query builder
**Key Files**:
- `pkg/query-service/app/http_handler.go:4912` - Get handler
- `pkg/query-service/app/http_handler.go:4921` - Update handler
### 5. Trace Funnels API
**Endpoint**: `/api/v1/trace-funnels/*`
**Purpose**: Manage trace funnels (multi-step trace analysis)
**Endpoints**:
- `POST /api/v1/trace-funnels/new` - Create funnel
- `GET /api/v1/trace-funnels/list` - List funnels
- `GET /api/v1/trace-funnels/{funnel_id}` - Get funnel
- `PUT /api/v1/trace-funnels/{funnel_id}` - Update funnel
- `DELETE /api/v1/trace-funnels/{funnel_id}` - Delete funnel
- `POST /api/v1/trace-funnels/{funnel_id}/analytics/*` - Analytics endpoints
**Key Files**:
- `pkg/query-service/app/http_handler.go:5084` - Route registration
- `pkg/modules/tracefunnel/` - Funnel implementation
---
## Code Flows
### Flow 1: Query Range Request (V5)
This is the primary query flow for traces. For the complete flow covering all query types, see the [Query Range API Documentation](./QUERY_RANGE_API.md#code-flow).
**Trace-Specific Flow**:
```
1. HTTP Request
POST /api/v5/query_range
2. Querier.QueryRange (common flow - see QUERY_RANGE_API.md)
3. Trace Query Processing:
a. Builder Query (QueryTypeBuilder with SignalTraces):
- newBuilderQuery() creates builderQuery instance
- Uses traceStmtBuilder (traceQueryStatementBuilder)
b. Trace Operator Query (QueryTypeTraceOperator):
- newTraceOperatorQuery() creates traceOperatorQuery
- Uses traceOperatorStmtBuilder
4. Trace Statement Building
traceQueryStatementBuilder.Build() (pkg/telemetrytraces/statement_builder.go:58)
- Resolves trace field keys from metadata store
- Optimizes time range if trace_id filter present (queries trace_summary)
- Maps fields using traceFieldMapper
- Builds conditions using traceConditionBuilder
- Builds SQL based on request type:
* RequestTypeRaw → buildListQuery()
* RequestTypeTimeSeries → buildTimeSeriesQuery()
* RequestTypeScalar → buildScalarQuery()
* RequestTypeTrace → buildTraceQuery()
5. Query Execution
builderQuery.Execute() (pkg/querier/builder_query.go)
- Executes SQL against ClickHouse (signoz_traces database)
- Processes results into response format
6. Result Processing (common flow - see QUERY_RANGE_API.md)
- Merges results from multiple queries
- Applies formulas if present
- Handles caching
7. HTTP Response
- Returns QueryRangeResponse with trace results
```
**Trace-Specific Key Components**:
- `pkg/telemetrytraces/statement_builder.go` - Trace SQL generation
- `pkg/telemetrytraces/field_mapper.go` - Trace field mapping
- `pkg/telemetrytraces/condition_builder.go` - Trace filter building
- `pkg/telemetrytraces/trace_operator_statement_builder.go` - Trace operator SQL
### Flow 2: Waterfall View Request
```
1. HTTP Request
POST /api/v2/traces/waterfall/{traceId}
2. GetWaterfallSpansForTraceWithMetadata handler
- Extracts traceId from URL
- Parses request body for params
3. ClickHouseReader.GetWaterfallSpansForTraceWithMetadata
- Checks cache first (5 minute TTL)
4. If cache miss:
a. Query trace_summary table
SELECT * FROM distributed_trace_summary WHERE trace_id = ?
- Gets time range (start, end, num_spans)
b. Query spans table
SELECT ... FROM distributed_signoz_index_v3
WHERE trace_id = ?
AND ts_bucket_start >= ? AND ts_bucket_start <= ?
- Retrieves all spans for trace
c. Build span tree
- Parse references to build parent-child relationships
- Identify root spans (no parent)
- Calculate service durations
d. Cache result
5. Apply selection logic
tracedetail.GetSelectedSpans()
- Traverses tree based on uncollapsed spans
- Finds path to selected span
- Returns sliding window (500 spans max)
6. HTTP Response
- Returns spans with metadata
```
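The tree-building step (4c) can be sketched as follows, using simplified span rows. The real implementation derives the parent ID from span references and also computes per-service durations:

```go
package main

import "fmt"

// rawSpan is a minimal stand-in for the span rows read from
// signoz_index_v3; ParentID comes from the span's references.
type rawSpan struct {
	ID       string
	ParentID string
	Name     string
}

type treeNode struct {
	Span     rawSpan
	Children []*treeNode
}

// buildSpanTree links spans into parent→child trees and returns the roots:
// spans with no parent, or whose parent is absent from the trace
// (the "missing spans" case).
func buildSpanTree(spans []rawSpan) []*treeNode {
	nodes := map[string]*treeNode{}
	for _, s := range spans {
		nodes[s.ID] = &treeNode{Span: s}
	}
	var roots []*treeNode
	for _, s := range spans {
		node := nodes[s.ID]
		if parent, ok := nodes[s.ParentID]; ok && s.ParentID != "" {
			parent.Children = append(parent.Children, node)
		} else {
			roots = append(roots, node)
		}
	}
	return roots
}

func main() {
	spans := []rawSpan{
		{ID: "1", Name: "GET /orders"},
		{ID: "2", ParentID: "1", Name: "SELECT orders"},
		{ID: "3", ParentID: "1", Name: "charge card"},
	}
	roots := buildSpanTree(spans)
	fmt.Println(len(roots), len(roots[0].Children)) // 1 2
}
```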
**Key Components**:
- `pkg/query-service/app/clickhouseReader/reader.go:873`
- `pkg/query-service/app/traces/tracedetail/waterfall.go`
- `pkg/query-service/model/trace.go`
### Flow 3: Trace Operator Query
Trace operators allow combining multiple trace queries with set operations.
```
1. QueryRangeRequest with QueryTypeTraceOperator
2. Querier identifies trace operator queries
- Parses expression to find dependencies
- Collects referenced queries
3. traceOperatorStatementBuilder.Build()
- Parses expression (e.g., "A AND B", "A OR B")
- Builds expression tree
4. traceOperatorCTEBuilder.build()
- Creates CTEs (Common Table Expressions) for each query
- Builds final query with set operations:
* AND → INTERSECT
* OR → UNION
* NOT → EXCEPT
5. Execute combined query
- Returns traces matching the operator expression
```
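The operator-to-set-operation mapping in step 4 can be sketched as:

```go
package main

import (
	"fmt"
	"strings"
)

// setOperation maps a trace operator to the ClickHouse set operation used
// when combining the per-query CTEs.
func setOperation(op string) (string, error) {
	switch strings.ToUpper(op) {
	case "AND":
		return "INTERSECT", nil
	case "OR":
		return "UNION", nil
	case "NOT":
		return "EXCEPT", nil
	}
	return "", fmt.Errorf("unsupported trace operator %q", op)
}

func main() {
	// "A AND B" over two CTEs becomes roughly:
	//   SELECT trace_id FROM cte_A INTERSECT SELECT trace_id FROM cte_B
	setOp, _ := setOperation("AND")
	fmt.Printf("SELECT trace_id FROM cte_A %s SELECT trace_id FROM cte_B\n", setOp)
}
```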
**Key Components**:
- `pkg/telemetrytraces/trace_operator_statement_builder.go`
- `pkg/telemetrytraces/trace_operator_cte_builder.go`
- `pkg/querier/trace_operator_query.go`
---
## Key Components
> **Note**: For common components used across all signals (Querier, TelemetryStore, MetadataStore, etc.), see the [Query Range API Documentation](./QUERY_RANGE_API.md#key-components).
### 1. Trace Statement Builder
**Location**: `pkg/telemetrytraces/statement_builder.go`
**Purpose**: Converts trace query builder specifications into ClickHouse SQL
**Key Methods**:
- `Build()`: Main entry point, builds SQL statement
- `buildListQuery()`: Builds query for raw/list results
- `buildTimeSeriesQuery()`: Builds query for time series
- `buildScalarQuery()`: Builds query for scalar values
- `buildTraceQuery()`: Builds query for trace-specific results
**Key Features**:
- Trace field resolution via metadata store
- Time range optimization for trace_id filters (queries trace_summary first)
- Support for trace aggregations, filters, group by, ordering
- Calculated field support (http_method, db_name, has_error, etc.)
- Resource filter support via resourceFilterStmtBuilder
### 2. Trace Field Mapper
**Location**: `pkg/telemetrytraces/field_mapper.go`
**Purpose**: Maps trace query field names to ClickHouse column names
**Field Types**:
- **Intrinsic Fields**: Built-in fields (trace_id, span_id, duration_nano, name, kind_string, status_code_string, etc.)
- **Calculated Fields**: Derived fields (http_method, db_name, has_error, response_status_code, etc.)
- **Attribute Fields**: Dynamic span/resource attributes (accessed via attributes_string, attributes_number, attributes_bool, resources_string)
**Example Mapping**:
```
"service.name" → "resource_string_service$$name"
"http.method" → Calculated from attributes_string['http.method']
"duration_nano" → "duration_nano" (intrinsic)
"trace_id" → "trace_id" (intrinsic)
```
**Key Methods**:
- `MapField()`: Maps a field to ClickHouse expression
- `MapAttribute()`: Maps attribute fields
- `MapResource()`: Maps resource fields
### 3. Trace Condition Builder
**Location**: `pkg/telemetrytraces/condition_builder.go`
**Purpose**: Builds WHERE clause conditions from trace filter expressions
**Supported Operators**:
- `=`, `!=`, `IN`, `NOT IN`
- `>`, `>=`, `<`, `<=`
- `LIKE`, `NOT LIKE`, `ILIKE`
- `EXISTS`, `NOT EXISTS`
- `CONTAINS`, `NOT CONTAINS`
**Key Methods**:
- `BuildCondition()`: Builds condition from filter expression
- Handles attribute, resource, and intrinsic field filtering
### 4. Trace Operator Statement Builder
**Location**: `pkg/telemetrytraces/trace_operator_statement_builder.go`
**Purpose**: Builds SQL for trace operator queries (AND, OR, NOT operations on trace queries)
**Key Methods**:
- `Build()`: Builds CTE-based SQL for trace operators
- Uses `traceOperatorCTEBuilder` to create Common Table Expressions
**Features**:
- Parses operator expressions (e.g., "A AND B")
- Creates CTEs for each referenced query
- Combines results using INTERSECT, UNION, EXCEPT
### 5. ClickHouse Reader (Trace-Specific Methods)
**Location**: `pkg/query-service/app/clickhouseReader/reader.go`
**Purpose**: Direct trace data retrieval and processing (bypasses query builder)
**Key Methods**:
- `GetWaterfallSpansForTraceWithMetadata()`: Waterfall view data
- `GetFlamegraphSpansForTrace()`: Flamegraph view data
- `SearchTraces()`: Legacy trace search (still used for some flows)
- `GetMinAndMaxTimestampForTraceID()`: Time range optimization helper
**Caching**: Implements 5-minute cache for trace detail views
**Note**: These methods are used for trace-specific visualizations. For general trace queries, use the Query Range API.
---
## Query Building System
> **Note**: For general query building concepts and patterns, see the [Query Range API Documentation](./QUERY_RANGE_API.md). This section covers trace-specific aspects.
### Trace Query Builder Structure
A trace query consists of:
```go
QueryBuilderQuery[TraceAggregation] {
Name: "query_name",
Signal: SignalTraces,
Filter: &Filter {
Expression: "service.name = 'api' AND duration_nano > 1000000"
},
Aggregations: []TraceAggregation {
{Expression: "count()", Alias: "total"},
{Expression: "avg(duration_nano)", Alias: "avg_duration"},
{Expression: "p99(duration_nano)", Alias: "p99"},
},
GroupBy: []GroupByKey {
{TelemetryFieldKey: {Name: "service.name", ...}},
},
Order: []OrderBy {...},
Limit: 100,
}
```
### Trace-Specific SQL Generation Process
1. **Field Resolution**:
- Resolve trace field names using `traceFieldMapper`
- Handle intrinsic, calculated, and attribute fields
   - Map to ClickHouse columns (e.g., `service.name` → `resource_string_service$$name`)
2. **Time Range Optimization**:
- If `trace_id` filter present, query `trace_summary` first
- Narrow time range based on trace start/end times
- Reduces data scanned significantly
3. **Filter Building**:
- Convert filter expression using `traceConditionBuilder`
- Handle attribute filters (attributes_string, attributes_number, attributes_bool)
- Handle resource filters (resources_string)
- Handle intrinsic field filters
4. **Aggregation Building**:
- Build SELECT with trace aggregations
- Support trace-specific functions (count, avg, p99, etc. on duration_nano)
5. **Group By Building**:
- Add GROUP BY clause with trace fields
- Support grouping by service.name, operation name, etc.
6. **Order Building**:
- Add ORDER BY clause
- Support ordering by duration, timestamp, etc.
7. **Limit/Offset**:
- Add pagination
### Example Generated SQL
For query: `count() WHERE service.name = 'api' GROUP BY service.name`
```sql
SELECT
count() AS total,
resource_string_service$$name AS service_name
FROM signoz_traces.distributed_signoz_index_v3
WHERE
timestamp >= toDateTime64(1234567890/1e9, 9)
AND timestamp <= toDateTime64(1234567899/1e9, 9)
AND ts_bucket_start >= toDateTime64(1234567890/1e9, 9)
AND ts_bucket_start <= toDateTime64(1234567899/1e9, 9)
AND resource_string_service$$name = 'api'
GROUP BY resource_string_service$$name
```
**Note**: The query uses `ts_bucket_start` for efficient time filtering (partitioning column).
---
## Storage Schema
### Main Tables
**Location**: `pkg/telemetrytraces/tables.go`
#### 1. `distributed_signoz_index_v3`
Main span index table. Stores all span data.
**Key Columns**:
- `timestamp`: Span timestamp
- `duration_nano`: Span duration
- `span_id`, `trace_id`: Identifiers
- `has_error`: Error indicator
- `kind`: Span kind
- `name`: Operation name
- `attributes_string`, `attributes_number`, `attributes_bool`: Attributes
- `resources_string`: Resource attributes
- `events`: Span events
- `status_code_string`, `status_message`: Status
- `ts_bucket_start`: Time bucket for partitioning
#### 2. `distributed_trace_summary`
Trace-level summary for quick lookups.
**Columns**:
- `trace_id`: Trace identifier
- `start`: Earliest span timestamp
- `end`: Latest span timestamp
- `num_spans`: Total span count
#### 3. `distributed_tag_attributes_v2`
Metadata table for attribute keys.
**Purpose**: Stores available attribute keys for autocomplete
#### 4. `distributed_span_attributes_keys`
Span attribute keys metadata.
**Purpose**: Tracks which attributes exist in spans
### Database
All trace tables are in the `signoz_traces` database.
---
## Extending the Traces Module
### Adding a New Calculated Field
1. **Define Field in Constants** (`pkg/telemetrytraces/const.go`):
```go
CalculatedFields = map[string]telemetrytypes.TelemetryFieldKey{
"my_new_field": {
Name: "my_new_field",
Description: "Description of the field",
Signal: telemetrytypes.SignalTraces,
FieldContext: telemetrytypes.FieldContextSpan,
FieldDataType: telemetrytypes.FieldDataTypeString,
},
}
```
2. **Implement Field Mapping** (`pkg/telemetrytraces/field_mapper.go`):
```go
func (fm *fieldMapper) MapField(field telemetrytypes.TelemetryFieldKey) (string, error) {
if field.Name == "my_new_field" {
// Return ClickHouse expression
return "attributes_string['my.attribute.key']", nil
}
// ... existing mappings
}
```
3. **Update Condition Builder** (if needed for filtering):
```go
// In condition_builder.go, add support for your field
```
### Adding a New API Endpoint
1. **Add Handler Method** (`pkg/query-service/app/http_handler.go`):
```go
func (aH *APIHandler) MyNewTraceHandler(w http.ResponseWriter, r *http.Request) {
// Extract parameters
// Call reader or querier
// Return response
}
```
2. **Register Route** (in `RegisterRoutes` or separate method):
```go
router.HandleFunc("/api/v2/traces/my-endpoint",
am.ViewAccess(aH.MyNewTraceHandler)).Methods(http.MethodPost)
```
3. **Implement Logic**:
- Add to `ClickHouseReader` if direct DB access needed
- Or use `Querier` for query builder queries
### Adding a New Aggregation Function
1. **Update Aggregation Rewriter** (`pkg/querybuilder/agg_expr_rewriter.go`):
```go
func (r *aggExprRewriter) RewriteAggregation(expr string) (string, error) {
    // Add parsing for your function
    if strings.HasPrefix(expr, "my_function(") {
        // Return the equivalent ClickHouse SQL expression
        return "myClickHouseFunction(...)", nil
    }
    // ... existing handling for built-in functions
    return expr, nil
}
```
2. **Update Statement Builder** (if special handling needed):
```go
// In statement_builder.go, add special case if needed
```
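The rewrite step above can be sketched as a standalone function. Everything here is illustrative: `my_function` and `myClickHouseFunction` are the hypothetical names from the step, not real SigNoz or ClickHouse functions.

```go
package main

import (
	"fmt"
	"strings"
)

// rewriteAgg mirrors the rewriter step above: it maps a hypothetical
// query-builder function onto a ClickHouse expression and passes every
// other expression through unchanged for the existing handling.
func rewriteAgg(expr string) (string, error) {
	if inner, ok := strings.CutPrefix(expr, "my_function("); ok {
		if !strings.HasSuffix(inner, ")") {
			return "", fmt.Errorf("unterminated call: %q", expr)
		}
		arg := strings.TrimSuffix(inner, ")")
		return "myClickHouseFunction(" + arg + ")", nil
	}
	return expr, nil // unknown functions fall through to existing handling
}

func main() {
	out, _ := rewriteAgg("my_function(duration_nano)")
	fmt.Println(out) // myClickHouseFunction(duration_nano)
}
```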
### Adding Trace Operator Support
Trace operators are already extensible. To add a new operator:
1. **Update Grammar** (`grammar/TraceOperatorGrammar.g4`):
```antlr
operator: AND | OR | NOT | MY_NEW_OPERATOR;
```
2. **Update CTE Builder** (`pkg/telemetrytraces/trace_operator_cte_builder.go`):
```go
func (b *traceOperatorCTEBuilder) buildOperatorQuery(op TraceOperatorType) string {
    switch op {
    case TraceOperatorTypeMyNewOperator:
        return "MY_CLICKHOUSE_OPERATION"
    }
    // ... existing operator cases
    return ""
}
```
---
## Common Patterns
### Pattern 1: Query with Filter
```go
query := qbtypes.QueryBuilderQuery[qbtypes.TraceAggregation]{
Name: "filtered_traces",
Signal: telemetrytypes.SignalTraces,
Filter: &qbtypes.Filter{
Expression: "service.name = 'api' AND duration_nano > 1000000",
},
Aggregations: []qbtypes.TraceAggregation{
{Expression: "count()", Alias: "total"},
},
}
```
### Pattern 2: Time Series Query
```go
query := qbtypes.QueryBuilderQuery[qbtypes.TraceAggregation]{
Name: "time_series",
Signal: telemetrytypes.SignalTraces,
Aggregations: []qbtypes.TraceAggregation{
{Expression: "avg(duration_nano)", Alias: "avg_duration"},
},
GroupBy: []qbtypes.GroupByKey{
{TelemetryFieldKey: telemetrytypes.TelemetryFieldKey{
Name: "service.name",
FieldContext: telemetrytypes.FieldContextResource,
}},
},
StepInterval: qbtypes.Step{Duration: time.Minute},
}
```
### Pattern 3: Trace Operator Query
```go
query := qbtypes.QueryBuilderTraceOperator{
Name: "operator_query",
Expression: "A AND B", // A and B are query names
Filter: &qbtypes.Filter{
Expression: "duration_nano > 5000000",
},
}
```
---
## Performance Considerations
### Caching
- **Trace Detail Views**: 5-minute cache for waterfall/flamegraph
- **Query Results**: Bucket-based caching in querier
- **Metadata**: Cached attribute keys and field metadata
### Query Optimization
1. **Time Range Optimization**: When `trace_id` appears in the filter, query `trace_summary` first to narrow the time range before scanning the span index
2. **Index Usage**: Queries use `ts_bucket_start` for time filtering
3. **Limit Enforcement**: Waterfall/flamegraph have span limits (500/50)
### Best Practices
1. **Use Query Builder**: Prefer query builder over raw SQL for better optimization
2. **Limit Time Ranges**: Always specify reasonable time ranges
3. **Use Aggregations**: For large datasets, use aggregations instead of raw data
4. **Cache Awareness**: Be mindful of cache TTLs when testing
---
## References
### Key Files
- `pkg/telemetrytraces/` - Core trace query building
- `statement_builder.go` - Trace SQL generation
- `field_mapper.go` - Trace field mapping
- `condition_builder.go` - Trace filter building
- `trace_operator_statement_builder.go` - Trace operator SQL
- `pkg/query-service/app/clickhouseReader/reader.go` - Direct trace access
- `pkg/query-service/app/http_handler.go` - API handlers
- `pkg/query-service/model/trace.go` - Data models
### Related Documentation
- [Query Range API Documentation](./QUERY_RANGE_API.md) - Common query_range API details
- [OpenTelemetry Specification](https://opentelemetry.io/docs/specs/)
- [ClickHouse Documentation](https://clickhouse.com/docs)
- [Query Builder Guide](../contributing/go/query-builder.md)
---
## Contributing
When contributing to the traces module:
1. **Follow Existing Patterns**: Match the style of existing code
2. **Add Tests**: Include unit tests for new functionality
3. **Update Documentation**: Update this doc for significant changes
4. **Consider Performance**: Optimize queries and use caching appropriately
5. **Handle Errors**: Provide meaningful error messages
For questions or help, reach out to the maintainers or open an issue.