Files
signoz/pkg
Jatinderjit Singh 6cf22e98dd feat(planned-downtime): scope maintenance windows to label expressions (#11186)
* add maintenanceMuteStage to move planned maintenance to alertmanager

Rules previously skipped rule.Eval() entirely during maintenance windows.
This change moves suppression to MaintenanceMuter, injected as a Stage
in the alertmanager notification pipeline. Now rules always evaluate and
everys suppression is handled by alertmanager.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: wrap routing pipeline once instead of per-route injection

Replace the per-route-entry loop with a single MultiStage wrap so
maintenance suppression runs once per dispatch group before routing.

* refactor: move maintenance mute stage into custom pipelineBuilder

Copy notify.PipelineBuilder locally so we can inject mms between the
silence stage and the receiver stage (GossipSettle → Inhibit →
TimeActive → TimeMute → Silence → mms → Receiver), matching the
correct suppression order the team requires.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: add license header to pipeline_builder.go

Copied code originates from Apache-2.0 licensed Prometheus Alertmanager;
add dual copyright + SPDX identifier following the repo's convention.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: replace SPDX tag with full Apache 2.0 license boilerplate

The full license text is unambiguously compliant with Apache 2.0 Section 4(a),
which requires giving recipients "a copy of this License".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: pass MaintenanceMuter directly to pipelineBuilder

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: remove dead orgID param from task constructors

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* rename buildReceiverStage -> createReceiverStage

* refactor: replace maintenanceMuteStage with notify.NewMuteStage

MaintenanceMuter already satisfies types.Muter, and pipelineBuilder has
its own pb.metrics, so the hand-rolled maintenanceMuteStage wrapper is
redundant. Use notify.NewMuteStage(pb.muter, pb.metrics) directly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: hoist MuteStage construction out of the receiver loop

MuteStage holds no per-receiver state, so one instance shared across
all receivers is sufficient — matching how is/ss are handled upstream.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: always initialize maintenanceStore; remove nil guards

Tests now use a real sqlrulestore-backed MaintenanceMuter instead of
passing nil. With nil no longer a valid input, remove the nil guards
in server.go and pipeline_builder.go.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: move MaintenanceMuter to Server and pass it to pipelineBuilder.New

- Remove muter from pipelineBuilder struct and newPipelineBuilder();
  pass it as a parameter to New() instead, consistent with inhibitor/silencer
- Store muter on Server so GetAlerts can call Mutes() alongside the
  inhibitor and silencer, ensuring maintenance-suppressed alerts show
  the correct muted status in API responses

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* remove redundant MemMarker wrapper

* feat: surface maintenance-suppressed alerts via mutedBy in GetAlerts

Alerts suppressed by an active maintenance window were being correctly
muted in the notification pipeline but appeared as state=active in the
v2 GetAlerts response, since MaintenanceMuter.Mutes had no marker
side-effect (unlike inhibitor/silencer).

Add MaintenanceMuter.MutedBy returning the matching window IDs, and
plumb a mutedByFunc callback through NewGettableAlertsFromAlertProvider
into AlertToOpenAPIAlert. The upstream v2 API forces state=suppressed
when mutedBy is non-empty, so the frontend's existing state-based
rendering picks it up without further changes.

Use the dedicated mutedBy field rather than SilencedBy to avoid
violating the "complete set of silence IDs" contract that anything
querying silences by ID would rely on.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* code cleanup

* refactor: move maintenance (planned downtime) to alertmanager packages

Types move from pkg/types/ruletypes/ to pkg/types/alertmanagertypes/:
- maintenance.go, recurrence.go, schedule.go (+ tests)

Store impl moves from pkg/ruler/rulestore/sqlrulestore/ to
pkg/alertmanager/alertmanagerstore/sqlalertmanagerstore/.

Maintenance windows mute alerts, so they belong with alertmanager
rather than the rule types.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test: add unit tests for MaintenanceMuter

Covers Mutes/MutedBy semantics (empty label, rule match, empty-RuleIDs
matches-all, future windows, multi-window) and the result cache
(single-fetch within TTL, stale-cache fallback on store error,
re-fetch after expiry, concurrency safety).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Update schema changes

* Re-add marker

* fix NewMaintenanceStore in tests

* Go lint fixes

* test: use mockery-generated mock for MaintenanceStore in muter tests

Replace hand-written fakeMaintenanceStore with a mockery-generated
MockMaintenanceStore, consistent with the alertmanagertest pattern.
Also adds MaintenanceStore to .mockery.yml so the mock stays in sync.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: regenerate mocks via make gen-mocks

Picks up new MockHandler for the Handler interface in pkg/alertmanager
and regenerates MockMaintenanceStore with canonical mockery formatting.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* cleanup test

* test: add e2e muting tests for maintenance window behaviour

* Add label expression support to planned downtime

Alert instances can now be scoped by label expression
(e.g. env == "prod"), scoping suppression below the rule level.
A window with no rule IDs and a label expression silences any
alert whose labels match, regardless of which rule fired it;
when rule IDs are also present, the expression is evaluated only
within the matched rules.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Remove redundant || undefined from labelExpression assignment

* Move label expression evaluation into ShouldSkip

ShouldSkip now owns all three suppression checks in sequence:
rule ID match → schedule active → label expression.
IsActive passes nil labels so the expression check is skipped
(no instance labels available for UI status).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* remove redundant LabelSet->map conversion

* implement Down migration to drop label_expression column

* fix lset type and update openapi spec

* fix(tests): resolve envprovider env isolation, factory name length, and ShouldSkip signature

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* remove unused function `evaluateLabelExpression`

* chore: rename label expression to scope

* test(maintenance): add tests for scope label expression filtering

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* test(e2e): verify scope-based maintenance muting in alertmanager flow

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Use `AND` instead of `&&`

Co-authored-by: Srikanth Chekuri <srikanth.chekuri92@gmail.com>

* fix(maintenance): consolidate label-set-to-env conversion to avoid expr panic

Move ConvertLabelSetToEnv to alertmanagertypes so both the maintenance
scope evaluator and the route-policy evaluator share one implementation.
Dotted label keys (e.g. kubernetes.node) are expanded into nested maps,
preventing the expr-lang panic that occurs when one key is a prefix of
another.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore(planned-downtime): use clickable learn more link in scope tooltip

* chore(sqlmigration): renumber add_scope_to_planned_maintenance to 086

078 collided with add_sa_managed_role_txn; bumped to the next free number
and reordered registration after add_source_to_dashboard (085).

* refactor(maintenance): extract scope expression eval and surface errors

- Move ConvertLabelSetToEnv and EvalScopeExpression into expression.go
  with companion tests in expression_test.go.
- EvalScopeExpression now returns (bool, error) instead of swallowing
  compile/run failures and non-bool outputs; ShouldSkip logs the error
  via slog.Default() and falls back to not suppressing (safety-first).
- Update test fixtures to the SQL-style operator form (`=`, AND, OR)
  matching the placeholder and reviewer suggestions.

* chore: use `=` instead of `==` in expressions

* fix(maintenance): satisfy forbidigo/sloglint in scope eval

- Replace fmt.Errorf with pkg/errors Wrapf/Newf using a new
  ErrCodeInvalidScopeExpression code.
- Use slog ErrorContext (with context.Background()) instead of Error to
  satisfy sloglint.

* perf(maintenance): fold prefix-conflict detection into ConvertLabelSetToEnv

ConvertLabelSetToEnv now returns (env, conflict). The rulebased provider
drops its O(n^2) pre-scan and logs based on the flag, restoring the
previous O(n*d) cost while keeping the shared helper.

* chore: add docs URL for invalid scope

* refactor: don't log in types package

* remove down migration

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Srikanth Chekuri <srikanth.chekuri92@gmail.com>
2026-05-25 13:40:06 +00:00
..
2026-05-18 10:50:27 +00:00