SLO and Alerting

Suggested SLOs

  • Decision availability (/v1/decision usable): >= 99.9%
  • Readiness (/readyz): >= 99.95%
  • Reject classification accuracy: no unexplained spikes in any single reject reason
  • SaaS connectivity (SaaS mode only): >= 99.0%
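
One way to track the SaaS connectivity target is to average the `fairvisor_saas_reachable` gauge over the SLO window. The following is a minimal sketch, not a confirmed part of Fairvisor: it assumes the gauge reports 1 when reachable and 0 otherwise, that a 30-day window is the intended SLO period, and that Prometheus retains at least 30 days of data. The group, record, and alert names are hypothetical.

```yaml
groups:
  - name: fairvisor_slo                          # hypothetical group name
    rules:
      # Fraction of the last 30 days during which SaaS was reachable
      # (assumes the gauge is 1 when reachable, 0 otherwise).
      - record: fairvisor:saas_reachable:avg30d  # hypothetical rule name
        expr: avg_over_time(fairvisor_saas_reachable[30d])
      # Fires when measured connectivity drops below the 99.0% target.
      - alert: FairvisorSaasSloBreached          # hypothetical alert name
        expr: fairvisor:saas_reachable:avg30d < 0.99
        for: 15m
        labels:
          severity: warning
```

Because `avg_over_time` is evaluated per series, each instance is compared against the target separately.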

Core alerts

1) Reject spike

- alert: FairvisorRejectSpike
  # rate() is per second: fires when a series sustains more than 50 rejects/s
  expr: rate(fairvisor_decisions_total{action="reject"}[5m]) > 50
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Fairvisor reject rate spike"
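
An absolute threshold such as 50 rejects per second depends on traffic volume, so a ratio-based variant may be easier to carry across environments. The sketch below is an illustration rather than a recommended default: the 20% threshold and the group and alert names are assumptions, and only the `fairvisor_decisions_total` metric and its `action` label come from the rule above.

```yaml
groups:
  - name: fairvisor_reject_ratio          # hypothetical group name
    rules:
      - alert: FairvisorRejectRatioHigh   # hypothetical alert name
        # Share of all decisions that were rejects over the last 5 minutes.
        expr: |
          sum(rate(fairvisor_decisions_total{action="reject"}[5m]))
            /
          sum(rate(fairvisor_decisions_total[5m]))
            > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Fairvisor reject ratio above 20%"
```

With no traffic at all, the division yields NaN, the comparison is false, and the alert stays silent.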

2) No bundle loaded

- alert: FairvisorNoBundleLoaded
  expr: rate(fairvisor_decisions_total{action="reject",reason="no_bundle_loaded"}[5m]) > 0
  for: 2m
  labels:
    severity: critical

3) SaaS disconnected

- alert: FairvisorSaasDisconnected
  expr: fairvisor_saas_reachable == 0
  for: 5m
  labels:
    severity: warning

4) Descriptor mismatch regression

- alert: FairvisorDescriptorMissing
  expr: rate(fairvisor_descriptor_missing_total[5m]) > 0
  for: 10m
  labels:
    severity: warning
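
Alert rules like these can be exercised offline with `promtool test rules` before they are deployed. The following test file is a sketch under stated assumptions: it expects the rules above to be saved under a `groups:` entry in a file named `fairvisor_alerts.yml`, and both file names and the `instance` label value are hypothetical.

```yaml
# fairvisor_alerts_test.yml (hypothetical file name)
rule_files:
  - fairvisor_alerts.yml   # hypothetical: the rules above under a groups: entry

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Gauge stays at 0 for eleven samples (minutes 0 through 10).
      - series: 'fairvisor_saas_reachable{instance="edge-1"}'
        values: '0+0x10'
    alert_rule_test:
      # Past the 5m "for" window, so the alert should be firing.
      - eval_time: 6m
        alertname: FairvisorSaasDisconnected
        exp_alerts:
          - exp_labels:
              severity: warning
              instance: edge-1
```

Running `promtool test rules fairvisor_alerts_test.yml` evaluates the rules against the synthetic series and reports any mismatch between expected and actual alerts.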

Suggested dashboard panels

  • reject rate by reason
  • allow/reject split
  • retry-after bucket distribution
  • saas reachable state
  • top descriptor-missing keys
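
Several of these views can be precomputed as recording rules so that dashboards query a cheap series instead of a raw `rate()`. The sketch below covers only the views whose metric and label names appear in the alert rules above; the retry-after and descriptor-missing views are left out because their bucket metric and key label are not named here. The group and record names are hypothetical.

```yaml
groups:
  - name: fairvisor_dashboard   # hypothetical group name
    rules:
      # Reject rate by reason
      - record: reason:fairvisor_rejects:rate5m
        expr: sum by (reason) (rate(fairvisor_decisions_total{action="reject"}[5m]))
      # Allow/reject split
      - record: action:fairvisor_decisions:rate5m
        expr: sum by (action) (rate(fairvisor_decisions_total[5m]))
      # SaaS reachable state: lowest value across instances,
      # so it reads 0 if any instance is disconnected
      - record: fairvisor:saas_reachable:min
        expr: min(fairvisor_saas_reachable)
```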