Incident-Response & Alert-Routing
In summary
Formalised incident-response process for the platform: Sev1/Sev2/Sev3 severity matrix with escalation paths and SLA wording, documented on-call rotation, blameless postmortems, Azure Monitor action group provisioned via Bicep, concrete alert rules for 5xx rate, Cosmos RU throttling, ACS email bounces, container restarts, API latency and Stripe webhook errors, plus a public IncidentLog driving the multilingual live status page.
How it works
The severity matrix in docs/engineering/operations/incident-response.md formalises Sev1/Sev2/Sev3 with explicit escalation paths, response times and SLA wording so an on-call engineer at 3 a.m. does not have to invent triage. Sev1 is total or partial outage with revenue or data integrity impact and a 15-minute response target; Sev2 is degraded service with workaround; Sev3 is non-urgent. The on-call rotation in docs/engineering/operations/on-call-rotation.md publishes the schedule, contact details and vacation handling so paging never lands on an unreachable number. Every Sev1 and Sev2 closes with a blameless postmortem from the docs/engineering/operations/postmortem-template.md, including timeline, root cause, contributing factors and an action list with owners.
Alerting runs through Azure Monitor. Action group ag-petanque-primary (email + Slack webhook) is provisioned via infrastructure/monitoring.bicep and applied with infrastructure/deploy-monitoring.sh, so the channel topology is in version control. Sev1 alert rules cover 5xx-rate > 1 % over 5 minutes, Cosmos RU throttling > 10 % over 5 minutes and ACS email bounce rate > 5 % over 15 minutes — failures that demand immediate attention. Sev2 covers Container App restarts > 3/h, API latency p95 > 2 s over 10 minutes and Stripe webhook errors > 5 % — degraded but not bleeding.
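Every rule above has the same shape: a ratio over an evaluation window compared against a threshold. A toy check illustrating that shape (the real evaluation happens inside Azure Monitor; this function and its name are only illustrative):

```python
def exceeds_rate_threshold(error_count: int, total_count: int, threshold_pct: float) -> bool:
    """Return True when errors/total over the window exceed the threshold.

    Mirrors the shape of the rules above (e.g. 5xx-rate > 1 % over 5 min).
    Azure Monitor does this server-side; this sketch is illustrative only.
    """
    if total_count == 0:
        # No traffic in the window: nothing to alert on.
        return False
    return (error_count / total_count) * 100 > threshold_pct


# Example: 120 5xx responses out of 10 000 requests in 5 minutes is 1.2 %,
# which crosses the 1 % Sev1 threshold.
fires = exceeds_rate_threshold(120, 10_000, 1.0)  # → True
```

The zero-traffic guard matters in the real rules too: a ratio alert on an idle service should stay silent rather than divide by zero or page on noise.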
The incident lifecycle is captured server-side in IncidentLog (a Beanie model) with severity, workflow status (investigating / identified / monitoring / resolved), public_message and an updates list. Platform admins create and update incidents through POST /admin/incidents and PATCH /admin/incidents/{id}; updates are append-only on the embedded list so the public timeline is never edited after the fact. GET /public/incidents?since=<iso> serves the last 30 days of public incidents, sorted newest-first, with only public fields exposed. The status page (www/src/pages/status.astro plus fr/es/sv variants) reads this endpoint and renders severity and workflow badges so users see the same picture the on-call sees, without authentication.
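The append-only lifecycle can be sketched with stdlib dataclasses standing in for the Beanie document; field and method names here are assumed for illustration, not taken from the codebase:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

WORKFLOW = ("investigating", "identified", "monitoring", "resolved")


@dataclass
class IncidentUpdate:
    status: str
    public_message: str
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class IncidentLog:
    """Stdlib stand-in for the Beanie document described above."""
    severity: str            # "sev1" | "sev2" | "sev3"
    public_message: str
    status: str = "investigating"
    updates: list = field(default_factory=list)

    def append_update(self, status: str, public_message: str) -> None:
        # Append-only: the embedded list only grows, so the public
        # timeline is never edited after the fact.
        if status not in WORKFLOW:
            raise ValueError(f"unknown workflow status: {status}")
        self.updates.append(IncidentUpdate(status, public_message))
        self.status = status
        self.public_message = public_message
```

A PATCH that tried to rewrite history would have nowhere to go: the only mutation path appends a new timestamped entry and moves the head-of-incident fields forward.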
Key capabilities
- Sev1/Sev2/Sev3 severity matrix with escalation paths, response times and SLA wording
- Documented on-call rotation with schedule, contact details and vacation handling
- Blameless postmortem template covering timeline, root cause, contributing factors and action list
- Azure Monitor action group provisioned via Bicep with email + Slack webhook delivery
- Sev1 alert rules: 5xx-rate, Cosmos RU throttling, ACS email bounce rate
- Sev2 alert rules: Container App restarts, API latency p95, Stripe webhook errors
- IncidentLog model with severity, workflow status, public_message and append-only updates list
- Platform-admin POST/PATCH /admin/incidents endpoints and public GET /public/incidents feed
- Multilingual status page (en/fr/es/sv) rendering live incidents with severity and workflow badges
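The public feed's filtering rules listed above (30-day window, newest-first, public fields only) can be sketched as a pure function over dicts standing in for the IncidentLog collection; the real endpoint queries Mongo via Beanie, and the names here are illustrative:

```python
from datetime import datetime, timedelta, timezone

# Assumed public projection; any internal fields are dropped before serving.
PUBLIC_FIELDS = ("severity", "status", "public_message", "updates", "created_at")


def public_incidents(incidents, since=None, window_days=30):
    """Shape of GET /public/incidents?since=<iso> (illustrative sketch).

    Returns incidents newer than `since` (default: last 30 days),
    sorted newest-first, restricted to public fields.
    """
    cutoff = since or datetime.now(timezone.utc) - timedelta(days=window_days)
    visible = [i for i in incidents if i["created_at"] >= cutoff]
    visible.sort(key=lambda i: i["created_at"], reverse=True)
    return [{k: i[k] for k in PUBLIC_FIELDS if k in i} for i in visible]
```

Because the projection is an allow-list rather than a deny-list, a new internal field added to the model later cannot leak onto the status page by accident.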
In practice
21:43 — 5xx-rate crosses 1 % over 5 minutes. Azure Monitor fires, the action group emails ops@ and posts to #ops-alerts. The Sev1 on-call acknowledges in three minutes, opens an IncidentLog with severity sev1, status investigating, public_message "We are investigating elevated error rates affecting login." The status page reflects it within seconds in four languages.
Twelve minutes later the root cause is identified: a misbehaving downstream OAuth provider. The on-call patches the timeout, updates the incident to status monitoring with a new public_message, then to resolved at 22:11. The next morning the Sev1 owner writes the blameless postmortem from the template: timeline pulled straight from the IncidentLog updates, action items filed back into the backlog with named owners.
Features of this subsystem
| ID | Status | Feature |
|---|---|---|
| F16.15.01 | Delivered | Sev1/Sev2/Sev3 severity matrix with escalation paths, response times and SLA wording (docs/engineering/operations/incident-response.md). ✅ PL-T053 |
| F16.15.02 | Delivered | On-call rotation documented (docs/engineering/operations/on-call-rotation.md) with schedule, contact details and vacation handling. ✅ PL-T053 |
| F16.15.03 | Delivered | Blameless postmortem template (docs/engineering/operations/postmortem-template.md) with timeline, root cause, contributing factors and action list. ✅ PL-T053 |
| F16.15.04 | Delivered | Azure Monitor action group ag-petanque-primary (email + Slack webhook) provisioned via infrastructure/monitoring.bicep. ✅ PL-T053 |
| F16.15.05 | Delivered | Sev1 alerts: 5xx rate > 1 % / 5 min, Cosmos RU throttling > 10 % / 5 min, ACS email bounce > 5 % / 15 min. ✅ PL-T053 |
| F16.15.06 | Delivered | Sev2 alerts: Container App restarts > 3/h, API latency p95 > 2 s / 10 min, Stripe webhook errors > 5 %. ✅ PL-T053 |
| F16.15.07 | Delivered | Deploy script infrastructure/deploy-monitoring.sh to apply the Bicep against Azure. ✅ PL-T053 |
| F16.15.08 | Delivered | IncidentLog Beanie model with severity (sev1/sev2/sev3), workflow status (investigating/identified/monitoring/resolved), public_message and updates list. ✅ PL-T053 |
| F16.15.09 | Delivered | GET /public/incidents?since=<iso> — returns the last 30 days of incidents, public fields only, sorted newest-first. ✅ PL-T053 |
| F16.15.10 | Delivered | POST /admin/incidents + PATCH /admin/incidents/{id} — platform-admin only, creates and updates incidents with severity classification. ✅ PL-T053 |
| F16.15.11 | Delivered | Status page (www/src/pages/status.astro + fr/es/sv variants) updated to fetch incident history from the IncidentLog collection via /public/incidents. Severity and workflow badges in the UI. ✅ PL-T053 |
Related subsystems
Stakeholders who need this subsystem
Appears in 1 stakeholder analysis