Incident-Response & Alert-Routing
In summary
Formalised incident-response process for the platform: Sev1/Sev2/Sev3 severity matrix with escalation paths and SLA wording, documented on-call rotation, blameless postmortems, Azure Monitor action group provisioned via Bicep, concrete alert rules for 5xx rate, Cosmos RU throttling, ACS email bounces, container restarts, API latency and Stripe webhook errors, plus a public IncidentLog driving the multilingual live status page.
How it works
The severity matrix in docs/engineering/operations/incident-response.md formalises Sev1/Sev2/Sev3 with explicit escalation paths, response times and SLA wording so an on-call engineer at 3 a.m. does not have to invent triage. Sev1 is total or partial outage with revenue or data integrity impact and a 15-minute response target; Sev2 is degraded service with workaround; Sev3 is non-urgent. The on-call rotation in docs/engineering/operations/on-call-rotation.md publishes the schedule, contact details and vacation handling so paging never lands on an unreachable number. Every Sev1 and Sev2 closes with a blameless postmortem from the docs/engineering/operations/postmortem-template.md, including timeline, root cause, contributing factors and an action list with owners.
Alerting runs through Azure Monitor. Action group ag-petanque-primary (email + Slack webhook) is provisioned via infrastructure/monitoring.bicep and applied with infrastructure/deploy-monitoring.sh, so the channel topology is in version control. Sev1 alert rules cover 5xx-rate > 1 % over 5 minutes, Cosmos RU throttling > 10 % over 5 minutes and ACS email bounce rate > 5 % over 15 minutes — failures that demand immediate attention. Sev2 covers Container App restarts > 3/h, API latency p95 > 2 s over 10 minutes and Stripe webhook errors > 5 % — degraded but not bleeding.
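Every rule above has the same shape: a ratio over an evaluation window compared against a threshold. A toy check illustrating that shape (the real evaluation happens inside Azure Monitor; this function and its name are only illustrative):

```python
def exceeds_rate_threshold(error_count: int, total_count: int, threshold_pct: float) -> bool:
    """Return True when errors/total over the window exceed the threshold.

    Mirrors the shape of the rules above (e.g. 5xx-rate > 1 % over 5 min).
    Azure Monitor does this server-side; this sketch is illustrative only.
    """
    if total_count == 0:
        # No traffic in the window: nothing to alert on.
        return False
    return (error_count / total_count) * 100 > threshold_pct


# Example: 120 5xx responses out of 10 000 requests in 5 minutes is 1.2 %,
# which crosses the 1 % Sev1 threshold.
fires = exceeds_rate_threshold(120, 10_000, 1.0)  # → True
```

The zero-traffic guard matters in the real rules too: a ratio alert on an idle service should stay silent rather than divide by zero or page on noise.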
The incident lifecycle is captured server-side in IncidentLog (a Beanie model) with severity, workflow status (investigating / identified / monitoring / resolved), public_message and an updates list. Platform admins create and update incidents through POST /admin/incidents and PATCH /admin/incidents/{id}; updates are append-only on the embedded list so the public timeline is never edited after the fact. GET /public/incidents?since=<iso> serves the last 30 days of public incidents, sorted newest-first, with only public fields exposed. The status page (www/src/pages/status.astro plus fr/es/sv variants) reads this endpoint and renders severity and workflow badges so users see the same picture the on-call sees, without authentication.
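The append-only lifecycle can be sketched with stdlib dataclasses standing in for the Beanie document; field and method names here are assumed for illustration, not taken from the codebase:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

WORKFLOW = ("investigating", "identified", "monitoring", "resolved")


@dataclass
class IncidentUpdate:
    status: str
    public_message: str
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class IncidentLog:
    """Stdlib stand-in for the Beanie document described above."""
    severity: str            # "sev1" | "sev2" | "sev3"
    public_message: str
    status: str = "investigating"
    updates: list = field(default_factory=list)

    def append_update(self, status: str, public_message: str) -> None:
        # Append-only: the embedded list only grows, so the public
        # timeline is never edited after the fact.
        if status not in WORKFLOW:
            raise ValueError(f"unknown workflow status: {status}")
        self.updates.append(IncidentUpdate(status, public_message))
        self.status = status
        self.public_message = public_message
```

A PATCH that tried to rewrite history would have nowhere to go: the only mutation path appends a new timestamped entry and moves the head-of-incident fields forward.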
Key capabilities
- Sev1/Sev2/Sev3 severity matrix with escalation paths, response times and SLA wording
- Documented on-call rotation with schedule, contact details and vacation handling
- Blameless postmortem template covering timeline, root cause, contributing factors and action list
- Azure Monitor action group provisioned via Bicep with email + Slack webhook delivery
- Sev1 alert rules: 5xx-rate, Cosmos RU throttling, ACS email bounce rate
- Sev2 alert rules: Container App restarts, API latency p95, Stripe webhook errors
- IncidentLog model with severity, workflow status, public_message and append-only updates list
- Platform-admin POST/PATCH /admin/incidents endpoints and public GET /public/incidents feed
- Multilingual status page (en/fr/es/sv) rendering live incidents with severity and workflow badges
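The public feed's filtering rules listed above (30-day window, newest-first, public fields only) can be sketched as a pure function over dicts standing in for the IncidentLog collection; the real endpoint queries Mongo via Beanie, and the names here are illustrative:

```python
from datetime import datetime, timedelta, timezone

# Assumed public projection; any internal fields are dropped before serving.
PUBLIC_FIELDS = ("severity", "status", "public_message", "updates", "created_at")


def public_incidents(incidents, since=None, window_days=30):
    """Shape of GET /public/incidents?since=<iso> (illustrative sketch).

    Returns incidents newer than `since` (default: last 30 days),
    sorted newest-first, restricted to public fields.
    """
    cutoff = since or datetime.now(timezone.utc) - timedelta(days=window_days)
    visible = [i for i in incidents if i["created_at"] >= cutoff]
    visible.sort(key=lambda i: i["created_at"], reverse=True)
    return [{k: i[k] for k in PUBLIC_FIELDS if k in i} for i in visible]
```

Because the projection is an allow-list rather than a deny-list, a new internal field added to the model later cannot leak onto the status page by accident.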
In practice
21:43 — 5xx-rate crosses 1 % over 5 minutes. Azure Monitor fires, the action group emails ops@ and posts to #ops-alerts. The Sev1 on-call acknowledges in three minutes, opens an IncidentLog with severity sev1, status investigating, public_message "We are investigating elevated error rates affecting login." The status page reflects it within seconds in four languages.
Twelve minutes later the root cause is identified: a misbehaving downstream OAuth provider. The on-call patches the timeout, updates the incident to status monitoring with a new public_message, then to resolved at 22:11. The next morning the Sev1 owner writes the blameless postmortem from the template: timeline pulled straight from the IncidentLog updates, action items filed back into the backlog with named owners.
Features of this subsystem
| ID | Status | Feature |
|---|---|---|
| F16.15.01 | Delivered | Sev1/Sev2/Sev3 severity matrix with escalation paths, response times and SLA wording (docs/engineering/operations/incident-response.md). ✅ PL-T053 |
| F16.15.02 | Delivered | On-call rotation documented (docs/engineering/operations/on-call-rotation.md) with schedule, contact details and vacation handling. ✅ PL-T053 |
| F16.15.03 | Delivered | Blameless postmortem template (docs/engineering/operations/postmortem-template.md) with timeline, root cause, contributing factors and action list. ✅ PL-T053 |
| F16.15.04 | Delivered | Azure Monitor action group ag-petanque-primary (email + Slack webhook) provisioned via infrastructure/monitoring.bicep. ✅ PL-T053 |
| F16.15.05 | Delivered | Sev1 alerts: 5xx rate > 1 % / 5 min, Cosmos RU throttling > 10 % / 5 min, ACS email bounce > 5 % / 15 min. ✅ PL-T053 |
| F16.15.06 | Delivered | Sev2 alerts: Container App restarts > 3/h, API latency p95 > 2 s / 10 min, Stripe webhook errors > 5 %. ✅ PL-T053 |
| F16.15.07 | Delivered | Deploy script infrastructure/deploy-monitoring.sh to apply the Bicep against Azure. ✅ PL-T053 |
| F16.15.08 | Delivered | IncidentLog Beanie model with severity (sev1/sev2/sev3), workflow status (investigating/identified/monitoring/resolved), public_message and updates list. ✅ PL-T053 |
| F16.15.09 | Delivered | GET /public/incidents?since=<iso> — returns the last 30 days of incidents, public fields only, sorted newest-first. ✅ PL-T053 |
| F16.15.10 | Delivered | POST /admin/incidents + PATCH /admin/incidents/{id} — platform-admin only, creates and updates incidents with severity classification. ✅ PL-T053 |
| F16.15.11 | Delivered | Status page (www/src/pages/status.astro + fr/es/sv variants) updated to fetch incident history from the IncidentLog collection via /public/incidents. Severity and workflow badges in the UI. ✅ PL-T053 |
Related subsystems
Stakeholders who need this subsystem
Appears in 1 stakeholder analysis