Disaster Recovery & Restore Drill
In brief
A documented disaster-recovery plan targets RPO 4 h / RTO 8 h for Cosmos DB. It is backed by an automated monthly restore drill (GitHub Actions cron) that exercises point-in-time restore on real Azure resources, smoke-test verification that restored document counts match production within ±1 %, a dedicated drill service principal with a least-privilege custom role, and rotation runbooks for every critical credential, including Key Vault, Stripe and Azure Communication Services (ACS).
How it works
The DR plan in docs/engineering/architecture/07a-disaster-recovery.md is a scenario matrix covering data corruption, region-down, credential leak, ransomware and delete-by-mistake. Each scenario has a detection → containment → recovery → post-mortem flow and the exact az cosmosdb mongodb database restore commands that the on-call engineer needs at 3 a.m. RPO is 4 hours, RTO is 8 hours, and the plan is reviewed against drill outcomes recorded in infrastructure/drill-history.md.
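The monthly restore drill runs as a GitHub Actions workflow on cron 0 4 1-7 * 1, intended as the first Monday of each month at 04:00 UTC. Standard cron treats a restricted day-of-month and a restricted day-of-week as alternatives (either may match), so that expression on its own also matches every Monday and every day from the 1st to the 7th; the first-Monday behaviour therefore presumably relies on a date check inside the job. The workflow uses Cosmos DB's point-in-time restore to create petanque-drill-<timestamp>, runs tools/dr-drill-smoke.py over the recovered database, compares document counts in tenants, users, licenses, invoices and tenant_subscriptions against production within a ±1 % tolerance, deletes the temporary database and posts a Slack report with PASS/FAIL. Exit codes are explicit (0 PASS, 1 FAIL, 2 prereq-fail), so an ops dashboard or PagerDuty integration can tell a failed verification apart from a drill that could not start.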
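The count comparison and exit-code mapping described above can be sketched roughly as follows. This is a minimal illustration, not the contents of tools/dr-drill-smoke.py: the function names, the handling of empty collections and the choice to treat a missing count as a prerequisite failure are all assumptions.

```python
# Collections named in the drill description.
COLLECTIONS = ["tenants", "users", "licenses", "invoices", "tenant_subscriptions"]
TOLERANCE = 0.01  # ±1 %

EXIT_PASS, EXIT_FAIL, EXIT_PREREQ = 0, 1, 2


def relative_delta(prod: int, restored: int) -> float:
    """Relative difference of the restored count from the production count."""
    if prod == 0:
        # Assumption: an empty production collection only matches an empty restore.
        return 0.0 if restored == 0 else float("inf")
    return abs(restored - prod) / prod


def evaluate(prod_counts: dict, restored_counts: dict) -> int:
    """Map two {collection: count} dicts to a drill exit code."""
    # Assumption: a count that could not be obtained is a prerequisite failure.
    for name in COLLECTIONS:
        if name not in prod_counts or name not in restored_counts:
            return EXIT_PREREQ
    # Any collection outside the tolerance fails the whole drill.
    for name in COLLECTIONS:
        if relative_delta(prod_counts[name], restored_counts[name]) > TOLERANCE:
            return EXIT_FAIL
    return EXIT_PASS
```

A wrapper script would gather the two count dictionaries from the production and restored databases and pass the result of evaluate to sys.exit, so the workflow step fails on any non-zero code.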
The drill runs under sp-petanque-dr-drill, a dedicated Azure service principal with a custom role, Petanque DR Drill Operator, scoped to exactly the Cosmos DB permissions needed for restore and read, and nothing more. It is created through infrastructure/scripts/create-dr-drill-sp.sh so re-provisioning is reproducible. The drill never touches production data; the restored database is named with a timestamp and torn down at the end of every run.
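One way to guarantee that the timestamped database is torn down at the end of every run, even when verification raises, is a try/finally around the verify step. The sketch below is an assumption about structure only: restore, verify and teardown are hypothetical callables standing in for the real Azure operations, and the date-only name format is inferred from the petanque-drill-2026-04-06 example.

```python
from datetime import datetime, timezone
from typing import Callable, Optional


def drill_database_name(now: datetime) -> str:
    """Build the temporary database name (timestamp format is an assumption)."""
    return "petanque-drill-" + now.strftime("%Y-%m-%d")


def run_drill(
    restore: Callable[[str], None],   # hypothetical: performs the point-in-time restore
    verify: Callable[[str], int],     # hypothetical: runs the smoke test, returns an exit code
    teardown: Callable[[str], None],  # hypothetical: deletes the temporary database
    now: Optional[datetime] = None,
) -> int:
    """Restore, verify, and always tear down the temporary database."""
    now = now or datetime.now(timezone.utc)
    name = drill_database_name(now)
    restore(name)
    try:
        return verify(name)
    finally:
        # Runs whether verify returned normally or raised.
        teardown(name)
```

Passing the three operations in as callables keeps the control flow testable without touching Azure; in the real workflow they would wrap the corresponding CLI or SDK calls.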
Rotation of key credentials is covered by documented runbooks. Key Vault soft-delete and purge protection are enabled, and the az keyvault secret backup and az keyvault secret restore procedures are captured. Stripe key rotation after a credential leak is az containerapp secret set followed by az containerapp update, so the API picks up the new value without dropping in-flight requests. Azure Communication Services connection-string rotation follows the same pattern, starting with az communication regenerate-key. Every drill run updates the RPO/RTO baseline, so the team can see drift over time and renegotiate the SLO based on measured data rather than an aspirational spec.
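As an illustration of the baseline update, the sketch below formats one drill result as a table row and computes the remaining headroom against the 8-hour RTO. The row layout and the DrillResult fields are hypothetical; the actual format of infrastructure/drill-history.md is not described here.

```python
from dataclasses import dataclass

RTO_MINUTES = 8 * 60  # documented RTO of 8 hours


@dataclass
class DrillResult:
    date: str              # drill date, e.g. "2026-04-06"
    total_minutes: int     # restore + verify time
    max_delta_pct: float   # largest count delta observed, in percent
    passed: bool


def rto_headroom(result: DrillResult) -> int:
    """Minutes left between the measured restore time and the RTO."""
    return RTO_MINUTES - result.total_minutes


def history_row(result: DrillResult) -> str:
    """Format one row for a drill-history table (column layout is an assumption)."""
    status = "PASS" if result.passed else "FAIL"
    return (
        f"| {result.date} | {status} | {result.total_minutes} min "
        f"| {result.max_delta_pct:.1f} % | {rto_headroom(result)} min |"
    )
```

Tracking headroom rather than only the raw duration makes drift visible: a shrinking number over successive drills signals that the RTO deserves another look before it is breached.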
Key capabilities
- DR runbook with scenario matrix (data corruption, region down, credential leak, ransomware, delete-by-mistake)
- Documented RPO 4 h / RTO 8 h with exact az cosmosdb mongodb database restore commands
- Automated monthly restore drill via GitHub Actions cron (first Monday 04:00 UTC)
- Smoke-test script comparing tenants, users, licenses, invoices and tenant_subscriptions counts against production within ±1 %
- Dedicated sp-petanque-dr-drill service principal with custom minimum-permission role
- Documented Key Vault backup, Stripe key rotation and ACS connection-string rotation runbooks
- RPO/RTO baseline auto-updated in infrastructure/drill-history.md after every drill
In practice
First Monday of the month, 04:00 UTC. The DR drill workflow kicks off, restores petanque-drill-2026-04-06 from a PITR point exactly four hours back, runs the smoke-test script and posts to Slack: PASS, deltas within 0.3 %, total restore + verify time 47 minutes, well inside the 8-hour RTO. The drill-history file updates and the temporary database is torn down before breakfast.
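The timing in this scenario can be checked with a few lines of date arithmetic. The sketch below assumes the restore point is simply the run time minus the 4-hour RPO; the helper names are illustrative. A first-Monday check like this is also one way for a job to skip runs when its cron expression matches more days than intended.

```python
from datetime import datetime, timedelta, timezone

RPO = timedelta(hours=4)  # documented RPO


def is_first_monday(moment: datetime) -> bool:
    """True when the date is a Monday falling on day 1 to 7 of its month."""
    return moment.weekday() == 0 and 1 <= moment.day <= 7


def restore_point(run_time: datetime) -> datetime:
    """Point-in-time restore target: one RPO window before the run (assumption)."""
    return run_time - RPO


run = datetime(2026, 4, 6, 4, 0, tzinfo=timezone.utc)
print(is_first_monday(run))            # True
print(restore_point(run).isoformat())  # 2026-04-06T00:00:00+00:00
```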
Three weeks later a junior engineer accidentally drops a collection on staging. The on-call engineer opens the runbook, picks delete-by-mistake, runs the documented restore command with a restore timestamp seven minutes before the drop, swaps the collection name back in, and the data is whole again. It works because they have done this exact dance, on real Azure resources, every month for a year.
Features of this subsystem
| ID | Status | Feature |
|---|---|---|
| F16.14.01 | Delivered | DR runbook with scenario matrix (data-corruption, region-down, credential-leak, ransomware, delete-by-mistake). Per scenario: detection → containment → recovery → post-mortem. Exact az cosmosdb mongodb database restore commands. RPO 4 h / RTO 8 h. ✅ PL-T052 |
| F16.14.02 | Delivered | Automated monthly restore drill (GitHub Actions cron 0 4 1-7 * 1). Creates petanque-drill-<timestamp> via PITR, runs smoke queries, compares counts against prod (±1 %), deletes the temp DB, sends a Slack report. ✅ PL-T052 |
| F16.14.03 | Delivered | Smoke-test script (tools/dr-drill-smoke.py): queries tenants, users, licenses, invoices, tenant_subscriptions. Delta limit ±1 %. Exit code 0 = PASS, 1 = FAIL, 2 = prereq failure. ✅ PL-T052 |
| F16.14.04 | Delivered | Service principal sp-petanque-dr-drill with custom role Petanque DR Drill Operator (minimal Cosmos DB permissions). Created via infrastructure/scripts/create-dr-drill-sp.sh. ✅ PL-T052 |
| F16.14.05 | Delivered | Key Vault backup routine documented (soft-delete + purge-protection, az keyvault secret backup/restore). ✅ PL-T052 |
| F16.14.06 | Delivered | Stripe key rotation procedure documented (az containerapp secret set + az containerapp update on credential leak). ✅ PL-T052 |
| F16.14.07 | Delivered | ACS connection-string rotation procedure documented (az communication regenerate-key + Container App update). ✅ PL-T052 |
| F16.14.08 | Delivered | RPO/RTO baseline in infrastructure/drill-history.md, updated automatically after every drill run. ✅ PL-T052 |