Platform Status & Transparency
In brief
The public status page is petanque.life's transparency commitment: real health checks against the API, DB, Redis, Auth and SWA components, a 60-second background collector with 90-day rolling uptime history, admin-published incidents with timeline updates, and a 30-second client refresh with an honest fail-red fallback when the status service itself is unreachable. It is served in four languages and linked from every footer.
How it works
Trust in a B2B/B2G platform survives outages only when the operator is transparent during them. The status page solves this by going beyond the typical green-square theatre. A background job, CollectStatusSamplesJob, runs every 60 seconds and probes each component the platform depends on: the API health endpoint, a MongoDB ping, a Redis ping, the Auth service and the static web apps (SWA).
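One tick of the collector can be sketched as below. This is an illustrative sketch, not the actual petanque.life code: the `HealthSample` shape, the `Probe` signature and the degraded-latency threshold are all assumptions.

```typescript
// Hypothetical sketch of one CollectStatusSamplesJob tick.
// Field names and the 1s "degraded" threshold are assumptions.
type ComponentStatus = "operational" | "degraded" | "down";

interface HealthSample {
  component: string;
  status: ComponentStatus;
  latencyMs: number;
  at: Date;
}

// A probe resolves when the component answers, rejects when it doesn't.
type Probe = () => Promise<void>;

const DEGRADED_MS = 1000; // assumed latency threshold for "degraded"

async function collectSample(component: string, probe: Probe): Promise<HealthSample> {
  const start = Date.now();
  try {
    await probe();
    const latencyMs = Date.now() - start;
    return {
      component,
      status: latencyMs > DEGRADED_MS ? "degraded" : "operational",
      latencyMs,
      at: new Date(),
    };
  } catch {
    // Probe rejected or timed out: record the component as down.
    return { component, status: "down", latencyMs: Date.now() - start, at: new Date() };
  }
}

// One tick of the 60-second job: probe every component in parallel.
async function collectAllSamples(probes: Record<string, Probe>): Promise<HealthSample[]> {
  return Promise.all(
    Object.entries(probes).map(([name, probe]) => collectSample(name, probe)),
  );
}
```

Running the probes in parallel keeps one slow component from delaying the others' samples within the 60-second budget.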
Each probe writes a ServiceHealthSample document with status, latency and timestamp into a rolling-window collection with a 90-day TTL — long enough for SLA reporting, short enough to stay cheap. The visible /status page reads aggregated samples and renders per-service uptime over the last 24 hours, 7 days and 90 days, plus current state. Operators can publish StatusIncident records from the sys console with a title, severity, affected services and a timeline of updates — these render at the top of the status page during active incidents and below the fold afterwards as historical post-mortems.
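The per-window uptime figure is a straightforward aggregation over the stored samples. A minimal sketch, assuming one sample per minute and counting "degraded" as up (only "down" reduces uptime) — both of which are assumptions, as is the choice to report 100% for an empty window:

```typescript
// Sketch of trailing-window uptime aggregation. Field names mirror the
// ServiceHealthSample described above but are otherwise assumptions.
interface Sample {
  status: "operational" | "degraded" | "down";
  at: Date;
}

// Uptime over a trailing window: the share of samples in the window that
// were not "down". With one sample per minute, 24h ≈ 1440 samples.
function uptimePercent(samples: Sample[], windowMs: number, now = new Date()): number {
  const cutoff = now.getTime() - windowMs;
  const inWindow = samples.filter((s) => s.at.getTime() >= cutoff);
  if (inWindow.length === 0) return 100; // design choice: no data reads as up
  const up = inWindow.filter((s) => s.status !== "down").length;
  return Math.round((up / inWindow.length) * 10_000) / 100; // two decimals
}
```

The same function serves all three windows (24h, 7d, 90d) by varying `windowMs`, and the 90-day TTL guarantees the collection never holds more history than the widest window needs.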
The client-side fetcher refreshes every 30 seconds without a page reload so visitors always see fresh state, and — critically — if the fetch times out the UI fails red ('We can't reach the status service either') rather than silently showing stale green. The page is fully translated into EN/FR/ES/SV and is linked from the footer of every marketing page and from inside the admin app, so any user, customer or prospect who suspects an outage gets a same-second answer without escalating to support.
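The fail-red behaviour hinges on the fetcher never leaving the UI in an ambiguous state. A browser-side sketch — the endpoint path, timeout value and state names are assumptions:

```typescript
// Hypothetical client-side poller: every outcome maps to an explicit state,
// so a dead status service can never masquerade as green.
type StatusView =
  | { kind: "ok"; payload: unknown }
  | { kind: "unreachable" }; // render red: "We can't reach the status service either"

async function fetchStatus(timeoutMs = 5000): Promise<StatusView> {
  const ctrl = new AbortController();
  const timer = setTimeout(() => ctrl.abort(), timeoutMs);
  try {
    const res = await fetch("/api/status", { signal: ctrl.signal });
    if (!res.ok) return { kind: "unreachable" }; // HTTP errors also fail red
    return { kind: "ok", payload: await res.json() };
  } catch {
    return { kind: "unreachable" }; // timeout or network failure: fail red, never stale green
  } finally {
    clearTimeout(timer);
  }
}

// Poll every 30s without a page reload; render once immediately on load.
function startPolling(render: (v: StatusView) => void): void {
  const tick = () => fetchStatus().then(render);
  tick();
  setInterval(tick, 30_000);
}
```

Treating non-2xx responses and timeouts identically is what makes the fallback "honest": the renderer only ever sees fresh `ok` data or an explicit `unreachable`.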
Key capabilities
- Real health checks against API, DB, Redis, Auth and SWA components
- 60-second background collector job recording rolling samples (90-day TTL)
- Per-service uptime aggregated over 24h, 7d and 90d windows
- Admin-publishable incidents with severity, affected services and update timeline
- 30-second client-side auto-refresh without a page reload
- Honest fail-red fallback when the status service itself is unreachable
- Linked from every marketing footer and inside the admin app, EN/FR/ES/SV
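The 90-day rolling window in the list above is typically implemented in MongoDB as a TTL index, so expired samples are purged automatically instead of by a cleanup job. A configuration sketch under assumed database, collection and field names:

```typescript
// Assumed setup for the rolling window: a TTL index on the sample timestamp
// expires documents 90 days after "at". Names are illustrative, not the
// actual petanque.life schema.
import { MongoClient } from "mongodb";

const NINETY_DAYS_SECONDS = 90 * 24 * 60 * 60;

async function ensureSampleTtlIndex(uri: string): Promise<void> {
  const client = new MongoClient(uri);
  await client.connect();
  try {
    await client
      .db("petanque")
      .collection("serviceHealthSamples")
      .createIndex({ at: 1 }, { expireAfterSeconds: NINETY_DAYS_SECONDS });
  } finally {
    await client.close();
  }
}
```

MongoDB's TTL monitor removes expired documents in the background, which is what keeps the collection "long enough for SLA reporting, short enough to stay cheap" without any application-side pruning.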
In practice
A federation IT contact gets a Slack ping that 'Petanque Life is down' from a club admin trying to issue a license. He opens petanque.life/status from his bookmarks. The page shows API: degraded latency, DB: operational, Auth: operational, with an active incident published 4 minutes ago: 'Elevated API latency in EU-North — investigating.' Two updates follow within the next 10 minutes.
He posts the status URL to his federation's Slack channel and the noise stops — everyone sees the same official source rather than competing speculation. The full incident closes with a post-mortem 90 minutes later, which becomes part of the visible history.
Features of this subsystem
| ID | Status | Features |
|---|---|---|
| F19.12.01 | Delivered | Live status page with real health checks (API, DB, Redis, Auth, SWA), 60s-interval background collector (CollectStatusSamplesJob), ServiceHealthSample rolling-window uptime (90d TTL), StatusIncident admin reporting, client-side fetch with 30s auto-refresh and honest fail-red fallback on timeout, 4 languages ✅ PL-T045 |