Sys Dashboard
In summary
The console landing page that compresses platform health into one screen: KPI tiles, an attention panel, top-5 lists, a personal shortcut pane, a tenant×service system map, a live PlatformEvent firehose with pause/pin, and a correlation engine that suggests incidents the moment errors cluster.
How it works
The dashboard is the first screen any sys role lands on, so it is engineered for low-noise glance-ability and one-click drill-in. KPI tiles (active tenants, DAU/MAU, MRR, 30-day uptime, open incidents, queued jobs) are served by `GET /sys/dashboard/kpis` with a 60-second cache and route directly into their detail views; sparklines render where useful (DAU/MAU carries `[DAU, MAU]`). The `Attention needed` panel aggregates four fail-safe rows (open support tickets, unmatched payments, failed jobs in the last 24 hours, stale incidents older than 24 hours), each carrying a severity and a `resolve_href` so an operator can click straight into the queue.
Each attention-panel aggregation treats a missing collection as zero, so a gap in the data never blocks the page. Top-5 cards list the most active tenants by 7-day DAU, highest MRR, most 5xx errors over 24 hours, and longest job runtimes; an empty list renders its card in a `no data yet` state so the grid never reflows. A personal shortcut pane backed by `SysUserPreferences.dashboard_shortcuts` lets each operator pin 0–8 deep links; PUT semantics are full-replace, hrefs must start with `/`, and reordering uses RN-compatible up/down controls rather than HTML5 drag-and-drop.
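The shortcut rules above (0–8 items, full-replace PUT, hrefs starting with `/`, up/down reorder) can be sketched as follows. This is a minimal illustration, not the actual handler: the `Shortcut` shape with a `label` field, the function names, and the example routes are assumptions, and only the `SysShortcutInvalidHref` name comes from the feature table.

```python
from dataclasses import dataclass
from typing import Dict, List

MAX_SHORTCUTS = 8  # the pane holds 0-8 pinned links


class SysShortcutInvalidHref(ValueError):
    """Error name taken from the feature table; this class is a hypothetical stand-in."""


@dataclass(frozen=True)
class Shortcut:
    label: str  # assumed field; the source does not describe the item shape
    href: str


def replace_shortcuts(payload: List[Dict[str, str]]) -> List[Shortcut]:
    """Full-replace PUT: the validated payload becomes the entire stored list."""
    if len(payload) > MAX_SHORTCUTS:
        raise ValueError(f"at most {MAX_SHORTCUTS} shortcuts allowed")
    shortcuts = []
    for item in payload:
        href = item["href"]
        # Internal sys routes only: the href must start with "/".
        if not href.startswith("/"):
            raise SysShortcutInvalidHref(href)
        shortcuts.append(Shortcut(label=item["label"], href=href))
    return shortcuts


def move(shortcuts: List[Shortcut], index: int, delta: int) -> List[Shortcut]:
    """Up/down reorder: swap with the neighbour; a no-op at either edge."""
    target = index + delta
    in_range = 0 <= index < len(shortcuts) and 0 <= target < len(shortcuts)
    if delta not in (-1, 1) or not in_range:
        return list(shortcuts)
    reordered = list(shortcuts)
    reordered[index], reordered[target] = reordered[target], reordered[index]
    return reordered
```

Because the PUT is full-replace, a client that reorders with `move` would send the whole reordered list back rather than a patch.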
The system map renders a tenant × service health matrix with cells coloured green/yellow/red from incidents, 5xx rate, service status, and latency vs SLA; a zoom-out mode collapses tenants to a single platform-health row, and each cell drills into the tenant page, the firehose pre-filtered to that cell, or the open incident. The PlatformEvent firehose is the showpiece: an SSE stream over a 30-day TTL time-series collection, filterable by service, severity, tenant, and free text, with pause/resume, pin-for-later (cap 20), and a bounded in-process broker that drops events on overflow rather than blocking. It mounts default-paused, so an operator who walks away does not leave a stream flooding the network.
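The drop-on-overflow behaviour of the bounded broker can be illustrated with a short sketch. The class name, the per-subscriber queue design, the default capacity, and the choice to drop the newest event for a slow subscriber are all assumptions made for illustration; the source only states that the broker is bounded, in-process, and drops instead of blocking.

```python
import queue
from typing import List


class BoundedBroker:
    """Hypothetical in-process fan-out: one bounded queue per subscriber."""

    def __init__(self, capacity: int = 256) -> None:
        self._capacity = capacity  # assumed value; the real bound is not documented
        self._subscribers: List[queue.Queue] = []
        self.dropped = 0  # count of events discarded for slow subscribers

    def subscribe(self) -> queue.Queue:
        q: queue.Queue = queue.Queue(maxsize=self._capacity)
        self._subscribers.append(q)
        return q

    def publish(self, event: dict) -> None:
        for q in self._subscribers:
            try:
                # put_nowait never blocks the publisher.
                q.put_nowait(event)
            except queue.Full:
                # Assumption: the new event is dropped for this subscriber only.
                self.dropped += 1
```

The trade-off is that a slow consumer loses events rather than stalling every other consumer and the publisher.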
A rolling 60-second correlation window watches each service; more than ten error or critical events in the window emits an `incident_suggest` frame into the firehose with a one-click `Create incident` button that pre-populates the wizard.
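A rolling per-service window of this kind can be sketched as below. This is one possible reading, not the shipped engine: the class and method names, the frame fields other than the `incident_suggest` type, the assumption that timestamps arrive in non-decreasing order, and the choice to reset the window after a suggestion (to avoid repeating it on every later event) are all hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional

WINDOW_SECONDS = 60.0
THRESHOLD = 10  # a suggestion fires on MORE than ten events in the window


class CorrelationEngine:
    def __init__(self) -> None:
        # One deque of event timestamps per service.
        self._windows: Dict[str, Deque[float]] = defaultdict(deque)

    def observe(self, service: str, severity: str, ts: float) -> Optional[dict]:
        """Record one event; return an incident_suggest frame if errors cluster."""
        if severity not in ("error", "critical"):
            return None
        window = self._windows[service]
        window.append(ts)
        # Evict timestamps that have fallen out of the rolling window.
        while window and window[0] <= ts - WINDOW_SECONDS:
            window.popleft()
        if len(window) > THRESHOLD:
            count = len(window)
            window.clear()  # assumption: reset so one cluster yields one suggestion
            return {"type": "incident_suggest", "service": service, "count": count}
        return None
```

With these settings, ten errors inside a minute stay silent and the eleventh produces the frame.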
Key capabilities
- KPI tiles (tenants, DAU/MAU, MRR, uptime, incidents, queued jobs) with sparklines and drill-in
- Attention panel: tickets, unmatched payments, failed jobs, stale incidents — fail-safe aggregations
- Top-5 lists for tenants, revenue, 5xx errors, longest job runtimes
- Personal pinned shortcuts (0–8) with full-replace PUT and internal-only href validation
- System map: tenant × service health matrix with zoom-out and contextual drill-in
- PlatformEvent SSE firehose with filter, pause/resume, pin-for-later, bounded broker
- Correlation engine emits `incident_suggest` and pre-populates the incident wizard
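The fail-safe aggregations behind the attention panel can be pictured with a small sketch. Everything named here is illustrative: the row keys, the fixed severity per row, the `resolve_href` routes, and the idea of passing counters as callables are assumptions, since the source only says that each row carries a severity and a `resolve_href` and that missing collections count as zero.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical row definitions: key -> (severity, resolve_href).
ROWS: Dict[str, Tuple[str, str]] = {
    "open_support_tickets": ("warning", "/sys/support/tickets"),
    "unmatched_payments": ("warning", "/sys/billing/unmatched"),
    "failed_jobs_24h": ("error", "/sys/jobs?status=failed"),
    "stale_incidents_24h": ("error", "/sys/incidents?stale=1"),
}


def safe_count(counter: Callable[[], int]) -> int:
    """Run one aggregation; any failure (e.g. a missing collection) counts as 0."""
    try:
        return counter()
    except Exception:
        return 0


def attention_rows(counters: Dict[str, Callable[[], int]]) -> List[dict]:
    """Always return all four rows, even when a counter is absent or raises."""
    rows = []
    for key, (severity, href) in ROWS.items():
        counter = counters.get(key, lambda: 0)
        rows.append({
            "key": key,
            "count": safe_count(counter),
            "severity": severity,
            "resolve_href": href,
        })
    return rows
```

Swallowing every exception keeps the page rendering, at the cost of making a broken aggregation look like an empty queue, so a real implementation would probably log the failure as well.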
In practice
A sys engineer starts her shift and lands on the dashboard. The KPI strip shows uptime at 99.94 percent and the attention panel surfaces three failed jobs and two stale incidents. She clicks the failed-jobs row, retries one, and reassigns another.
Back on the dashboard the system map cell for `api × tenant-fr` flashes yellow; she zooms in, sees a 5xx spike, and opens the firehose pre-filtered to that tenant. Two minutes later the correlation engine emits an `incident_suggest` frame: twelve critical events in 60 seconds. She clicks `Create incident`, the wizard opens with affected services and tenant pre-filled, and she pages on-call within ten seconds of the cluster forming.
Features of this subsystem
| ID | Status | Features |
|---|---|---|
| F21.04.01 | Delivered | KPI tiles: active tenants, DAU/MAU, MRR, 30-day uptime %, open incidents, queued jobs. Each tile is clickable and routes to the detail view. Sparkline where useful (DAU/MAU carries [DAU, MAU]). ✅ PL-T124 |
| F21.04.02 | Delivered | Realtime activity feed: 100-event backfill from sys_audit_entries + live pushes via in-process ActivityBroker, served as SSE with 15-s heartbeat. Event schema: {id, action, actor_id, target_type, target_id, tenant_id, timestamp, metadata}. Implemented (PL-T124) |
| F21.04.03 | Delivered | "Attention needed" panel: four aggregated rows (open support tickets, unmatched payments, failed jobs 24 h, stale incidents > 24 h) with severity + resolve_href. Aggregations are fail-safe: missing collections return 0. Implemented (PL-T124) |
| F21.04.04 | Delivered | Top-5 lists: most active tenants (7 d DAU), highest revenue (MRR), most 5xx errors (24 h), longest job runtimes (24 h). Empty lists render the card in a "no data yet" state so the grid stays stable. Implemented (PL-T124) |
| F21.04.05 | Delivered | Personal shortcut pane (0–8 items) backed by SysUserPreferences.dashboard_shortcuts. Full-replace PUT semantics; hrefs must start with / (internal sys routes only) or the API returns 400 SysShortcutInvalidHref. ↑/↓ reorder in-place (RN-compatible alternative to HTML5 DnD). Implemented (PL-T124) |
| F21.04.06 | Delivered | System map: tenant × service health matrix with live green/yellow/red cell colouring (incidents, 5xx rate, service status, latency vs SLA). Zoom-out mode collapses tenants to a single platform-health row. Cells drill into tenant, firehose (pre-filtered), or open incident. ✅ PL-T146 |
| F21.04.07 | Delivered | Platform-event firehose: SSE stream over PlatformEvent (time-series, 30 d TTL). Filters by service/severity/tenant/search. Pause/resume, pin-for-later (cap 20), bounded in-process broker with drop-on-overflow. Default-paused on mount. ✅ PL-T146 |
| F21.04.08 | Delivered | Correlation engine: rolling 60 s window per service; > 10 error/critical events emits an incident_suggest frame into the firehose with a one-click "Create incident" button that pre-populates the wizard. ✅ PL-T146 |
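Row F21.04.02 describes an SSE feed with a 15-second heartbeat and a fixed event schema. The wire format for such a stream might look like the sketch below. The framing itself follows the standard Server-Sent Events text format, but the function names, the use of a comment line as the heartbeat, and the use of a named `incident_suggest` event are assumptions rather than documented behaviour.

```python
import json
from typing import Optional

HEARTBEAT_SECONDS = 15  # interval from the feature table; the scheduling loop is not shown


def sse_frame(payload: dict, event: Optional[str] = None) -> str:
    """Serialise one event as an SSE frame terminated by a blank line."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    # json.dumps escapes newlines, so the payload always fits on one data line.
    lines.append("data: " + json.dumps(payload, separators=(",", ":")))
    return "\n".join(lines) + "\n\n"


def sse_heartbeat() -> str:
    """A comment line (leading colon) keeps the connection alive; clients ignore it."""
    return ": heartbeat\n\n"


# Example audit entry using the schema fields from the table; the values are made up.
entry = {
    "id": "a1",
    "action": "tenant.suspend",
    "actor_id": "u1",
    "target_type": "tenant",
    "target_id": "t9",
    "tenant_id": "t9",
    "timestamp": "2024-01-01T00:00:00Z",
    "metadata": {},
}
frame = sse_frame(entry)
```

A server would write `sse_heartbeat()` whenever `HEARTBEAT_SECONDS` pass without an event, so that proxies and browsers do not close an idle connection.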