Operations
In brief
The day-to-day operational toolbelt: a cross-tenant job inventory with per-job drill-in and SLA widgets, manual run-now with audit and `manual_run=false` guards, cooperative cancel/retry, a webhook failures inbox, rate-limit overrides, global feature flags, migration status, targeted Redis cache purge, and per-tenant maintenance mode.
How it works
Operations is the screen on the second monitor while the first runs the dashboard. The job inventory enumerates every batch job registered in `craft-easy-jobs` with its schedule (cron or on-demand), owner, description, last run, last status, and next scheduled run; it spans all tenants and reuses the per-tenant view from PL-T096. Drilling into a job shows the last N runs with duration, exit code, log tail, triggering operator (when manual), and affected tenant when tenant-scoped, alongside a 30-day failure trend chart.
Manual run-now triggers any registered job with optional parameters (target tenant, dry-run, time range), requires a reason, and writes an audit row. It is hard-disabled for jobs marked `manual_run=false` in the registry, so daily-rollup jobs that are not safe to double-fire are protected at the contract level rather than by operator discipline. Cancel aborts an in-flight run only when the job implements a cooperative cancel hook, so a run is never killed in a half-finished state; retry replays a failed run with the same or adjusted parameters. Both actions are audit-logged.
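The run-now guard described above can be sketched roughly as follows. This is a minimal illustration, not the `craft-easy-jobs` implementation: `JobSpec`, `AuditLog`, `ManualRunDisabled`, and `run_now` are hypothetical names, and only the `manual_run` flag, the required reason, and the audit row come from the text. Whether refused attempts are also audited is not stated, so the sketch leaves them out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical registry entry; only `manual_run` is taken from the text."""
    name: str
    manual_run: bool = True


@dataclass
class AuditLog:
    """In-memory stand-in for the real audit store (an assumption)."""
    rows: list = field(default_factory=list)

    def write(self, **row) -> None:
        self.rows.append({"at": datetime.now(timezone.utc), **row})


class ManualRunDisabled(Exception):
    """Raised when a job is marked manual_run=false in the registry."""


def run_now(registry, audit, job_name, operator, reason, params=None):
    """Queue a manual run after checking the registry guard and the reason."""
    job = registry[job_name]  # KeyError for jobs that are not registered
    if not job.manual_run:
        # The guard lives in the job contract, not in operator discipline.
        raise ManualRunDisabled(f"{job.name} is marked manual_run=false")
    if not reason.strip():
        raise ValueError("a reason is required for manual runs")
    params = dict(params or {})
    audit.write(action="run_now", job=job.name, operator=operator,
                reason=reason, params=params)
    return {"job": job.name, "params": params, "status": "queued"}
```

Putting the check inside `run_now` rather than in the UI means any caller, including scripts, hits the same guard.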
The job-queue depth and SLA widget flags jobs running beyond their target duration or piling up in the queue, surfacing them in the dashboard `Attention` panel. The webhook failures inbox aggregates failed outbound webhooks across all tenants with replay, suppress, and mark-handled actions. Rate-limit overrides temporarily lift limits for a tenant or IP — used during migrations, backfills, or incident triage — and auto-expire so they do not become permanent backdoors.
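One way to make an override expire on its own is to store an explicit expiry timestamp and check it on every lookup. The sketch below assumes an in-memory store and hypothetical names (`RateLimitOverride`, `OverrideStore`, `lift`, `effective_limit`); the real storage and limit units are not described in the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RateLimitOverride:
    subject: str            # tenant ID or IP address
    limit_per_minute: int   # assumed unit; the text does not specify one
    expires_at: datetime


class OverrideStore:
    """Hypothetical in-memory store; expired overrides are dropped on read."""

    def __init__(self, default_limit: int) -> None:
        self._default = default_limit
        self._overrides: dict = {}

    def lift(self, subject: str, limit_per_minute: int,
             ttl: timedelta, now: datetime) -> RateLimitOverride:
        # An override without a positive lifetime would be a permanent backdoor.
        if ttl <= timedelta(0):
            raise ValueError("override must have a positive time-to-live")
        override = RateLimitOverride(subject, limit_per_minute, now + ttl)
        self._overrides[subject] = override
        return override

    def effective_limit(self, subject: str, now: datetime) -> int:
        override = self._overrides.get(subject)
        if override is None:
            return self._default
        if now >= override.expires_at:
            # Expired: remove it and fall back to the normal limit.
            del self._overrides[subject]
            return self._default
        return override.limit_per_minute
```

Passing `now` in explicitly keeps the expiry logic easy to test; a production version would more likely rely on the store's own TTL mechanism.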
The global feature-flag screen toggles platform-wide flags (as opposed to the per-tenant overrides in F21.05). Database migration status surfaces Beanie init state, pending index changes, and the last migration run timestamp so an operator can confirm a fresh deploy has settled. Cache invalidation does targeted Redis key or pattern purges, throttled to prevent a thundering herd.
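A throttled pattern purge might look like the sketch below: keys are found incrementally, deleted in small batches, and the loop pauses between batches so caches refill gradually instead of all at once. `purge_pattern`, the batch size, and the pause length are illustrative assumptions. The client only needs `scan_iter(match=..., count=...)` and `delete(*keys)`, which is the shape redis-py's `Redis` client exposes.

```python
import time
from typing import Callable, Iterable, Protocol


class KeyStore(Protocol):
    """Minimal client surface the purge needs (matches redis-py's Redis)."""

    def scan_iter(self, match: str, count: int) -> Iterable[str]: ...
    def delete(self, *keys: str) -> int: ...


def purge_pattern(client: KeyStore, pattern: str, batch_size: int = 500,
                  pause_seconds: float = 0.05,
                  sleep: Callable[[float], None] = time.sleep) -> int:
    """Delete keys matching `pattern` in batches, pausing between batches."""
    if pattern.strip() in {"", "*"}:
        # Refuse a purge of everything; the feature is meant to be targeted.
        raise ValueError("refusing to purge with an empty or catch-all pattern")
    deleted = 0
    batch = []
    for key in client.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            deleted += client.delete(*batch)
            batch.clear()
            sleep(pause_seconds)  # throttle so refills are spread out
    if batch:
        deleted += client.delete(*batch)  # final partial batch, no pause needed
    return deleted
```

Using incremental scanning rather than a single blocking key listing keeps the purge from stalling the Redis server on large keyspaces.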
Scheduled maintenance mode toggles a tenant into read-only or full-offline with a user-facing banner; the toggle is reversible and audited.
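As a rough illustration of the two maintenance levels, a request gate could allow only read-style HTTP methods in read-only mode and refuse everything in full-offline mode. The names `MaintenanceMode` and `gate_request`, the choice of safe methods, and the 503 status are assumptions made for this sketch; the text does not say how the platform enforces the mode.

```python
from enum import Enum
from typing import Optional


class MaintenanceMode(Enum):
    OFF = "off"
    READ_ONLY = "read_only"
    OFFLINE = "offline"


# HTTP methods treated as reads in this sketch (an assumption).
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


def gate_request(mode: MaintenanceMode, method: str) -> Optional[int]:
    """Return None if the request may proceed, else the status code to send."""
    if mode is MaintenanceMode.OFFLINE:
        return 503  # full-offline: every request is refused
    if mode is MaintenanceMode.READ_ONLY and method.upper() not in SAFE_METHODS:
        return 503  # read-only: writes are refused, reads pass through
    return None
```

Because the mode is a single per-tenant value, reversing the toggle is just setting it back to `MaintenanceMode.OFF`.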
Key features
- Cross-tenant job inventory with schedule, owner, last run, next run
- Per-job dashboard with last N runs, duration, exit code, log tail, 30-day failure trend
- Manual run-now with reason, audit, and `manual_run=false` registry guard
- Cooperative cancel and retry of in-flight or failed runs
- Job-queue depth and SLA widget feeding the dashboard attention panel
- Webhook failures inbox: replay, suppress, mark handled
- Auto-expiring rate-limit overrides per tenant or IP
- Global feature-flag toggles with audit (tenant overrides live in F21.05)
- Migration status and last-run timestamps; throttled Redis cache purge
- Per-tenant maintenance mode (read-only or offline) with user-facing banner
In practice
An operator notices the dashboard attention panel flagging two failed jobs. He opens the job inventory, drills into the first failure (`weekly-license-renewal-reminder`), reads the log tail, and finds a transient SendGrid 502. He clicks `Retry`, the job completes in 18 seconds, and the failure trend chart drops to zero.
The second failure is `daily-mrr-rollup`, marked `manual_run=false`; he cannot replay it from the UI, so he files an incident and pages finance about the downstream impact. While there, he opens the webhook failures inbox, replays four payment-confirmation webhooks for a tenant that lost connectivity overnight, and lifts the tenant's rate limit for one hour to drain the backlog. The override carries an explicit expiry so it disappears on its own.
Features in this subsystem
| ID | Status | Features |
|---|---|---|
| F21.08.01 | Delivered | Job inventory: every batch job registered in craft-easy-jobs is listed with its schedule (cron or on-demand), owner, description, last run, last status, and next scheduled run. Cross-tenant scope. Builds on the PL-T096 admin job view; the sys version spans all tenants. ✅ PL-T128 |
| F21.08.02 | Delivered | Per-job status dashboard: drill into one job to see the last N runs with duration, exit code, log tail, triggering user (if manual), and affected tenant (if tenant-scoped). Failure trend chart over 30 days. Implemented |
| F21.08.03 | Delivered | Manual run-now: one-click trigger of a registered job with optional parameters (target tenant, dry-run flag, time range). Requires a reason and an audit entry. Disabled for jobs marked manual_run=false in the registry (e.g. daily-rollup jobs that are not safe to double-fire). Implemented |
| F21.08.04 | Delivered | Cancel / retry: abort an in-flight job (if the job implements a cooperative cancel hook) and retry a failed run with the same or adjusted parameters. Both actions are audit-logged. Implemented |
| F21.08.05 | Delivered | Job queue depth + SLA widget: flags jobs running beyond target duration or piling up in the queue. Implemented |
| F21.08.06 | Delivered | Webhook failures inbox: failed outbound webhooks across all tenants. Replay, suppress, or mark handled. Implemented |
| F21.08.07 | Delivered | Rate-limit overrides: temporarily lift limits for a specific tenant or IP (during migration, backfill, or incident). Auto-expire. Implemented |
| F21.08.08 | Delivered | Feature flags (global): toggle platform-wide flags with audit. Implemented |
| F21.08.09 | Delivered | Database migrations status: list Beanie init state, pending index changes, and last migration run timestamp. Implemented |
| F21.08.10 | Delivered | Cache invalidation: targeted Redis key/pattern purge. Throttled. Implemented |
| F21.08.11 | Delivered | Scheduled maintenance mode: toggle a tenant into read-only or full-offline, with user-facing banner. Implemented |