Alert Runbooks
Use this page as the first-response index for production alerts. It is intentionally short: confirm the signal, identify blast radius, inspect the most recent deploy or dependency change, and only then move into service-specific remediation.
If an alert is caused by monitoring drift rather than a service failure, fix the metric or exporter path first so the alert pipeline remains trustworthy.
TrackingServiceDown
- Confirm up{job="apexmail-tracking"} is 0 for the target and not just a scrape timeout.
- Check tracking service restarts, recent deploys, and service discovery or proxy changes affecting the tracking endpoint.
- Inspect tracking logs before rollback so you can distinguish process crash, config failure, and network reachability issues.
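
To separate a sustained outage from an intermittent scrape failure, it can help to look at the recent run of up samples rather than only the latest value. The following is a minimal Python sketch, assuming the samples have already been pulled from the metrics store (oldest first). The function name and the min_consecutive threshold are illustrative and not part of any existing tooling.

```python
from typing import Sequence


def classify_up_samples(samples: Sequence[int], min_consecutive: int = 3) -> str:
    """Classify recent `up` samples (oldest first, each 0 or 1).

    Returns "down" when the last `min_consecutive` samples are all 0,
    "flapping" when zeros appear but the trailing run is shorter than that,
    "up" when no zeros are present, and "no-data" for an empty window.
    """
    if not samples:
        return "no-data"
    # Count the trailing run of failed scrapes.
    trailing_zeros = 0
    for value in reversed(samples):
        if value != 0:
            break
        trailing_zeros += 1
    if trailing_zeros >= min_consecutive:
        return "down"
    if 0 in samples:
        return "flapping"
    return "up"
```

A "flapping" result points toward scrape timeouts or network jitter, while "down" suggests the checks in the list above are worth running in full.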
HighApiLatency
- Verify the latency jump with request volume, error rate, and dependency health rather than treating p95 in isolation.
- Check for slow database queries, queue pressure, or recent feature flags that expanded request work.
- If saturation is real, scale or shed load before debugging individual slow handlers.
HighApiErrorRate
- Split the 5xx increase by route, tenant, and dependency so you know whether the failure is broad or localized.
- Compare with recent deploys, database errors, Redis failures, and upstream provider issues.
- If a new change correlates strongly, roll it back before deeper cleanup.
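
The split by route, tenant, or dependency can be sketched as a small aggregation over request records. This example assumes each record is a dictionary with a numeric status field plus whatever dimensions you log. The field names and the 0.8 threshold are hypothetical and should be adapted to your own log schema.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Mapping


def error_share_by(records: Iterable[Mapping[str, Any]], dimension: str) -> Dict[str, float]:
    """Return each value's share of all 5xx responses for one dimension."""
    counts = Counter(
        str(record[dimension]) for record in records if int(record["status"]) >= 500
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders the result from largest share to smallest.
    return {key: count / total for key, count in counts.most_common()}


def is_localized(shares: Mapping[str, float], threshold: float = 0.8) -> bool:
    """True when a single value accounts for at least `threshold` of the errors."""
    return bool(shares) and max(shares.values()) >= threshold
```

Running error_share_by once per dimension and checking is_localized on each result gives a quick read on whether the failure is broad or concentrated in one route, tenant, or dependency.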
CriticalApiErrorRate
- Treat this as customer-impacting until disproven; establish whether core send or auth flows are failing.
- Check whether the same routes appear in HighApiErrorRate and whether upstream services are already degraded.
- Roll back or fail over quickly; detailed root-cause analysis comes after error-rate arrest.
HighCpuUsage
- Identify the hot service and confirm the increase is sustained rather than deploy-time warmup.
- Check request load, background jobs, and any tight retry loops or runaway parsing workloads.
- Scale out if needed, then capture profiles or representative stack traces while the issue is live.
HighMemoryUsage
- Confirm the process RSS growth is persistent and not just cache warmup or one-off batch activity.
- Check for queue backlog, oversized responses, and recent code paths retaining large payloads in memory.
- If the service is close to OOM, increase headroom or drain traffic before collecting heap evidence.
HighEventLoopLag
- Correlate lag with CPU saturation, blocking I/O, and synchronous work accidentally running on async paths.
- Check background jobs, webhook processing, and any new request fan-out that increased per-request blocking.
- Reduce concurrency or disable the offending path before it cascades into timeouts.
EmailQueueBacklog
- Confirm queue depth growth alongside worker throughput so you know whether demand or worker failure is driving it.
- Check downstream provider latency, worker health, and any paused or wedged delivery processors.
- If backlog continues climbing, scale workers or temporarily reduce intake.
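
Whether demand or worker failure is driving the backlog follows from simple arithmetic: over a window, the change in depth equals arrivals minus processed jobs, so arrivals can be recovered from depth and throughput. A rough Python sketch follows, assuming you can read queue depth at both ends of a window and the number of jobs completed in between. The type, function names, and the 20% tolerance are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class QueueWindow:
    depth_start: int  # jobs waiting at the start of the window
    depth_end: int    # jobs waiting at the end of the window
    processed: int    # jobs completed by workers during the window
    seconds: float    # window length in seconds


def rates(window: QueueWindow) -> Tuple[float, float]:
    """Return (arrival_rate, throughput) in jobs per second."""
    throughput = window.processed / window.seconds
    # depth_end - depth_start = arrivals - processed, so solve for arrivals.
    arrivals = (window.depth_end - window.depth_start) + window.processed
    return arrivals / window.seconds, throughput


def likely_driver(window: QueueWindow, baseline_throughput: float,
                  tolerance: float = 0.2) -> str:
    """Guess whether the backlog is demand-side or worker-side."""
    arrival_rate, throughput = rates(window)
    if arrival_rate <= throughput:
        return "draining"
    if throughput < baseline_throughput * (1 - tolerance):
        return "worker-side"
    return "demand-side"
```

For example, a window where depth grows from 1000 to 1600 while workers complete 600 jobs in 60 seconds gives an arrival rate of 20 jobs per second against a throughput of 10. If 10 is the normal throughput, the growth is demand-driven.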
CriticalEmailQueueBacklog
- Treat this as an imminent delivery delay for customer traffic.
- Check whether workers are failing, providers are degraded, or queue storage is saturated.
- Stabilize throughput first, then communicate expected delivery delay if customer impact is material.
HighWebhookFailureRate
- Determine whether failures cluster by destination, tenant, or event type.
- Check retry backlog, downstream customer endpoint health, and any signing or payload-shape regressions.
- If the issue is limited to one destination, isolate it so global delivery remains healthy.
TrackingProcessingBacklog
- Confirm the tracking service is up but failing to process events, rather than simply receiving no traffic.
- Check ingestion queues, worker consumers, and any Redis or database dependency needed for event flow.
- Compare against recent frontend or redirect changes that could have stopped event generation at the source.
StripeWebhookFailures
- Inspect webhook signature validation, payload parsing, and billing-side database writes.
- Check whether Stripe is retrying the same events or whether new events are failing uniformly.
- If necessary, pause automated billing side effects until webhook correctness is restored.
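
When inspecting signature validation by hand, it can help to recompute the signature outside the service. Stripe signs the string formed by the timestamp, a period, and the raw request body with HMAC-SHA256, and sends the result in the Stripe-Signature header as t=...,v1=.... The sketch below reproduces that check with the Python standard library for debugging purposes only. The function name and the 300-second tolerance are assumptions, and production code would normally rely on the official Stripe library rather than a hand-rolled check.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_stripe_signature(payload: bytes, sig_header: str, secret: str,
                            tolerance: int = 300,
                            now: Optional[float] = None) -> bool:
    """Check a Stripe-Signature header against the raw request body."""
    timestamp = None
    candidates = []
    for item in sig_header.split(","):
        key, _, value = item.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            candidates.append(value)
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False
    # Reject events whose timestamp is outside the tolerance window.
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance:
        return False
    # The signed payload is "<timestamp>.<raw body>", byte for byte.
    signed_payload = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    return any(
        hmac.compare_digest(expected.encode(), candidate.encode())
        for candidate in candidates
    )
```

A frequent cause of uniform failures is a body that was parsed and re-serialized before verification, which changes the bytes being signed. Feeding the exact raw body into a check like this is one way to confirm or rule that out.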
PostgresDown
- Verify exporter reachability versus actual database reachability so you do not chase a monitoring-only issue.
- Check instance health, storage exhaustion, connection exhaustion, and recent schema or config changes.
- Recover primary availability first; only then investigate secondary effects in dependent services.
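
One quick way to tell a monitoring-only problem from a real outage is to probe the database port and the exporter port separately. This is a minimal Python sketch using only the standard library. The diagnose helper and its messages are hypothetical, and the ports you probe (commonly 5432 for Postgres and 9187 for a Postgres exporter) depend on your deployment. A successful TCP connect only shows the port is accepting connections, not that the database can serve queries.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def diagnose(db_ok: bool, exporter_ok: bool) -> str:
    """Summarize which side of the alert path looks unhealthy."""
    if db_ok and exporter_ok:
        return "both reachable: check exporter credentials and scrape config"
    if db_ok:
        return "monitoring-only: database port answers, exporter does not"
    if exporter_ok:
        return "database unreachable: exporter is up but the database port is not"
    return "both unreachable: suspect host or network"
```

The same probe applies to other dependencies where the alert is driven by an exporter rather than by the service itself.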
RedisDown
- Confirm whether Redis itself is unavailable or just the exporter path.
- Check authentication, secret loading, persistence errors, and container restart loops.
- Prioritize recovery for auth, queue, and tracking flows that depend on Redis in the request path.
ClickHouseDown
- Confirm whether the outage is limited to analytics ingestion or also affects user-visible reporting paths.
- Check disk, container health, and recent migration or schema rollout activity.
- If ingestion must pause, preserve source-of-truth data elsewhere until ClickHouse returns.
PostgresHighConnections
- Identify which services or queries are consuming the connection pool and whether idle connections are accumulating.
- Check for recent deploys that increased pool size, leaked transactions, or introduced retry storms.
- Reduce pressure before the database starts refusing new sessions.
LowDiskSpace
- Determine which filesystem is filling and whether the growth is logs, database data, queue state, or build artifacts.
- Stop nonessential writes and rotate or purge safe data first.
- If a stateful service is near exhaustion, expand storage before corruption or restart loops begin.
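
Finding what is filling a filesystem can be sketched with the Python standard library: report how full the filesystem is, then rank the immediate children of a directory by total file size. The function names are illustrative. The sizes are apparent file sizes and may differ from on-disk usage for sparse or compressed files.

```python
import os
import shutil
from typing import List, Tuple


def percent_used(path: str) -> float:
    """Return the percentage of the filesystem containing `path` that is used."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def largest_children(root: str, top: int = 5) -> List[Tuple[str, int]]:
    """Return the `top` immediate children of `root` by total size in bytes."""
    sizes = []
    for entry in os.scandir(root):
        if entry.is_symlink():
            continue  # do not count link targets
        if entry.is_file(follow_symlinks=False):
            sizes.append((entry.path, entry.stat(follow_symlinks=False).st_size))
        elif entry.is_dir(follow_symlinks=False):
            total = 0
            # os.walk does not follow directory symlinks by default.
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    try:
                        total += os.lstat(os.path.join(dirpath, name)).st_size
                    except OSError:
                        pass  # file vanished or is unreadable; skip it
            sizes.append((entry.path, total))
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]
```

Running largest_children on the mount point and then on the largest result narrows the growth down to logs, database data, queue state, or build artifacts within a few iterations.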
TlsCertificateExpiringSoon
- Confirm which hostname is expiring and whether automated renewal is expected to handle it.
- Check certificate issuance logs, DNS challenges, and reverse-proxy reload state.
- Renew early enough to avoid emergency rotation during an unrelated incident.
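
To confirm which certificate a hostname is actually serving and how long it has left, the expiry can be read directly from the TLS handshake. This is a minimal Python sketch using the standard ssl module. The function names are illustrative, and the hostname and port are whatever the alert reports. With the default context used here, a certificate that has already expired or fails validation makes the handshake raise an error instead of returning a date.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining for a notAfter string such as 'Jan  5 09:34:43 2018 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400.0


def fetch_not_after(hostname: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to hostname:port and return the served certificate's notAfter."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]
```

Calling days_until_expiry(fetch_not_after(hostname)) after a renewal is also a quick way to verify that the reverse proxy has reloaded and is serving the new certificate.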