Application Support Engineer Projects

Incident-driven reliability labs showcasing triage, logging/monitoring, and runbooks.

Database Credential Rotation Incident

Simulated a production outage caused by a Postgres password rotation where the application still used the old secret. Traced HTTP 500 responses on the users API to DB authentication failures, rolled back the credential safely, and documented a rotation checklist to prevent repeats.

DB password rotation → 500s on /api/users
Log correlation of 5xx responses with Postgres auth failures
Mitigation via rollback or app-secret update + restart
Written DB credential rotation checklist
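
A minimal sketch of the log-correlation step, bucketing both logs by minute and keeping the minutes where 5xx responses on /api/users line up with Postgres auth failures. The application log format ("timestamp method path status") is a hypothetical one chosen for illustration, and the Postgres lines are assumed to use the default "%m [%p] " prefix; the helper names are not from the project.

```python
import re
from collections import Counter

# Assumed formats (hypothetical): app lines look like
#   "2024-05-01T10:15:02Z GET /api/users 500"
# and Postgres lines use the default "%m [%p] " log_line_prefix.
APP_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2})T(\d{2}:\d{2}):\d{2}\S*\s+\S+\s+(\S+)\s+(\d{3})$"
)
PG_AUTH_FAIL = re.compile(
    r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}):\d{2}\S*.*FATAL:\s+password authentication failed"
)


def minute_buckets(lines, pattern, keep=lambda m: True):
    """Count matching lines per 'YYYY-MM-DD HH:MM' minute."""
    counts = Counter()
    for line in lines:
        m = pattern.search(line.strip())
        if m and keep(m):
            counts[f"{m.group(1)} {m.group(2)}"] += 1
    return counts


def correlate(app_lines, pg_lines, path="/api/users"):
    """Return minutes where 5xx responses on `path` coincide with DB auth failures."""
    errors = minute_buckets(
        app_lines,
        APP_LINE,
        keep=lambda m: m.group(3) == path and m.group(4).startswith("5"),
    )
    auth_failures = minute_buckets(pg_lines, PG_AUTH_FAIL)
    # Only minutes present in both logs count as correlated.
    return {
        minute: (errors[minute], auth_failures[minute])
        for minute in sorted(errors.keys() & auth_failures.keys())
    }
```

Each value is a pair of (5xx count, auth-failure count) for that minute. Overlapping minutes suggest the credential as a likely cause but do not prove it; the rollback or secret update is what confirms the diagnosis.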

Status: Completed

API Key Misconfiguration — 401 Auth Outage

Added header-based API key auth to a Flask service behind Nginx, then intentionally broke the key to reproduce an “everything is 401 now” outage. Used Prometheus metrics and Grafana to spot the pattern and verify the fix.

Protected /api/users with an X-API-Key header
Custom metrics: app_requests_total, app_auth_success_total, app_auth_failures_total
Grafana panels for requests/sec, auth successes/sec, and failures/sec
Mini runbook for diagnosing and fixing 401 storms from bad/rotated API keys
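
The header check and the three counters can be sketched without any framework, as below. This is an illustration under assumptions, not the service's actual code: a plain Counter stands in for the Prometheus counters, headers are passed as a plain case-sensitive dict, and check_api_key and auth_failure_ratio are hypothetical names.

```python
import hmac
from collections import Counter

# Stand-in for Prometheus counters; the keys mirror the metric names listed above.
metrics = Counter()


def check_api_key(headers, expected_key):
    """Return the HTTP status a protected route would answer with."""
    metrics["app_requests_total"] += 1
    supplied = headers.get("X-API-Key", "")
    # Constant-time comparison avoids leaking key contents through timing.
    if expected_key and hmac.compare_digest(supplied.encode(), expected_key.encode()):
        metrics["app_auth_success_total"] += 1
        return 200
    metrics["app_auth_failures_total"] += 1
    return 401


def auth_failure_ratio():
    """Share of requests rejected with 401; close to 1.0 during a bad-key outage."""
    total = metrics["app_requests_total"]
    return metrics["app_auth_failures_total"] / total if total else 0.0
```

During a rotated-key outage the failure counter climbs in step with the request counter while the success counter flattens, which is the pattern the Grafana panels make visible.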

Status: Completed

Nginx 502 Upstream Triage

Injected an upstream misconfiguration to force 502 Bad Gateway responses, observed the impact, triaged via Loki/Grafana logs and metrics, and restored service with a documented runbook.

Nginx ↔ Flask API ↔ Postgres
Loki/Promtail + Grafana dashboards
Before/after proof with curl & 5xx metrics
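
One way to triage a 502 from the Nginx side is to group error-log entries by the upstream address Nginx failed to reach. The sketch below assumes the standard Nginx error-log wording ("... while connecting to upstream ... upstream: "http://host:port/..."") and uses a hypothetical helper name; the addresses in any sample lines are illustrative, not taken from the lab.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches Nginx error-log entries for failed upstream connections and
# captures the failure reason plus the upstream URL Nginx tried.
UPSTREAM_ERROR = re.compile(
    r'\[error\].*?: \*\d+ (?P<reason>.+?) while connecting to upstream'
    r'.*?upstream: "(?P<upstream>[^"]+)"'
)


def summarize_upstream_failures(error_log_lines):
    """Return [((host:port, reason), count), ...] sorted by most frequent."""
    failures = Counter()
    for line in error_log_lines:
        m = UPSTREAM_ERROR.search(line)
        if m:
            target = urlsplit(m.group("upstream")).netloc
            failures[(target, m.group("reason"))] += 1
    return failures.most_common()
```

If every failure points at one host:port that differs from where the Flask API actually listens, the upstream definition is the likely culprit rather than the application itself.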

Status: Completed