Application Support Engineer Projects

Incident-driven reliability labs showcasing triage, logging/monitoring, and runbooks.

Database Credential Rotation Incident

Simulated a production outage caused by a Postgres password rotation where the application still used the old secret. Traced 500s on the users API to DB auth failures, rolled back the credential safely, and documented a rotation checklist to prevent repeats.

  • DB password rotation → 500s on /api/users
  • Log correlation of 5xx responses with Postgres auth failures
  • Mitigation: roll back the DB password, or update the app secret and restart the app
  • Written DB credential rotation checklist
Status: Completed
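
The log-correlation step can be sketched in a few lines of Python. This is a minimal illustration, not the lab's actual tooling: the application log format (`status=500` with an ISO timestamp) and the helper names `per_minute` and `correlate` are assumptions. The Postgres message text "password authentication failed" is the standard one for a rejected password.

```python
import re
from collections import Counter

# Assumed timestamp prefix shared by both logs, e.g. "2024-05-01T12:03:07Z"
# or "2024-05-01 12:03:07.120 UTC". Adjust to the real log formats.
TS = r"(?P<minute>\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}):\d{2}"
APP_5XX = re.compile(TS + r".*\bstatus=5\d\d\b")          # hypothetical app log line
PG_AUTH_FAIL = re.compile(TS + r".*password authentication failed")


def per_minute(lines, pattern):
    """Count lines matching `pattern`, bucketed by minute."""
    counts = Counter()
    for line in lines:
        m = pattern.match(line)
        if m:
            counts[m.group("minute").replace("T", " ")] += 1
    return counts


def correlate(app_lines, pg_lines):
    """Map each minute to (app 5xx count, Postgres auth-failure count)."""
    app = per_minute(app_lines, APP_5XX)
    pg = per_minute(pg_lines, PG_AUTH_FAIL)
    return {minute: (app[minute], pg[minute]) for minute in sorted(set(app) | set(pg))}
```

Minutes where both counts rise together point at the database credentials rather than at the application code.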

API Key Misconfiguration — 401 Auth Outage

Added header-based API key auth to a Flask service behind Nginx, then intentionally broke the key to reproduce an “everything is 401 now” outage. Used Prometheus metrics and Grafana to spot the pattern and verify the fix.

  • Protected /api/users with an X-API-Key header
  • Custom metrics: app_requests_total, app_auth_success_total, app_auth_failures_total
  • Grafana panels for requests/sec, auth successes/sec, and failures/sec
  • Mini runbook for diagnosing and fixing 401 storms from bad/rotated API keys
Status: Completed
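
The header check and the three counters can be sketched without Flask or a Prometheus client, so the example runs on the standard library alone. The function names `check_api_key` and `auth_failure_ratio` are hypothetical, and a plain `Counter` stands in for real Prometheus counters; only the header name and metric names come from the list above.

```python
import hmac
from collections import Counter

# Stand-in for Prometheus counters; a real service would export these
# through a metrics client instead of an in-process Counter.
metrics = Counter()


def check_api_key(headers, expected_key):
    """Return 200 if the X-API-Key header matches, else 401."""
    metrics["app_requests_total"] += 1
    # Note: a plain dict lookup is case-sensitive; web frameworks usually
    # normalise header names before this point.
    supplied = headers.get("X-API-Key", "")
    # compare_digest avoids leaking key contents through timing differences.
    if supplied and hmac.compare_digest(supplied.encode(), expected_key.encode()):
        metrics["app_auth_success_total"] += 1
        return 200
    metrics["app_auth_failures_total"] += 1
    return 401


def auth_failure_ratio():
    """Share of requests rejected so far; near 1.0 suggests a bad or rotated key."""
    total = metrics["app_requests_total"]
    return metrics["app_auth_failures_total"] / total if total else 0.0
```

During a 401 storm the failure ratio jumps toward 1.0 while the request rate stays flat, which is the pattern the Grafana panels are meant to surface.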

Nginx 502 Upstream Triage

Injected an upstream misconfiguration to force 502 Bad Gateway responses, observed the impact, triaged using Loki logs and metrics in Grafana, and restored service following a documented runbook.

  • Nginx ↔ Flask API ↔ Postgres
  • Loki/Promtail + Grafana dashboards
  • Before/after proof with curl & 5xx metrics
Status: Completed
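
One way to quantify the before/after proof is to compute the 5xx share directly from Nginx access-log lines. The sketch below assumes Nginx's default `combined` log format, where the status code follows the quoted request line; the helper name `error_rate` is hypothetical and not part of the original lab.

```python
import re

# In the combined format the first quoted field is the request line,
# followed by the three-digit status code.
STATUS = re.compile(r'"[^"]*" (?P<status>\d{3}) ')


def error_rate(lines, prefix="5"):
    """Fraction of parsed requests whose status starts with `prefix` (default: 5xx)."""
    statuses = [m.group("status") for m in map(STATUS.search, lines) if m]
    if not statuses:
        return 0.0
    return sum(s.startswith(prefix) for s in statuses) / len(statuses)
```

Running it on log slices captured during the injected fault and again after the fix gives a single number to record in the runbook alongside the curl output.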