Database Credential Rotation Incident


End-to-end incident response for a realistic outage: the Postgres password was rotated while the application still used the old secret, so DB-backed routes returned 500s. I scoped the incident by timestamp, reproduced the failure once, correlated logs, mitigated safely, validated recovery, and wrote a short rotation checklist to prevent repeats.

Stack

Docker • Nginx • Flask • Postgres • Linux

What I Did

Captured baseline behavior & timestamp window
Rotated the DB password to simulate an outage
Correlated 500s on /api/users with FATAL auth in app logs
Mitigated by restoring the original DB password (alternative: update the app secret to the new value and restart the app)
Validated 200s and a clean log window after recovery
Published a DB-secret rotation checklist

Incident Timeline

Baseline: routes 200
Rotate DB password → 500s on users API
Logs show Postgres authentication failures
Rollback/secret update → app restart
Recovery validated; log window clean

Incident Response Story

1) Baseline & Scope

Confirm all services are healthy and take a quick baseline (/api/users returns 200). Note the Date header on the response to anchor the timestamp window, so that later log entries and requests can be aligned against it.

Figure: docker compose ps shows all services up. Baseline: services up before any credential changes.
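One way to record the baseline in a single line is to reduce the `curl -i` output to its status code and Date header. This is a minimal sketch, not the tooling used in the incident: the helper name `baseline_summary` is hypothetical, and it assumes the response headers arrive on stdin.

```shell
# Hypothetical helper: summarize `curl -i` output read from stdin as
# "<status> <Date header value>".
baseline_summary() {
  tr -d '\r' | awk '
    NR == 1 { status = $2 }                  # status line, e.g. HTTP/1.1 200 OK
    tolower($1) == "date:" && date == "" {   # first Date header only
      sub(/^[^:]*:[ \t]*/, ""); date = $0
    }
    /^$/ { exit }                            # blank line ends the headers
    END { print status, date }
  '
}

# Usage, with the URL from the commands in this write-up:
# curl -si http://localhost:8080/api/users | baseline_summary
```

Saving that one line next to the incident notes gives a fixed reference point for the log correlation that follows.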

2) Introduce Change → Reproduce Failure

Rotate the DB password in Postgres while the app still uses the old secret. DB-backed routes flip to 500; capture the failures in the same timestamp window as the credential change.

Figure: ALTER USER postgres WITH PASSWORD (credential rotation). Outage introduced: the Postgres password is rotated while the app still has the old secret.

Figure: app logs showing GET /users 500 and FATAL: password authentication failed. Failure window: `GET /api/users` returns 500 and app logs show `FATAL` password authentication failures from Postgres.
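The correlation step can be sketched as a filter that keeps only auth-failure lines inside the timestamp window. The helper name `auth_failures_in_window` is hypothetical. It assumes the input is `docker compose logs --timestamps` output, where each line carries an RFC 3339 UTC timestamp, and it compares timestamps at one-second resolution.

```shell
# Hypothetical helper: print log lines inside [lo, hi] that carry a Postgres
# password authentication failure. Reads timestamped log lines on stdin;
# lo and hi are RFC 3339 UTC strings such as 2024-01-01T00:00:00Z.
auth_failures_in_window() {
  awk -v lo="$1" -v hi="$2" '
    /password authentication failed/ {
      for (i = 1; i <= NF; i++) {
        # First field that looks like a timestamp (YYYY-MM-DDT...).
        if ($i ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]T/) {
          ts = substr($i, 1, 19)   # truncate to whole seconds
          if (ts >= substr(lo, 1, 19) && ts <= substr(hi, 1, 19)) print
          break
        }
      }
    }
  '
}

# Usage (window bounds are placeholders):
# docker compose logs --timestamps app | auth_failures_in_window "$START" "$END"
```

Comparing the truncated timestamps as plain strings works because RFC 3339 UTC timestamps sort lexicographically in chronological order.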

3) Mitigation

The quickest mitigation is to restore the previous credential so the app and DB match again.

Figure: rollback command restoring the credential. Mitigation: restore the original DB password.

Figure: ALTER ROLE confirmation after rollback. Rollback confirmed in Postgres: ALTER ROLE completed successfully.

4) Recovery Validation

Re-test /api/users to confirm 200s, and tail logs to ensure the window is clean (no new auth failures). Document the incident and add the rotation checklist so future password changes don't cause surprise outages.

Figure: post-fix log tail with no new authentication failures. Clean window after fix: follow-up requests succeed and there are no new DB auth failures in the logs.
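A single successful request can be luck, so the re-test could be repeated a few times. The sketch below is illustrative only: `http_status` and `smoke_test` are hypothetical helper names, and the default of five attempts is an arbitrary assumption.

```shell
# Return the HTTP status code for a URL, using curl's %{http_code} write-out.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Hypothetical helper: succeed only if every one of N requests returns 200.
smoke_test() {
  url=$1
  attempts=${2:-5}   # assumed default; adjust to taste
  i=1
  while [ "$i" -le "$attempts" ]; do
    code=$(http_status "$url")
    if [ "$code" != "200" ]; then
      echo "attempt $i: got $code" >&2
      return 1
    fi
    i=$((i + 1))
  done
  echo "ok: $attempts/$attempts requests returned 200"
}

# Usage:
# smoke_test http://localhost:8080/api/users 5
```

The non-zero exit status on the first failure makes the helper easy to chain after a restart command with `&&`.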

Key Commands Used

Repro & Evidence

# Baseline
curl -i http://localhost:8080/api/users

# Introduce outage (DB rotation only)
docker compose exec db psql -U postgres -d appdb -c "ALTER USER postgres WITH PASSWORD 'WrongNow#1';"

# Failure & logs (aligned by timestamp)
curl -i http://localhost:8080/api/users      # expect 500
docker compose logs --timestamps --tail=50 app | grep -Ei "FATAL|auth|psycopg2"

Mitigation & Validation

# Fast rollback
docker compose exec db psql -U postgres -d appdb -c "ALTER USER postgres WITH PASSWORD 'postgres';"

# OR update the app's DB secret to the new value, then recreate the app container:
docker compose up -d --build app

# Validate recovery
curl -i http://localhost:8080/api/users      # expect 200
docker compose logs --timestamps --since=2m app

Outcome & Prevention

Outage localized to a DB credential mismatch between app and Postgres; recovered quickly once secrets were aligned.
Added a DB-secret rotation checklist: update app secret → restart → smoke test → record timestamp.
Set a simple alert on 5xx/auth-fail spikes to catch this class of issues early in production.
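The 5xx side of such an alert can be approximated with a small log check. This is a sketch under stated assumptions, not the alert that was deployed: the helper name `check_5xx_rate` is hypothetical, the default threshold of 5 percent is arbitrary, and the input is assumed to be raw nginx access log lines in the default "combined" format, where the status code is the ninth whitespace-separated field.

```shell
# Hypothetical helper: report the share of 5xx responses on stdin and exit
# non-zero when it exceeds the threshold percentage given as $1 (default 5).
check_5xx_rate() {
  awk -v max="${1:-5}" '
    # Count only lines whose 9th field looks like an HTTP status code.
    $9 ~ /^[1-5][0-9][0-9]$/ { total++; if ($9 >= 500) errors++ }
    END {
      pct = total ? 100 * errors / total : 0
      printf "5xx: %d/%d (%.1f%%)\n", errors, total, pct
      exit (pct > max)
    }
  '
}

# Usage (the log path is an assumption about the nginx container):
# tail -n 500 /var/log/nginx/access.log | check_5xx_rate 5
```

Run from cron or a monitoring agent, the exit status can drive a notification. Lines piped from `docker compose logs` carry a service prefix that shifts the field positions, so the field index would need adjusting there.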