Nginx 502 Upstream Triage


I simulated a customer-facing outage by misconfiguring the Nginx upstream (pointing to app:9999 instead of app:5000). The goal was to practice ticket-ready triage: reproduce the 502 with curl, quantify the impact in Grafana, use Loki logs to pinpoint the upstream failure, fix proxy_pass, verify recovery, and then summarize everything in a short runbook.

Stack

Docker Compose • Nginx • Flask • Postgres • Loki/Promtail • Grafana

What I Did

  • Injected an upstream misconfig to force 502s
  • Confirmed failures using curl and Grafana panels
  • Used Loki logs to find "connect() failed (111: Connection refused)" on /api/users
  • Reverted proxy_pass and verified 200 OK
  • Documented a simple Nginx 502 troubleshooting flow

Incident Timeline

  • Baseline: /api/users healthy behind Nginx
  • Change: upstream updated to app:9999
  • Impact: 502s and Nginx 5xx/min spike
  • Evidence: logs show upstream connect() failed
  • Fix: restore upstream to app:5000
  • Recovery: 200 OK, 5xx back to 0

Incident Response Story

1) Introduce Misconfig & Reproduce 502

I intentionally pointed Nginx at the wrong upstream (app:9999) and hit /api/users through the proxy. This mimics a config change that breaks routing in production. As expected, curl returned 502 Bad Gateway, matching what a customer would report.
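
For context, the breaking change is one line in nginx/default.conf. A minimal sketch of the relevant block (the /api/ location path is an assumption; the proxy_pass values match the sed commands in Key Commands below):

Nginx config (sketch)

location /api/ {
    # Broken: nothing listens on app:9999, so Nginx returns 502 Bad Gateway
    proxy_pass http://app:9999;
    # Working value: proxy_pass http://app:5000;
}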

Terminal: after changing proxy_pass to http://app:9999, curl -i http://localhost:8080/api/users returns 502 Bad Gateway.

2) Detect Impact in Metrics & Logs

With the misconfig in place, I generated traffic and used Grafana to see the impact: Nginx 5xx/min spiked during the incident window, while Loki logs showed upstream connection failures on /api/users. This confirmed that the 502s were coming from Nginx, not the app or database.
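
In Explore, one quick way to surface that evidence is to add a line filter for the error string to the Nginx selector (the labels match the queries in Key Commands below; the filter substring is taken from the log line itself):

LogQL (Loki)

# Show only the upstream connection errors
{compose_project="appsupportlab",compose_service="nginx"} |= "connect() failed"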

Grafana: the Nginx 5xx/min panel spikes during the misconfigured-upstream window.
Loki: log evidence shows connect() failed (111: Connection refused) and 502 on /api/users, confirming Nginx can't reach the upstream container.

3) Fix Upstream & Verify Recovery

I restored proxy_pass back to http://app:5000, restarted Nginx, and re-ran the same curl checks. The endpoint returned 200 OK again, and 5xx counts dropped back to zero. Metrics and logs both showed a clean post-fix window.
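
To check for a clean window rather than a single lucky request, a short traffic replay works well; this is a sketch against the same endpoint, printing only status codes (expect a run of 200s):

Shell

# Replay 20 requests and print just the HTTP status code for each
for i in $(seq 1 20); do
  curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/api/users
done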

Grafana: the requests/min panel shows the test traffic used before and after the fix.
Terminal: after restoring the upstream to app:5000, curl -i http://localhost:8080/api/users returns 200 OK, confirming recovery.

Key Commands Used

Shell

# Break it: misconfigure upstream and restart Nginx
sed -i 's|proxy_pass http://app:5000|proxy_pass http://app:9999|' nginx/default.conf
docker compose restart nginx
curl -i http://localhost:8080/api/users   # 502

# Fix it: restore upstream and restart Nginx
sed -i 's|proxy_pass http://app:9999|proxy_pass http://app:5000|' nginx/default.conf
docker compose restart nginx
curl -i http://localhost:8080/api/users   # 200
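
# Optional sanity checks (assumes the Compose services are named "app" and "nginx", as above)
docker compose ps app                  # confirm the app container is actually up
docker compose logs --tail=20 nginx    # surfaces the connect() failed (111) lines during the incident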

LogQL (Loki)

# Nginx requests/min
sum(count_over_time({compose_project="appsupportlab",compose_service="nginx"} [1m]))

# Nginx 5xx/min
sum(count_over_time({compose_project="appsupportlab",compose_service="nginx"} |~ " 5\\d\\d " [1m]))

# Logs selector for evidence (filter in Explore)
{compose_project="appsupportlab",compose_service="nginx"}
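
# Upstream connection errors/min (assumes the "connect() failed" string from the evidence above)
sum(count_over_time({compose_project="appsupportlab",compose_service="nginx"} |= "connect() failed" [1m]))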

Outcome & Prevention