Nginx 502 Upstream Triage


I simulated a customer-facing outage by misconfiguring the Nginx upstream (pointing to app:9999 instead of app:5000). The goal was to practice ticket-ready triage: reproduce the 502 with curl, quantify the impact in Grafana, use Loki logs to pinpoint the upstream failure, fix proxy_pass, verify recovery, and then summarize everything in a short runbook.

Stack

Docker Compose • Nginx • Flask • Postgres • Loki/Promtail • Grafana

What I Did

  • Injected an upstream misconfig to force 502s
  • Confirmed failures using curl and Grafana panels
  • Used Loki logs to find "connect() failed (111: Connection refused)" on /api/users
  • Reverted proxy_pass and verified 200 OK
  • Documented a simple Nginx 502 troubleshooting flow

Incident Timeline

  • Baseline: /api/users healthy behind Nginx
  • Change: upstream updated to app:9999
  • Impact: 502s and Nginx 5xx/min spike
  • Evidence: logs show upstream connect() failed
  • Fix: restore upstream to app:5000
  • Recovery: 200 OK, 5xx back to 0

Incident Response Story

1) Introduce Misconfig & Reproduce 502

I intentionally pointed Nginx at the wrong upstream (app:9999) and hit /api/users through the proxy. This mimics a config change that breaks routing in production. As expected, curl returned 502 Bad Gateway, matching what a customer would report.
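
For context, the breaking change is one line in nginx/default.conf. A minimal sketch of the relevant block (the /api/ location path is an assumption; the proxy_pass values match the sed commands in Key Commands below):

Nginx config (sketch)

location /api/ {
    # Broken: nothing listens on app:9999, so Nginx returns 502 Bad Gateway
    proxy_pass http://app:9999;
    # Working value: proxy_pass http://app:5000;
}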

Terminal: after changing proxy_pass to http://app:9999, curl -i http://localhost:8080/api/users returns 502 Bad Gateway.

2) Detect Impact in Metrics & Logs

With the misconfig in place, I generated traffic and used Grafana to see the impact: Nginx 5xx/min spiked during the incident window, while Loki logs showed upstream connection failures on /api/users. This confirmed that the 502s were coming from Nginx, not the app or database.
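
In Explore, one quick way to surface that evidence is to add a line filter for the error string to the Nginx selector (the labels match the queries in Key Commands below; the filter substring is taken from the log line itself):

LogQL (Loki)

# Show only the upstream connection errors
{compose_project="appsupportlab",compose_service="nginx"} |= "connect() failed"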

Grafana: the Nginx 5xx/min panel spikes during the misconfigured-upstream window.
Loki: log evidence shows connect() failed (111: Connection refused) and 502 on /api/users, confirming Nginx can't reach the upstream container.

3) Fix Upstream & Verify Recovery

I restored proxy_pass back to http://app:5000, restarted Nginx, and re-ran the same curl checks. The endpoint returned 200 OK again, and 5xx counts dropped back to zero. Metrics and logs both showed a clean post-fix window.
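
To check for a clean window rather than a single lucky request, a short traffic replay works well; this is a sketch against the same endpoint, printing only status codes (expect a run of 200s):

Shell

# Replay 20 requests and print just the HTTP status code for each
for i in $(seq 1 20); do
  curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/api/users
done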

Grafana: the requests/min panel shows the test traffic used before and after the fix.
Terminal: after restoring the upstream to app:5000, curl -i http://localhost:8080/api/users returns 200 OK, confirming recovery.

Key Commands Used

Shell

# Break it: misconfigure upstream and restart Nginx
sed -i 's|proxy_pass http://app:5000|proxy_pass http://app:9999|' nginx/default.conf
docker compose restart nginx
curl -i http://localhost:8080/api/users   # 502

# Fix it: restore upstream and restart Nginx
sed -i 's|proxy_pass http://app:9999|proxy_pass http://app:5000|' nginx/default.conf
docker compose restart nginx
curl -i http://localhost:8080/api/users   # 200
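
# Optional sanity checks (assumes the Compose services are named "app" and "nginx", as above)
docker compose ps app                  # confirm the app container is actually up
docker compose logs --tail=20 nginx    # surfaces the connect() failed (111) lines during the incident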

LogQL (Loki)

# Nginx requests/min
sum(count_over_time({compose_project="appsupportlab",compose_service="nginx"} [1m]))

# Nginx 5xx/min
sum(count_over_time({compose_project="appsupportlab",compose_service="nginx"} |~ " 5\\d\\d " [1m]))

# Logs selector for evidence (filter in Explore)
{compose_project="appsupportlab",compose_service="nginx"}
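
# Upstream connection errors/min (assumes the "connect() failed" string from the evidence above)
sum(count_over_time({compose_project="appsupportlab",compose_service="nginx"} |= "connect() failed" [1m]))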

Outcome & Prevention