project: ufw-docker-outage-fix
type: session-notes
status: completed
tags: pbs, docker, traefik, production, ufw, security, woocommerce

Server Outage & UFW Docker Rules Fix

Summary

Production site became unresponsive after a server reboot. Root cause was incomplete UFW firewall rules in /etc/ufw/after.rules on production — Docker containers had no outbound internet access. WordPress plugins making external HTTP calls (WooCommerce, Jetpack, Yoast, etc.) were timing out on every page load, causing 60-second render times.

Timeline

  • Server became unresponsive overnight, required Linode dashboard reboot
  • Site loaded but extremely slowly (15s+, then timeouts)
  • WordPress container showed 60-second homepage render time
  • Static files served in ~89ms — confirmed PHP processing was the bottleneck
  • MySQL processlist was clean — not a database issue
  • Discovered WordPress container could not reach the internet (curl google.com failed, ping 8.8.8.8 100% packet loss)
  • Compared DOCKER-USER iptables chain between production and staging
  • Production was missing four critical rules that staging had (listed under Root Cause)
  • Root cause: after.rules on production had an older version of the Docker firewall rules that was never updated after Ansible playbook improvements
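
The production/staging comparison above can be reproduced roughly as follows. This is a sketch, not the exact commands from the session: it assumes each host's chain was first dumped to a file with `sudo iptables -S DOCKER-USER`, and the function name `rules_only_in` and the file names are hypothetical.

```shell
# Print rules that appear in the first dump but not in the second.
# Both files are assumed to hold `iptables -S DOCKER-USER` output, one rule per line.
rules_only_in() {
  reference="$1"   # e.g. staging.rules (hypothetical file name)
  candidate="$2"   # e.g. production.rules (hypothetical file name)
  # -F fixed strings, -x whole-line match, -v invert, -f read patterns from a file
  grep -vxFf "$candidate" "$reference"
}
```

Running `rules_only_in staging.rules production.rules` lists what staging has and production lacks; swapping the arguments shows the reverse drift.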

Root Cause

Production /etc/ufw/after.rules was missing:

-A DOCKER-USER -m conntrack --ctstate RELATED,ESTABLISHED -j RETURN
-A DOCKER-USER -p udp -m udp --dport 53 -j RETURN
-A DOCKER-USER -p tcp -m tcp --dport 53 -j RETURN
-A DOCKER-USER -i docker+ -o eth0 -j RETURN

Without these rules, containers could receive inbound traffic but could not initiate outbound connections. The site worked before the reboot because Docker's own iptables rules provided outbound access — but on reboot, UFW reloaded from after.rules and overwrote them with the incomplete ruleset.
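
Because the incomplete ruleset stayed hidden until a reboot, a check that runs against the file itself may be useful. The sketch below looks for the four rules listed above in a rules file. The function name is hypothetical, and exact whole-line matching is an assumption: a rule written with different spacing or option order would be reported as missing.

```shell
# Report each expected DOCKER-USER rule that is absent from a rules file.
# The rule list is copied from this note; whole-line matching is an assumption.
missing_docker_user_rules() {
  file="$1"
  status=0
  while IFS= read -r rule; do
    # -e keeps grep from parsing the leading "-A" as an option
    if ! grep -qxF -e "$rule" "$file"; then
      printf 'MISSING: %s\n' "$rule"
      status=1
    fi
  done <<'EOF'
-A DOCKER-USER -m conntrack --ctstate RELATED,ESTABLISHED -j RETURN
-A DOCKER-USER -p udp -m udp --dport 53 -j RETURN
-A DOCKER-USER -p tcp -m tcp --dport 53 -j RETURN
-A DOCKER-USER -i docker+ -o eth0 -j RETURN
EOF
  return "$status"
}
```

Calling `missing_docker_user_rules /etc/ufw/after.rules` as a user who can read the file returns non-zero when any rule is absent, so it can gate a deploy or run from cron.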

Fix Applied

  1. Backed up production after.rules: sudo cp /etc/ufw/after.rules /etc/ufw/after.rules.backup.2026-03-22
  2. Replaced production after.rules with staging's version (which matches current Ansible playbook)
  3. Ran sudo ufw reload
  4. Verified: docker exec traefik ping -c 2 8.8.8.8 — 0% packet loss
  5. Homepage render time: 60 seconds → 276 milliseconds
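
The backup and replace steps above can be sketched as a small helper. This is an illustration only: the function name and the location of the staging copy are hypothetical, and the helper assumes it runs as a user who can write to the target directory (root for /etc/ufw).

```shell
# Replace a rules file with a known-good copy, keeping a dated backup first.
# Prints the backup path on success.
replace_rules_with_backup() {
  new_rules="$1"   # known-good file, e.g. a copy fetched from staging (hypothetical)
  target="$2"      # e.g. /etc/ufw/after.rules
  backup="$target.backup.$(date +%Y-%m-%d)"
  [ -r "$new_rules" ] || { echo "cannot read $new_rules" >&2; return 1; }
  cp -p "$target" "$backup" || return 1   # step 1: back up the current file
  cp "$new_rules" "$target" || return 1   # step 2: install the replacement
  echo "$backup"
}
```

After it succeeds, `ufw reload` and the `docker exec traefik ping -c 2 8.8.8.8` check from steps 3 and 4 still need to be run by hand.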

Additional Cleanup

  • Cleaned 8,555 failed Action Scheduler tasks from wp_actionscheduler_actions table (caused by image-optimization/cleanup/stuck-operation hook accumulating since December 2025)
  • Cleaned 1,728 completed actions
  • Flushed Redis cache
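
One way to trace the source of the accumulating hook is to search the plugin tree for its name. The sketch below assumes the usual WordPress layout of one subdirectory per plugin; the function name is hypothetical, and a plugin that builds the hook name dynamically from separate strings would not be found this way.

```shell
# List plugin directories whose files contain a given hook name.
# Assumes one subdirectory per plugin under plugins_dir.
plugins_mentioning_hook() {
  hook="$1"
  plugins_dir="$2"   # e.g. the wp-content/plugins directory
  (
    cd "$plugins_dir" || exit 1
    # grep prints paths like ./plugin-name/file.php; field 2 is the plugin directory
    grep -rlF -e "$hook" . | cut -d/ -f2 | sort -u
  )
}
```

For example, `plugins_mentioning_hook 'image-optimization/cleanup/stuck-operation' wp-content/plugins` prints each matching plugin directory once.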

Key Learnings

  • UFW + Docker is fragile on reboot: Docker's runtime iptables rules can mask incomplete UFW after.rules config. Everything works until a reboot wipes Docker's rules and UFW reasserts its own.
  • Always re-run Ansible after playbook changes: The playbook was updated with correct Docker rules but never re-applied to production. Staging got the fix, production didn't.
  • Container outbound networking failure presents as slow PHP: Plugins making external HTTP calls block the entire page render while waiting for connection timeouts. Looks like a performance problem but is actually a networking problem.
  • Cold cache + broken networking = compounding failure: After reboot, no Redis cache + no opcode cache + plugins timing out on external calls = catastrophic page load times.
  • WooCommerce was a red herring: It added overhead but wasn't the root cause. The real issue predated the WooCommerce install.

Action Items

  • Investigate which plugin registers image-optimization/cleanup/stuck-operation and fix or remove it
  • Audit Ansible playbook vs production state — identify other drift
  • Consider running Ansible against production with --check --diff to see what would change before applying
  • Add a monitoring check for container outbound connectivity (e.g., Uptime Kuma ping to external host from inside a container)
  • Document WooCommerce memory impact: WordPress container went from ~300-400MB to ~728MB
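
For the outbound connectivity check, a minimal probe might look like the sketch below. It assumes the container has `curl` available (the wordpress container did during this session); the function name, the default probe URL, and the 5-second timeout are assumptions to adjust.

```shell
# Probe outbound HTTP access from inside a container.
# Prints OK/FAIL plus the container name; returns non-zero on failure.
check_container_egress() {
  container="$1"
  url="${2:-https://example.com}"   # assumed probe target
  # --max-time caps the whole request so a blocked connection fails quickly
  if docker exec "$container" curl -s -o /dev/null --max-time 5 "$url"; then
    echo "OK $container"
  else
    echo "FAIL $container"
    return 1
  fi
}
```

The exit status makes it easy to call from cron or to wire into whatever monitor is chosen, for example `check_container_egress wordpress || echo "egress broken"`.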

Diagnostic Commands Used

# Check per-container resources
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"

# Test PHP render time
time docker exec wordpress curl -s -o /dev/null -w "%{http_code}" http://localhost/

# Test container outbound access
docker exec wordpress php -r "var_dump(file_get_contents('http://google.com'));"

# Compare DOCKER-USER iptables rules
sudo iptables -L DOCKER-USER -n -v

# Check UFW after.rules
sudo grep -A 20 "DOCKER-USER" /etc/ufw/after.rules