diff --git a/PBS/Inbox/ufw-docker-outage-fix.md b/PBS/Inbox/ufw-docker-outage-fix.md
new file mode 100644
index 0000000..6409c9b
--- /dev/null
+++ b/PBS/Inbox/ufw-docker-outage-fix.md
@@ -0,0 +1,125 @@
+---
+project: ufw-docker-outage-fix
+type: session-notes
+status: completed
+tags:
+  - pbs
+  - docker
+  - traefik
+  - production
+  - ufw
+  - security
+  - woocommerce
+---
+
+# Server Outage & UFW Docker Rules Fix
+
+## Summary
+
+Production site became unresponsive after a server reboot. Root cause was
+incomplete UFW firewall rules in `/etc/ufw/after.rules` on production:
+Docker containers had no outbound internet access. WordPress plugins making
+external HTTP calls (WooCommerce, Jetpack, Yoast, etc.) were timing out on
+every page load, causing 60-second render times.
+
+## Timeline
+
+- Server became unresponsive overnight, required Linode dashboard reboot
+- Site loaded but extremely slowly (15s+, then timeouts)
+- WordPress container showed 60-second homepage render time
+- Static files served in ~89ms, confirming PHP processing was the bottleneck
+- MySQL processlist was clean, so not a database issue
+- Discovered WordPress container could not reach the internet (`curl
+google.com` failed, `ping 8.8.8.8` 100% packet loss)
+- Compared `DOCKER-USER` iptables chain between production and staging
+- Production was missing four critical rules that staging had
+- Root cause: `after.rules` on production had an older version of the
+Docker firewall rules that was never updated after Ansible playbook
+improvements
+
+## Root Cause
+
+Production `/etc/ufw/after.rules` was missing:
+
+```
+-A DOCKER-USER -m conntrack --ctstate RELATED,ESTABLISHED -j RETURN
+-A DOCKER-USER -p udp -m udp --dport 53 -j RETURN
+-A DOCKER-USER -p tcp -m tcp --dport 53 -j RETURN
+-A DOCKER-USER -i docker+ -o eth0 -j RETURN
+```
+
+Without these rules, containers could receive inbound traffic but could not
+initiate outbound connections.
+The site worked before the reboot because Docker's own iptables rules
+provided outbound access, but on reboot UFW reloaded from `after.rules`
+and overwrote them with the incomplete ruleset.
+
+## Fix Applied
+
+1. Backed up production `after.rules`: `sudo cp /etc/ufw/after.rules
+/etc/ufw/after.rules.backup.2026-03-22`
+2. Replaced production `after.rules` with staging's version (which matches
+current Ansible playbook)
+3. Ran `sudo ufw reload`
+4. Verified: `docker exec traefik ping -c 2 8.8.8.8` returned 0% packet loss
+5. Homepage render time: 60 seconds → 276 milliseconds
+
+## Additional Cleanup
+
+- Cleaned 8,555 failed Action Scheduler tasks from
+`wp_actionscheduler_actions` table (caused by
+`image-optimization/cleanup/stuck-operation` hook accumulating since
+December 2025)
+- Cleaned 1,728 completed actions
+- Flushed Redis cache
+
+## Key Learnings
+
+- **UFW + Docker is fragile on reboot:** Docker's runtime iptables rules
+can mask incomplete UFW `after.rules` config. Everything works until a
+reboot wipes Docker's rules and UFW reasserts its own.
+- **Always re-run Ansible after playbook changes:** The playbook was
+updated with correct Docker rules but never re-applied to production.
+Staging got the fix, production didn't.
+- **Container outbound networking failure presents as slow PHP:** Plugins
+making external HTTP calls block the entire page render while waiting for
+connection timeouts. Looks like a performance problem but is actually a
+networking problem.
+- **Cold cache + broken networking = compounding failure:** After reboot,
+no Redis cache + no opcode cache + plugins timing out on external calls =
+catastrophic page load times.
+- **WooCommerce was a red herring:** It added overhead but wasn't the root
+cause. The real issue predated the WooCommerce install.
+
+## Action Items
+
+- [ ] Investigate which plugin registers
+`image-optimization/cleanup/stuck-operation` and fix or remove it
+- [ ] Audit Ansible playbook vs production state to identify other drift
+- [ ] Consider running Ansible against production with `--check --diff` to
+see what would change before applying
+- [ ] Add a monitoring check for container outbound connectivity (e.g.,
+Uptime Kuma ping to external host from inside a container)
+- [ ] Document WooCommerce memory impact: WordPress container went from
+~300-400MB to ~728MB
+
+## Diagnostic Commands Used
+
+```bash
+# Check per-container resources
+docker stats --no-stream --format \
+  "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"
+
+# Test PHP render time (prints the HTTP status code, then the timing)
+time docker exec wordpress curl -s -o /dev/null -w "%{http_code}" \
+  http://localhost/
+
+# Test container outbound access from PHP itself
+docker exec wordpress php -r \
+  "var_dump(file_get_contents('http://google.com'));"
+
+# Compare DOCKER-USER iptables rules
+sudo iptables -L DOCKER-USER -n -v
+
+# Check the DOCKER-USER section of UFW after.rules
+sudo grep -A 20 "DOCKER-USER" /etc/ufw/after.rules
+```
\ No newline at end of file
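
The manual comparison of the `DOCKER-USER` chain between staging and production, and the drift audit listed under Action Items, could be partly scripted. The following is a minimal sketch, not the playbook's own mechanism. It assumes the four rules from the Root Cause section are the expected baseline and that `iptables -S DOCKER-USER` prints them in exactly that form, which may vary between iptables versions. The function names `expected_rules` and `missing_rules` are hypothetical and exist only for this illustration.

```bash
#!/bin/sh
# Sketch: report which expected DOCKER-USER rules are absent from a ruleset.
# Assumption: stdin carries `iptables -S DOCKER-USER` style output, one rule
# per line, formatted the same way as the baseline below.

# Baseline copied from the Root Cause section of these notes.
expected_rules() {
  cat <<'EOF'
-A DOCKER-USER -m conntrack --ctstate RELATED,ESTABLISHED -j RETURN
-A DOCKER-USER -p udp -m udp --dport 53 -j RETURN
-A DOCKER-USER -p tcp -m tcp --dport 53 -j RETURN
-A DOCKER-USER -i docker+ -o eth0 -j RETURN
EOF
}

# Print every baseline rule that does not appear verbatim on stdin.
# Returns 0 when nothing is missing, 1 otherwise.
missing_rules() {
  live=$(cat)   # capture the live ruleset before the loop takes over stdin
  status=0
  while IFS= read -r rule; do
    # -F fixed string, -x whole line, -q quiet, -e because the rule starts with "-"
    if ! printf '%s\n' "$live" | grep -Fxq -e "$rule"; then
      printf 'MISSING: %s\n' "$rule"
      status=1
    fi
  done <<EOF
$(expected_rules)
EOF
  return "$status"
}
```

With the functions sourced on the host, a run such as `sudo iptables -S DOCKER-USER | missing_rules` prints one `MISSING:` line per absent rule and exits non-zero, so it could feed a cron job or monitoring check. Exact whole-line matching is deliberately strict: a rule written with different option ordering would be reported as missing, which is the safer failure mode for a drift check.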