---
project: ufw-docker-outage-fix
type: session-notes
status: completed
tags:
  - pbs
  - docker
  - traefik
  - production
  - ufw
  - security
  - woocommerce
---

# Server Outage & UFW Docker Rules Fix

## Summary

Production site became unresponsive after a server reboot. Root cause was incomplete UFW firewall rules in `/etc/ufw/after.rules` on production: Docker containers had no outbound internet access. WordPress plugins making external HTTP calls (WooCommerce, Jetpack, Yoast, etc.) were timing out on every page load, causing 60-second render times.

## Timeline

- Server became unresponsive overnight, required Linode dashboard reboot
- Site loaded but extremely slowly (15s+, then timeouts)
- WordPress container showed 60-second homepage render time
- Static files served in ~89ms, which confirmed PHP processing was the bottleneck
- MySQL processlist was clean, so not a database issue
- Discovered WordPress container could not reach the internet (`curl google.com` failed, `ping 8.8.8.8` 100% packet loss)
- Compared `DOCKER-USER` iptables chain between production and staging
- Production was missing four critical rules that staging had (listed under Root Cause)
- Root cause: `after.rules` on production had an older version of the Docker firewall rules that was never updated after Ansible playbook improvements

## Root Cause

Production `/etc/ufw/after.rules` was missing:

```
-A DOCKER-USER -m conntrack --ctstate RELATED,ESTABLISHED -j RETURN
-A DOCKER-USER -p udp -m udp --dport 53 -j RETURN
-A DOCKER-USER -p tcp -m tcp --dport 53 -j RETURN
-A DOCKER-USER -i docker+ -o eth0 -j RETURN
```

Without these rules, containers could receive inbound traffic but could not initiate outbound connections. The site worked before the reboot because Docker's own iptables rules provided outbound access, but on reboot UFW reloaded from `after.rules` and overwrote them with the incomplete ruleset.

## Fix Applied

1. Backed up production `after.rules`: `sudo cp /etc/ufw/after.rules /etc/ufw/after.rules.backup.2026-03-22`
2. Replaced production `after.rules` with staging's version (which matches current Ansible playbook)
3. Ran `sudo ufw reload`
4. Verified: `docker exec traefik ping -c 2 8.8.8.8` reported 0% packet loss
5. Homepage render time: 60 seconds → 276 milliseconds

## Additional Cleanup

- Cleaned 8,555 failed Action Scheduler tasks from `wp_actionscheduler_actions` table (caused by `image-optimization/cleanup/stuck-operation` hook accumulating since December 2025)
- Cleaned 1,728 completed actions
- Flushed Redis cache

## Key Learnings

- **UFW + Docker is fragile on reboot:** Docker's runtime iptables rules can mask incomplete UFW `after.rules` config. Everything works until a reboot wipes Docker's rules and UFW reasserts its own.
- **Always re-run Ansible after playbook changes:** The playbook was updated with correct Docker rules but never re-applied to production. Staging got the fix, production didn't.
- **Container outbound networking failure presents as slow PHP:** Plugins making external HTTP calls block the entire page render while waiting for connection timeouts. Looks like a performance problem but is actually a networking problem.
- **Cold cache + broken networking = compounding failure:** After reboot, no Redis cache + no opcode cache + plugins timing out on external calls = catastrophic page load times.
- **WooCommerce was a red herring:** It added overhead but wasn't the root cause. The real issue predated the WooCommerce install.

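Because the live `DOCKER-USER` chain can look healthy while `after.rules` is incomplete, it may help to check the file UFW will load on the next reboot rather than the running chain. The following is a minimal sketch, not part of the existing tooling: the function name `check_after_rules` is hypothetical, the rule list is copied from the Root Cause section, and it assumes each rule appears in the file as an exact, unindented line.

```bash
# Sketch: report which expected DOCKER-USER rules are absent from a UFW rules file.
# Assumption: rules appear as exact whole lines (no leading whitespace, same spacing).
# Usage: check_after_rules /etc/ufw/after.rules
# Prints one "MISSING: ..." line per absent rule; returns 1 if any are missing.
check_after_rules() {
  rules_file=$1
  missing=0
  while IFS= read -r rule; do
    # -F fixed string, -x whole-line match, -q quiet; "--" because rules start with "-"
    if ! grep -Fxq -- "$rule" "$rules_file"; then
      printf 'MISSING: %s\n' "$rule"
      missing=1
    fi
  done <<'EOF'
-A DOCKER-USER -m conntrack --ctstate RELATED,ESTABLISHED -j RETURN
-A DOCKER-USER -p udp -m udp --dport 53 -j RETURN
-A DOCKER-USER -p tcp -m tcp --dport 53 -j RETURN
-A DOCKER-USER -i docker+ -o eth0 -j RETURN
EOF
  return "$missing"
}
```

Run against the backup taken during the fix, this should list the rules production lacked; run against the current file, it should print nothing and return 0. Reading `/etc/ufw/after.rules` may require `sudo` depending on file permissions.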
## Action Items

- [ ] Investigate which plugin registers `image-optimization/cleanup/stuck-operation` and fix or remove it
- [ ] Audit Ansible playbook vs production state to identify other drift
- [ ] Consider running Ansible against production with `--check --diff` to see what would change before applying
- [ ] Add a monitoring check for container outbound connectivity (e.g., Uptime Kuma ping to external host from inside a container)
- [ ] Document WooCommerce memory impact: WordPress container went from ~300-400MB to ~728MB

## Diagnostic Commands Used

```bash
# Check per-container resources
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"

# Test PHP render time
time docker exec wordpress curl -s -o /dev/null -w "%{http_code}" http://localhost/

# Test container outbound access
docker exec wordpress php -r "var_dump(file_get_contents('http://google.com'));"

# Compare DOCKER-USER iptables rules (run on both production and staging)
sudo iptables -L DOCKER-USER -n -v

# Check UFW after.rules
sudo grep -A 20 "DOCKER-USER" /etc/ufw/after.rules
```
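
The outbound test above could be wrapped into a repeatable check along the lines of the monitoring action item. This is a rough sketch under stated assumptions: `check_outbound` and its defaults are hypothetical, and it relies on `curl` being available inside the container (as it was in `wordpress` during this session).

```bash
# Sketch: succeed only if a container can fetch an external URL within a time limit.
# Assumptions: curl exists inside the container; the default container, URL, and
# timeout are illustrative and should be adjusted per environment.
# Usage: check_outbound [container] [url] [timeout_seconds]
check_outbound() {
  container=${1:-wordpress}
  url=${2:-https://google.com}
  timeout_s=${3:-5}
  # -f: fail on HTTP errors, -sS: silent but show errors, --max-time: overall limit
  if docker exec "$container" curl -fsS -o /dev/null --max-time "$timeout_s" "$url"; then
    echo "OK: $container reached $url"
  else
    echo "FAIL: $container could not reach $url within ${timeout_s}s" >&2
    return 1
  fi
}
```

A short timeout matters here: the failure mode in this outage was slow connection timeouts, so the check should fail fast rather than hang. The non-zero exit status makes it usable from cron or any monitor that alerts on a failed command.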