Create ufw-docker-outage-fix.md via n8n
parent 1897d2b64d
commit 66a6b55f20

PBS/Inbox/ufw-docker-outage-fix.md (new file, 125 lines added)

---
project: ufw-docker-outage-fix
type: session-notes
status: completed
tags:
  - pbs
  - docker
  - traefik
  - production
  - ufw
  - security
  - woocommerce
---

# Server Outage & UFW Docker Rules Fix

## Summary

The production site became nearly unresponsive after a server reboot. The
root cause was an incomplete set of UFW firewall rules in
`/etc/ufw/after.rules` on production, which left Docker containers without
outbound internet access. WordPress plugins that make external HTTP calls
(WooCommerce, Jetpack, Yoast, etc.) timed out on every page load, producing
60-second render times.

## Timeline

- Server became unresponsive overnight and required a reboot from the Linode
  dashboard
- Site loaded, but extremely slowly (15s+, then timeouts)
- WordPress container showed a 60-second homepage render time
- Static files were served in ~89ms, which confirmed PHP processing was the
  bottleneck
- MySQL processlist was clean, so this was not a database issue
- Discovered the WordPress container could not reach the internet
  (`curl google.com` failed, `ping 8.8.8.8` showed 100% packet loss)
- Compared the `DOCKER-USER` iptables chain between production and staging
- Production was missing critical rules that staging had (listed under Root
  Cause)
- Root cause: `after.rules` on production held an older version of the
  Docker firewall rules that was never updated after the Ansible playbook
  improvements

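The production-versus-staging comparison can be sketched as a small helper
that prints the rules present in one dump but absent from the other. This is
a minimal sketch, not the exact procedure used in the session: the function
name `missing_rules`, the dump file names, and the `staging`/`production` SSH
aliases in the usage comment are all assumptions.

```bash
#!/usr/bin/env bash
# missing_rules REFERENCE TARGET
# Print every line of REFERENCE that does not appear verbatim in TARGET.
# Both files are expected to hold `iptables -S DOCKER-USER` output.
missing_rules() {
  # -F: fixed strings, -x: match whole lines, -v: invert the match,
  # -f: read the patterns (the target's rules) from a file.
  grep -Fxv -f "$2" "$1"
}

# Hypothetical usage (the host aliases are assumptions):
#   ssh staging    sudo iptables -S DOCKER-USER > staging.rules
#   ssh production sudo iptables -S DOCKER-USER > production.rules
#   missing_rules staging.rules production.rules
```

Comparing `iptables -S` output line by line is order-insensitive here, which
suits a "what is missing" question; rule order still matters to iptables
itself, so a full `diff` of the two dumps is worth a look as well.
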
## Root Cause

Production `/etc/ufw/after.rules` was missing:

```
-A DOCKER-USER -m conntrack --ctstate RELATED,ESTABLISHED -j RETURN
-A DOCKER-USER -p udp -m udp --dport 53 -j RETURN
-A DOCKER-USER -p tcp -m tcp --dport 53 -j RETURN
-A DOCKER-USER -i docker+ -o eth0 -j RETURN
```

Without these rules, containers could receive inbound traffic but could not
initiate outbound connections. The site worked before the reboot because
Docker's own iptables rules provided outbound access. On reboot, UFW
reloaded from `after.rules` and overwrote them with the incomplete ruleset.

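To confirm whether a given rules file carries the four lines above, a check
along these lines may help. It is a sketch under assumptions: the function
name `check_docker_user_rules` is hypothetical, and it matches each rule as a
literal substring, so a rule written with different option ordering would be
reported as missing.

```bash
#!/usr/bin/env bash
# check_docker_user_rules FILE
# Print a MISSING line for each required rule that FILE lacks.
# Returns 0 when all four rules are present, 1 otherwise.
check_docker_user_rules() {
  rules_file="$1"
  rc=0
  while IFS= read -r rule; do
    # -F: literal match, -q: quiet, --: the rule itself starts with a dash
    if ! grep -qF -- "$rule" "$rules_file"; then
      echo "MISSING: $rule"
      rc=1
    fi
  done <<'EOF'
-A DOCKER-USER -m conntrack --ctstate RELATED,ESTABLISHED -j RETURN
-A DOCKER-USER -p udp -m udp --dport 53 -j RETURN
-A DOCKER-USER -p tcp -m tcp --dport 53 -j RETURN
-A DOCKER-USER -i docker+ -o eth0 -j RETURN
EOF
  return "$rc"
}

# Example: check_docker_user_rules /etc/ufw/after.rules  (may need sudo to read)
```
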
## Fix Applied

1. Backed up production `after.rules`:
   `sudo cp /etc/ufw/after.rules /etc/ufw/after.rules.backup.2026-03-22`
2. Replaced production `after.rules` with staging's version (which matches
   the current Ansible playbook)
3. Ran `sudo ufw reload`
4. Verified: `docker exec traefik ping -c 2 8.8.8.8` returned 0% packet loss
5. Homepage render time: 60 seconds → 276 milliseconds

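The steps above could be wrapped in a script roughly like the following. It
is a sketch, not the commands as they were run: the `run` and
`apply_after_rules` helpers, the `DRY_RUN` switch, and the idea that the new
rules file already sits at a local path are assumptions. Dry-run is the
default, so nothing is changed unless `DRY_RUN=0` is set explicitly.

```bash
#!/usr/bin/env bash
# run CMD...: print the command in dry-run mode (the default), else execute it.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# apply_after_rules NEW_RULES_FILE BACKUP_STAMP
# Back up the live after.rules, install the new file, reload UFW, then
# verify outbound connectivity from the traefik container.
# Each step runs only if the previous one succeeded.
apply_after_rules() {
  new_rules="$1"
  stamp="$2"
  run sudo cp /etc/ufw/after.rules "/etc/ufw/after.rules.backup.$stamp" &&
    run sudo cp "$new_rules" /etc/ufw/after.rules &&
    run sudo ufw reload &&
    run docker exec traefik ping -c 2 8.8.8.8
}

# Hypothetical usage (the source path is an assumption):
#   apply_after_rules /tmp/after.rules.staging 2026-03-22            # preview
#   DRY_RUN=0 apply_after_rules /tmp/after.rules.staging 2026-03-22  # apply
```
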
## Additional Cleanup

- Cleaned 8,555 failed Action Scheduler tasks from the
  `wp_actionscheduler_actions` table (caused by the
  `image-optimization/cleanup/stuck-operation` hook accumulating since
  December 2025)
- Cleaned 1,728 completed actions
- Flushed Redis cache

## Key Learnings

- **UFW + Docker is fragile on reboot:** Docker's runtime iptables rules
  can mask an incomplete UFW `after.rules` config. Everything works until a
  reboot wipes Docker's rules and UFW reasserts its own.
- **Always re-run Ansible after playbook changes:** The playbook was
  updated with the correct Docker rules but never re-applied to production.
  Staging got the fix, production didn't.
- **Container outbound networking failure presents as slow PHP:** Plugins
  making external HTTP calls block the entire page render while they wait
  for connection timeouts. It looks like a performance problem but is
  actually a networking problem.
- **Cold cache + broken networking = compounding failure:** After reboot,
  no Redis cache + no opcode cache + plugins timing out on external calls =
  catastrophic page load times.
- **WooCommerce was a red herring:** It added overhead but wasn't the root
  cause. The real issue predated the WooCommerce install.

## Action Items

- [ ] Investigate which plugin registers
  `image-optimization/cleanup/stuck-operation` and fix or remove it
- [ ] Audit Ansible playbook vs production state and identify other drift
- [ ] Consider running Ansible against production with `--check --diff` to
  see what would change before applying
- [ ] Add a monitoring check for container outbound connectivity (e.g.,
  Uptime Kuma ping to external host from inside a container)
- [ ] Document WooCommerce memory impact: WordPress container went from
  ~300-400MB to ~728MB

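For the outbound-connectivity monitoring item, one possible starting point
is a script that a cron job or a monitoring tool runs on the host. This is a
sketch under assumptions: the function name `check_outbound`, the
`DOCKER_BIN` override, the default URL `https://example.com`, and the
5-second timeout are all illustrative, and it relies on `curl` being
available inside the target container.

```bash
#!/usr/bin/env bash
# check_outbound CONTAINER [URL]
# Report whether CONTAINER can complete an HTTP request to URL.
# Returns 0 on success and 1 on failure, so a scheduler can alert on it.
check_outbound() {
  container="$1"
  url="${2:-https://example.com}"
  # DOCKER_BIN can be overridden so the check can be exercised without Docker.
  # -f: fail on HTTP errors, -sS: quiet but show errors, --max-time: give up
  # after 5 seconds instead of hanging the way the page renders did.
  if "${DOCKER_BIN:-docker}" exec "$container" \
      curl -fsS --max-time 5 -o /dev/null "$url"; then
    echo "OK: $container can reach $url"
  else
    echo "FAIL: $container cannot reach $url"
    return 1
  fi
}

# Example: check_outbound wordpress
```

The short timeout is the point of the design: the outage surfaced as slow
pages because each failed call waited out its full timeout, so a probe that
fails fast gives a clearer signal.
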
## Diagnostic Commands Used

```bash
# Check per-container resources
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"

# Test PHP render time
time docker exec wordpress curl -s -o /dev/null -w "%{http_code}" http://localhost/

# Test container outbound access
docker exec wordpress php -r "var_dump(file_get_contents('http://google.com'));"

# List DOCKER-USER iptables rules (run on each host to compare)
sudo iptables -L DOCKER-USER -n -v

# Check UFW after.rules
sudo grep -A 20 "DOCKER-USER" /etc/ufw/after.rules
```