| project | type | status | tags | created | updated | path |
|---|---|---|---|---|---|---|
| server-stability-and-security-hardening | session-notes | active | | 2026-03-23 | 2026-03-23 | PBS/Tech/Sessions/ |
# Server Stability, Security Hardening & Staging Fixes - March 23, 2026

## Session Summary
Marathon session covering three major areas: (1) production server crash investigation and MySQL/WordPress memory capping, (2) staging Traefik upgrade and debugging, and (3) Cloudflare security and caching improvements. Two server crashes in 48 hours traced to MySQL OOM kills, with a third event tonight traced to WordPress memory bloat caused by bot traffic bursts. All three issues now mitigated with layered defenses.
## Part 1: Production — MySQL OOM Investigation & Fix

### Root Cause Confirmed
Both crashes (Saturday 3/22 ~6AM ET, Monday 3/23 ~6:20AM ET) were caused by MySQL being OOM-killed by the Linux kernel. Confirmed via journalctl:
- Saturday: `Out of memory: Killed process 4138817 (mysqld) total-vm:1841380kB`
- Monday: `Out of memory: Killed process 13015 (mysqld) total-vm:1828060kB`
- Both followed the same pattern: MySQL OOM-killed → Docker restarts → system still starved → swapoff killed → cascading failure → manual Linode reboot
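These kernel lines have a fixed shape, so the killed process and its size can be pulled out programmatically when scanning journals. A small sketch (the helper and regex are illustrative, not existing server tooling):

```python
import re

# Matches kernel OOM-kill lines as seen in journalctl
OOM = re.compile(r"Killed process (\d+) \((\S+)\) total-vm:(\d+)kB")

def parse_oom(line):
    """Return pid/name/total-vm (MiB) from an OOM-kill line, or None."""
    m = OOM.search(line)
    if not m:
        return None
    pid, name, kb = m.groups()
    return {"pid": int(pid), "name": name, "total_vm_mib": int(kb) // 1024}

print(parse_oom("Out of memory: Killed process 4138817 (mysqld) total-vm:1841380kB"))
# → {'pid': 4138817, 'name': 'mysqld', 'total_vm_mib': 1798}
```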
### Server Timezone Note
Production server runs in UTC. Subtract 4 hours for Eastern time. Both crashes appeared as ~10AM UTC in logs but were ~6AM Eastern.
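When scripting around these logs, a zoneinfo conversion avoids hard-coding the 4-hour offset (which only holds while daylight saving time is in effect). A small sketch using the Monday crash time:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Monday crash, ~10:20 UTC as it appeared in journalctl
utc_ts = datetime(2026, 3, 23, 10, 20, tzinfo=timezone.utc)

# astimezone() applies the correct EST/EDT offset for the given date
eastern = utc_ts.astimezone(ZoneInfo("America/New_York"))
print(eastern.strftime("%Y-%m-%d %H:%M %Z"))
# → 2026-03-23 06:20 EDT
```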
### Journal Persistence Confirmed
- `/var/log/journal` exists and journals survive reboots
- `journalctl --list-boots` shows 5 boot sessions back to May 2025
- For large time ranges, use `--since`/`--until` flags to avoid hanging
### Investigation Results
- WooCommerce Action Scheduler: Cleared — all tasks showed completed status
- Wordfence Scans: Scan log showed ~1 minute scan on 3/19 at 10PM ET — doesn't align with crash window; scan schedule is automatic on free tier (no manual control)
- htop threads: Multiple MySQL rows in htop are threads, not processes — press `H` to toggle thread view
### MySQL Memory Cap Applied
Added to the `mysql` service in `/opt/docker/wordpress/compose.yml`:
```yaml
mysql:
  image: mysql:8.0
  container_name: wordpress_mysql
  restart: unless-stopped
  deploy:
    resources:
      limits:
        memory: 768M
      reservations:
        memory: 256M
  command: >-
    --default-authentication-plugin=mysql_native_password
    --innodb-buffer-pool-size=256M
    --innodb-log-buffer-size=16M
    --max-connections=50
    --key-buffer-size=16M
    --tmp-table-size=32M
    --max-heap-table-size=32M
    --table-open-cache=256
    --performance-schema=OFF
```
Key tuning notes:
- `performance-schema=OFF` saves ~200-400MB alone
- `max-connections=50` reduced from the default of 151
- `innodb-buffer-pool-size=256M` caps InnoDB's biggest memory consumer
Result: MySQL dropped from 474MB (uncapped) to ~225MB (capped at 768MB, using 29% of cap)
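As a sanity check on the 768M cap, the flags above can be folded into a rough worst-case estimate. This is a back-of-envelope sketch — the ~3 MiB per-connection figure is an assumption, not a measured value:

```python
# Fixed buffers from the compose command above (MiB)
buffer_pool = 256   # --innodb-buffer-pool-size
log_buffer = 16     # --innodb-log-buffer-size
key_buffer = 16     # --key-buffer-size

# Per-connection overhead (sort/join/read/net buffers) — assumed ~3 MiB each
max_connections = 50
per_connection = 3

estimate = buffer_pool + log_buffer + key_buffer + max_connections * per_connection
print(f"~{estimate} MiB worst-case baseline vs 768 MiB cap")
# → ~438 MiB worst-case baseline vs 768 MiB cap
```

In-memory temp tables (32M each) can push past this under heavy query load, which is why headroom below the cap matters.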
### Memory Monitoring Script Deployed
Created `/usr/local/bin/docker-mem-log.sh` — logs per-container memory every 5 minutes:
```bash
#!/bin/bash
LOG_FILE="/var/log/pbs-monitoring/container-memory.log"
echo "$(date -u '+%Y-%m-%d %H:%M:%S UTC') | $(docker stats --no-stream --format '{{.Name}}:{{.MemUsage}}' | tr '\n' ' ')" >> "$LOG_FILE"
```
Cron: `/etc/cron.d/docker-mem-monitor`
```
*/5 * * * * root /usr/local/bin/docker-mem-log.sh
```
Check with: `tail -20 /var/log/pbs-monitoring/container-memory.log`
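This log grows without bound (rotation is listed under Still Open). A minimal logrotate sketch, assuming logrotate is installed — the drop-in filename is hypothetical:

```
# /etc/logrotate.d/pbs-monitoring
/var/log/pbs-monitoring/container-memory.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
}
```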
## Part 2: Production — WordPress Memory Spike & Bot Traffic Discovery

### Memory Monitoring Pays Off
The monitoring script caught a WordPress memory spike in real time:
| Time (UTC) | WordPress | MySQL |
|---|---|---|
| 02:15 | 1.12 GB | 245 MB |
| 02:20 | 2.34 GB | 178 MB |
| 02:30 | 2.91 GB | 141 MB |
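Spikes like this can also be flagged after the fact by scanning the monitoring log. A sketch assuming the `name:used / limit` entries written by `docker-mem-log.sh` above; the 1500 MiB threshold and sample line are illustrative:

```python
import re

# One log line looks like (per the cron script in Part 1):
#   2026-03-24 02:20:15 UTC | wordpress:2.34GiB / 3.8GiB mysql:178MiB / 768MiB
# Only "name:used" pairs carry a colon, so limits ("/ 3.8GiB") are skipped.
ENTRY = re.compile(r"(\S+):([\d.]+)(MiB|GiB)")

def flag_spikes(lines, threshold_mib=1500):
    """Return (timestamp, container, MiB) tuples above threshold_mib."""
    spikes = []
    for line in lines:
        ts, _, stats = line.partition(" | ")
        for name, value, unit in ENTRY.findall(stats):
            mib = float(value) * (1024 if unit == "GiB" else 1)
            if mib > threshold_mib:
                spikes.append((ts, name, round(mib)))
    return spikes

sample = "2026-03-24 02:20:15 UTC | wordpress:2.34GiB / 3.8GiB mysql:178MiB / 768MiB"
print(flag_spikes([sample]))
# → [('2026-03-24 02:20:15 UTC', 'wordpress', 2396)]
```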
### Root Cause: Bot Traffic Burst
WordPress access logs at 02:16:59 UTC showed 10+ near-simultaneous requests within a 3-second window:
- Multiple IPs hitting homepage simultaneously via Cloudflare
- Requests for random `.flac` and `.webm` files (classic bot probing)
- All using `http://` referrers (not `https://`) — not legitimate traffic
- Mix of spoofed user agents designed to look like different browsers
- Each uncached request spawned a PHP process, causing WordPress to spike to 2.9GB
### WordPress Memory Cap Applied
Added to the `wordpress` service in `/opt/docker/wordpress/compose.yml`:
```yaml
deploy:
  resources:
    limits:
      memory: 2000M
```
Result: WordPress now capped at ~2GB, currently running at ~866MB (43% of cap)
### Cloudflare Traffic Analysis
24-hour stats showed 11.72k total requests with 10.4k uncached (89%). Two visible traffic spikes aligned with crash events.
## Part 3: Cloudflare Security & Caching Hardening

### Security Changes
- Bot Fight Mode — Enabled (Security → Settings)
- WAF Rule: Block suspicious file probes — blocks requests ending in `.flac`, `.webm`, `.exe`, `.dll`
- Rate Limiting Rule: Homepage spam — 30 requests per 10 seconds per IP, blocks for 10 seconds
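For reference, the file-probe rule can be expressed in Cloudflare's rules language roughly as below. This is a sketch reconstructed from the description above — verify field and function names in the dashboard's expression editor before deploying:

```
ends_with(http.request.uri.path, ".flac") or
ends_with(http.request.uri.path, ".webm") or
ends_with(http.request.uri.path, ".exe") or
ends_with(http.request.uri.path, ".dll")
```

The rate-limiting rule pairs a match expression (e.g. `http.request.uri.path eq "/"`) with IP as the counting characteristic, a 10-second period, a 30-request threshold, and a 10-second block action.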
### Caching Changes
- Browser Cache TTL — Increased from 4 hours to 1 day
- Always Online — Enabled (serves cached pages when server is down)
- Cache Rule — Applied Cloudflare's "Cache Everything" template:
  - Cache eligibility: Eligible for cache
  - Edge TTL: Overrides origin cache-control headers
  - Browser TTL: Set
  - Serve stale while revalidating: Enabled
Important: After publishing new content, purge cache via Caching → Configuration → Purge Cache
## Part 4: Staging — Traefik Upgrade & Debugging

### Docker API Version Mismatch
`apt-get upgrade` on staging updated Docker Engine to v29.2.1 (API v1.53, minimum client API v1.44). Traefik v3.5's built-in Docker client only spoke API v1.24 → Docker rejected all Traefik requests → entire site down.
Fix: Updated Traefik from v3.5 to v3.6.11
- v3.6.11 includes Docker API auto-negotiation fix
- Also patches 3 CVEs (CVE-2026-32595, CVE-2026-32305, CVE-2026-32695)
Production impact: Must update Traefik on production before running `apt-get upgrade`, or the same break will occur. Update Traefik first, then Docker.
### WordPress Unhealthy Container Issue
After Traefik upgrade, WordPress showed as "unhealthy" → Traefik v3.6 respects Docker health status and skips unhealthy containers → site returned 404.
Root cause: The MySQL `.env` password contained a `$` character, which Docker Compose interprets as variable substitution. The password was silently corrupted → WordPress couldn't connect to MySQL → healthcheck failed → Traefik wouldn't route.
Fix: Escaped the `$` characters in the `.env` file. For future reference: `$` must be doubled (`$$`) in Docker `.env` files.
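For example (illustrative password, not a real credential) — Compose expands `$VAR` in interpolated values, so a literal dollar sign must be written `$$`:

```
# .env
# Wrong — compose tries to expand "$word123" as a variable:
#   MYSQL_PASSWORD=pa$word123
# Right — '$$' produces a literal '$':
MYSQL_PASSWORD=pa$$word123
```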
Lesson: Traefik v3.6+ skips unhealthy containers entirely — they won't show up as routers in the dashboard.
### PBS Manager Web App (Staging)
- Healthcheck using `curl` fails on `python:3.13-slim` (curl is not installed)
- Fix: Use a Python-based healthcheck instead:
```yaml
healthcheck:
  test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:5000/api/health')"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 30s
```
- Code changes require `docker compose up -d --build` (not just `--force-recreate`)
- SQLAlchemy models must stay in sync with database schema changes
## Layered Defense Summary
| Layer | What It Does | Status |
|---|---|---|
| Cloudflare Bot Fight Mode | Auto-blocks known bots | ✅ Enabled |
| Cloudflare WAF rules | Blocks file probes (.flac, .webm, .exe, .dll) | ✅ Deployed |
| Cloudflare Rate Limiting | 30 req/10s per IP on homepage | ✅ Deployed |
| Cloudflare Caching | Cache everything, serve stale while revalidating | ✅ Deployed |
| Cloudflare Always Online | Serves cached site during outages | ✅ Enabled |
| WordPress memory cap | 2GB limit prevents runaway PHP | ✅ Applied |
| MySQL memory cap | 768MB limit with tuned buffers | ✅ Applied |
| Memory monitoring | Logs per-container stats every 5 min | ✅ Running |
| Journal persistence | OOM kill logs survive reboots | ✅ Confirmed |
## Current Production Memory Snapshot (post-fixes)
| Container | Memory | Limit | % of Limit |
|---|---|---|---|
| wordpress | 866 MB | 2,000 MB | 43% |
| n8n | 341 MB | System | 9% |
| wordpress_mysql | 190 MB | 768 MB | 25% |
| uptime-kuma | 124 MB | System | 3% |
| traefik | 56 MB | System | 1% |
| redis | 17 MB | 640 MB | 3% |
| wpcron | 16 MB | System | <1% |
| pbs-api | 14 MB | System | <1% |
| Total | ~1.62 GB | — | — |
## Still Open
- Monitor overnight stability — check memory logs tomorrow AM
- Monitor Cloudflare cache hit rate over next 24 hours (should improve dramatically)
- Add log rotation for `/var/log/pbs-monitoring/container-memory.log`
- Update Traefik on production to v3.6.11 before running `apt-get upgrade`
- Disable `apt-daily.service` on production (automatic unattended updates)
- Investigate Cloudflare cache hit rate for wp-admin bypass if admin pages serve stale content
- Server sizing discussion still open — 4GB may be tight for Gitea + Authelia
- PBS Manager web app healthcheck and basicauth fixes on staging
- Consider Watchtower on staging only as a canary (discussed and decided against for production)
## Key Learnings
- Docker `.env` files treat `$` as variable substitution — double it (`$$`) or avoid `$` in passwords entirely
- Traefik v3.6+ skips unhealthy containers — if a container's healthcheck fails, Traefik won't route to it (no error, just missing from the dashboard)
- `docker compose up -d --force-recreate` only recreates from the existing image; use `--build` for code changes
- Docker API versions ≠ Docker product versions — API v1.24 vs v1.44 are protocol versions, not Docker Engine versions
- `performance-schema=OFF` in MySQL saves ~200-400MB with no downside for WordPress
- 89% uncached Cloudflare traffic was caused by WordPress sending `no-cache` headers — override with an Edge TTL rule
- Bot traffic patterns: simultaneous requests from multiple IPs, random file probes, `http://` referrers, mixed user agents
- The memory monitoring script proved its value immediately — it caught the WordPress spike in real time
- Watchtower not recommended for production — prefer deliberate manual updates tested on staging first
- Always update Traefik before Docker Engine — newer Docker can require minimum API versions that old Traefik can't speak