---
project: server-stability-and-security-hardening
type: session-notes
status: active
tags:
  - pbs
  - docker
  - production
  - staging
  - wordpress
  - traefik
  - cloudflare
  - security
created: 2026-03-23
updated: 2026-03-23
path: PBS/Tech/Sessions/
---

Server Stability, Security Hardening & Staging Fixes - March 23, 2026

Session Summary

Marathon session covering three major areas: (1) production server crash investigation and MySQL/WordPress memory capping, (2) staging Traefik upgrade and debugging, and (3) Cloudflare security and caching improvements. Two server crashes in 48 hours traced to MySQL OOM kills, with a third event tonight traced to WordPress memory bloat caused by bot traffic bursts. All three issues now mitigated with layered defenses.


Part 1: Production — MySQL OOM Investigation & Fix

Root Cause Confirmed

Both crashes (Saturday 3/22 ~6AM ET, Monday 3/23 ~6:20AM ET) were caused by MySQL being OOM-killed by the Linux kernel. Confirmed via journalctl:

  • Saturday: Out of memory: Killed process 4138817 (mysqld) total-vm:1841380kB
  • Monday: Out of memory: Killed process 13015 (mysqld) total-vm:1828060kB
  • Both followed same pattern: MySQL OOM-killed → Docker restarts → system still starved → swapoff killed → cascading failure → manual Linode reboot

Server Timezone Note

Production server runs in UTC. Subtract 4 hours for Eastern time. Both crashes appeared as ~10AM UTC in logs but were ~6AM Eastern.
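The conversion is easy to double-check with GNU date (tzdata required); e.g. the Monday crash timestamp:

```shell
# Convert the Monday crash time from server time (UTC) to Eastern.
# Assumes GNU date and the IANA tzdata database are installed.
TZ=America/New_York date -d '2026-03-23 10:20 UTC' '+%H:%M %Z'
# prints: 06:20 EDT  (Eastern was on daylight time, UTC-4)
```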

Journal Persistence Confirmed

  • /var/log/journal exists and journals survive reboots
  • journalctl --list-boots shows 5 boot sessions back to May 2025
  • For large time ranges, use --since/--until flags to avoid hanging
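A bounded query that surfaces these kill events, sketched below (the live journalctl commands are shown as comments; the trailing grep runs against a sample line in the kernel's OOM format):

```shell
# On the server: search kernel messages in a bounded window (avoids the
# hang that unbounded queries can cause on large journals):
#   journalctl -k --since "2026-03-23 09:00" --until "2026-03-23 11:00" | grep -i "out of memory"
#   journalctl --list-boots   # enumerate boots the persistent journal covers

# The pattern matches kernel OOM lines shaped like the ones quoted above:
echo "Out of memory: Killed process 13015 (mysqld) total-vm:1828060kB" \
  | grep -oE 'Killed process [0-9]+ \([a-z_]+\)'
# prints: Killed process 13015 (mysqld)
```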

Investigation Results

  • WooCommerce Action Scheduler: Ruled out — all tasks showed completed status
  • Wordfence Scans: Scan log showed ~1 minute scan on 3/19 at 10PM ET — doesn't align with crash window; scan schedule is automatic on free tier (no manual control)
  • htop threads: Multiple MySQL rows in htop are threads, not processes — press H to toggle thread view

MySQL Memory Cap Applied

Added to mysql service in /opt/docker/wordpress/compose.yml:

mysql:
  image: mysql:8.0
  container_name: wordpress_mysql
  restart: unless-stopped
  deploy:
    resources:
      limits:
        memory: 768M
      reservations:
        memory: 256M
  command: >-
    --default-authentication-plugin=mysql_native_password
    --innodb-buffer-pool-size=256M
    --innodb-log-buffer-size=16M
    --max-connections=50
    --key-buffer-size=16M
    --tmp-table-size=32M
    --max-heap-table-size=32M
    --table-open-cache=256
    --performance-schema=OFF    

Key tuning notes:

  • performance-schema=OFF saves ~200-400MB alone
  • max-connections=50 reduced from default 151
  • innodb-buffer-pool-size=256M caps InnoDB's biggest memory consumer

Result: MySQL dropped from 474MB (uncapped) to ~225MB (capped at 768MB, using 29% of cap)

Memory Monitoring Script Deployed

Created /usr/local/bin/docker-mem-log.sh — logs per-container memory every 5 minutes:

#!/bin/bash
# Append one line per run: UTC timestamp, then name:mem-usage pairs for every
# running container (newlines from docker stats folded into spaces).
LOG_FILE="/var/log/pbs-monitoring/container-memory.log"
echo "$(date -u '+%Y-%m-%d %H:%M:%S UTC') | $(docker stats --no-stream --format '{{.Name}}:{{.MemUsage}}' | tr '\n' ' ')" >> "$LOG_FILE"

Cron: /etc/cron.d/docker-mem-monitor

*/5 * * * * root /usr/local/bin/docker-mem-log.sh

Check with: tail -20 /var/log/pbs-monitoring/container-memory.log
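This log currently grows without bound (rotation is listed under Still Open); a minimal logrotate drop-in sketch — path and retention are assumptions, nothing here is deployed yet:

```
# /etc/logrotate.d/pbs-monitoring  (hypothetical drop-in, not yet deployed)
/var/log/pbs-monitoring/container-memory.log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
}
```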


Part 2: Production — WordPress Memory Spike & Bot Traffic Discovery

Memory Monitoring Pays Off

The monitoring script caught a WordPress memory spike in real time:

| Time (UTC) | WordPress | MySQL |
|------------|-----------|-------|
| 02:15 | 1.12 GB | 245 MB |
| 02:20 | 2.34 GB | 178 MB |
| 02:30 | 2.91 GB | 141 MB |
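Spikes like this can be flagged directly from the monitoring log; a sketch assuming the `name:MemUsage` format the logging script emits (the sample lines are fabricated stand-ins for real entries):

```shell
# Flag samples where the wordpress container exceeds a threshold (in GiB).
# Sample lines mimic docker stats' "name:used / limit" MemUsage format.
awk -F'wordpress:' '/wordpress:/ {
    split($2, a, "GiB")                  # a[1] holds the number before "GiB"
    if (a[1] + 0 > 2.0) print "SPIKE:", $0
}' <<'EOF'
2026-03-24 02:15:00 UTC | wordpress:1.12GiB / 3.8GiB wordpress_mysql:245MiB / 768MiB
2026-03-24 02:20:00 UTC | wordpress:2.34GiB / 3.8GiB wordpress_mysql:178MiB / 768MiB
2026-03-24 02:30:00 UTC | wordpress:2.91GiB / 3.8GiB wordpress_mysql:141MiB / 768MiB
EOF
# prints the 02:20 and 02:30 lines prefixed with "SPIKE:"
```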

Root Cause: Bot Traffic Burst

WordPress access logs starting at 02:16:59 UTC showed more than ten near-simultaneous requests within three seconds:

  • Multiple IPs hitting homepage simultaneously via Cloudflare
  • Requests for random .flac and .webm files (classic bot probing)
  • All using http:// referrer (not https://) — not legitimate traffic
  • Mix of spoofed user agents designed to look like different browsers
  • Each uncached request spawned a PHP process, causing WordPress to spike to 2.9GB
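The probe traffic is easy to isolate in the access log by combining two of the tell-tale signs above; a sketch with fabricated sample lines (real IPs, paths, and referrers differ):

```shell
# Match requests for media files that also carry an http:// (not https://)
# referrer. Sample lines below are illustrative, in combined log format.
grep -E '"GET [^"]+\.(flac|webm)[^"]*" .* "http://' <<'EOF'
203.0.113.7 - - [24/Mar/2026:02:16:59 +0000] "GET /x9f2.flac HTTP/1.1" 404 512 "http://example.com/" "Mozilla/5.0"
203.0.113.9 - - [24/Mar/2026:02:17:00 +0000] "GET / HTTP/1.1" 200 10240 "https://example.com/" "Mozilla/5.0"
198.51.100.4 - - [24/Mar/2026:02:17:01 +0000] "GET /a1b2.webm HTTP/1.1" 404 512 "http://example.com/" "Mozilla/5.0"
EOF
# prints only the .flac and .webm probe lines
```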

WordPress Memory Cap Applied

Added to wordpress service in /opt/docker/wordpress/compose.yml:

deploy:
  resources:
    limits:
      memory: 2000M

Result: WordPress now capped at ~2GB, currently running at ~866MB (43% of cap)

Cloudflare Traffic Analysis

24-hour stats showed 11.72k total requests with 10.4k uncached (89%). Two visible traffic spikes aligned with crash events.


Part 3: Cloudflare Security & Caching Hardening

Security Changes

  1. Bot Fight Mode — Enabled (Security → Settings)
  2. WAF Rule: Block suspicious file probes — Blocks requests ending in .flac, .webm, .exe, .dll
  3. Rate Limiting Rule: Homepage spam — 30 requests per 10 seconds per IP, blocks for 10 seconds
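The file-probe block can be written in Cloudflare's rule expression language roughly as follows (a sketch; confirm field and function names against Cloudflare's rules-language docs before relying on it):

```
(ends_with(http.request.uri.path, ".flac")) or
(ends_with(http.request.uri.path, ".webm")) or
(ends_with(http.request.uri.path, ".exe")) or
(ends_with(http.request.uri.path, ".dll"))
```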

Caching Changes

  1. Browser Cache TTL — Increased from 4 hours to 1 day
  2. Always Online — Enabled (serves cached pages when server is down)
  3. Cache Rule — Applied Cloudflare "Cache Everything" template:
    • Cache eligibility: Eligible for cache
    • Edge TTL: Overrides origin cache-control headers
    • Browser TTL: Set
    • Serve stale while revalidating: Enabled

Important: After publishing new content, purge cache via Caching → Configuration → Purge Cache


Part 4: Staging — Traefik Upgrade & Debugging

Docker API Version Mismatch

apt-get upgrade on staging updated Docker Engine to v29.2.1 (API v1.53, minimum client API v1.44). Traefik v3.5's built-in Docker client only spoke API v1.24 → Docker rejected all Traefik requests → entire site down.

Fix: Updated Traefik from v3.5 to v3.6.11

  • v3.6.11 includes Docker API auto-negotiation fix
  • Also patches 3 CVEs (CVE-2026-32595, CVE-2026-32305, CVE-2026-32695)

Production impact: Must update Traefik on production before running apt-get upgrade, or the same break will occur. Update Traefik first, then Docker.
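Before touching production, the negotiated versions can be checked by hand; a sketch (the docker commands are for the server and use template field names to verify locally, the container name is assumed; the sort demo shows why these compare as version strings, not decimals):

```shell
# On the server:
#   docker version --format 'server API {{.Server.APIVersion}} (min {{.Server.MinAPIVersion}})'
#   docker inspect --format '{{.Config.Image}}' traefik   # container name assumed

# API versions are version strings: 1.9 < 1.24 < 1.44 (not decimal order)
printf '1.24\n1.9\n1.44\n' | sort -V | tail -1
# prints: 1.44
```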

WordPress Unhealthy Container Issue

After Traefik upgrade, WordPress showed as "unhealthy" → Traefik v3.6 respects Docker health status and skips unhealthy containers → site returned 404.

Root cause: the MySQL password in the .env file contained a $ character, which Docker Compose interprets as the start of a variable substitution. The password was silently corrupted → WordPress couldn't connect to MySQL → healthcheck failed → Traefik wouldn't route.

Fix: Escaped $ characters in .env file. For future reference: $ must be doubled ($$) in Docker .env files.
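A minimal before/after sketch of the .env fix (the password value is a made-up placeholder):

```
# .env — BROKEN: Compose expands "$Xy9" as a variable reference
MYSQL_PASSWORD=abc$Xy9secret

# .env — FIXED: double the dollar sign so a literal "$" survives
MYSQL_PASSWORD=abc$$Xy9secret
```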

Lesson: Traefik v3.6+ skips unhealthy containers entirely — they won't show up as routers in the dashboard.

PBS Manager Web App (Staging)

  • Healthcheck using curl fails on python:3.13-slim (curl not installed)
  • Fix: Use Python-based healthcheck instead:
healthcheck:
  test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:5000/api/health')"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 30s
  • Code changes require docker compose up -d --build (not just --force-recreate)
  • SQLAlchemy models must stay in sync with database schema changes
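For debugging the probe interactively, the same stdlib check can be run by hand inside the container; a sketch that prints status instead of exiting nonzero (the URL is the app's health endpoint from the healthcheck; the compose healthcheck is what actually runs):

```shell
# The image lacks curl, so probe with the interpreter that is guaranteed
# to be present. Prints "healthy" on any 2xx/3xx response.
python3 - <<'EOF'
import urllib.request, urllib.error
try:
    urllib.request.urlopen("http://localhost:5000/api/health", timeout=5)
    print("healthy")
except (urllib.error.URLError, OSError):
    print("unhealthy")   # non-success status or connection failure
EOF
```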

Layered Defense Summary

| Layer | What It Does | Status |
|-------|--------------|--------|
| Cloudflare Bot Fight Mode | Auto-blocks known bots | Enabled |
| Cloudflare WAF rules | Blocks file probes (.flac, .webm, .exe, .dll) | Deployed |
| Cloudflare Rate Limiting | 30 req/10s per IP on homepage | Deployed |
| Cloudflare Caching | Cache everything, serve stale while revalidating | Deployed |
| Cloudflare Always Online | Serves cached site during outages | Enabled |
| WordPress memory cap | 2GB limit prevents runaway PHP | Applied |
| MySQL memory cap | 768MB limit with tuned buffers | Applied |
| Memory monitoring | Logs per-container stats every 5 min | Running |
| Journal persistence | OOM kill logs survive reboots | Confirmed |

Current Production Memory Snapshot (post-fixes)

| Container | Memory | Limit | % of Limit |
|-----------|--------|-------|------------|
| wordpress | 866 MB | 2,000 MB | 43% |
| n8n | 341 MB | System | 9% |
| wordpress_mysql | 190 MB | 768 MB | 25% |
| uptime-kuma | 124 MB | System | 3% |
| traefik | 56 MB | System | 1% |
| redis | 17 MB | 640 MB | 3% |
| wpcron | 16 MB | System | <1% |
| pbs-api | 14 MB | System | <1% |
| Total | ~1.62 GB | | |

Still Open

  • Monitor overnight stability — check memory logs tomorrow AM
  • Monitor Cloudflare cache hit rate over next 24 hours (should improve dramatically)
  • Add log rotation for /var/log/pbs-monitoring/container-memory.log
  • Update Traefik on production to v3.6.11 before running apt-get upgrade
  • Disable apt-daily.service on production (automatic unattended updates)
  • Investigate Cloudflare cache hit rate for wp-admin bypass if admin pages serve stale content
  • Server sizing discussion still open — 4GB may be tight for Gitea + Authelia
  • PBS Manager web app healthcheck and basicauth fixes on staging
  • Consider Watchtower on staging only as a canary (discussed and decided against for production)

Key Learnings

  • Docker .env files treat $ as variable substitution — double it ($$) or avoid $ in passwords entirely
  • Traefik v3.6+ skips unhealthy containers — if a container's healthcheck fails, Traefik won't route to it (no error, just missing from dashboard)
  • docker compose up -d --force-recreate only recreates from existing image; use --build for code changes
  • Docker API versions ≠ Docker product versions — API v1.24 vs v1.44 are protocol versions, not Docker Engine versions
  • performance-schema=OFF in MySQL saves ~200-400MB with no downside for WordPress
  • 89% uncached Cloudflare traffic was caused by WordPress sending no-cache headers — override with Edge TTL rule
  • Bot traffic patterns: simultaneous requests from multiple IPs, random file probes, http:// referrers, mixed user agents
  • Memory monitoring script proved its value immediately — caught WordPress spike in real time
  • Watchtower not recommended for production — prefer deliberate manual updates tested on staging first
  • Always update Traefik before Docker Engine — newer Docker can require minimum API versions that old Traefik can't speak