---
project: server-stability-and-security-hardening
type: session-notes
status: active
tags:
  - pbs
  - docker
  - production
  - staging
  - wordpress
  - traefik
  - cloudflare
  - security
created: 2026-03-23
updated: 2026-03-23
path: PBS/Tech/Sessions/
---

Server Stability, Security Hardening & Staging Fixes - March 23, 2026

Session Summary

Marathon session covering three major areas: (1) production server crash investigation and MySQL/WordPress memory capping, (2) staging Traefik upgrade and debugging, and (3) Cloudflare security and caching improvements. Two server crashes in 48 hours traced to MySQL OOM kills, with a third event tonight traced to WordPress memory bloat caused by bot traffic bursts. All three issues now mitigated with layered defenses.


Part 1: Production — MySQL OOM Investigation & Fix

Root Cause Confirmed

Both crashes (Saturday 3/22 ~6AM ET, Monday 3/23 ~6:20AM ET) were caused by MySQL being OOM-killed by the Linux kernel. Confirmed via journalctl:

  • Saturday: Out of memory: Killed process 4138817 (mysqld) total-vm:1841380kB
  • Monday: Out of memory: Killed process 13015 (mysqld) total-vm:1828060kB
  • Both followed same pattern: MySQL OOM-killed → Docker restarts → system still starved → swapoff killed → cascading failure → manual Linode reboot

Server Timezone Note

Production server runs in UTC. Subtract 4 hours for Eastern time. Both crashes appeared as ~10AM UTC in logs but were ~6AM Eastern.
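The conversion is easy to double-check with GNU date (tzdata required); e.g. the Monday crash timestamp:

```shell
# Convert the Monday crash time from server time (UTC) to Eastern.
# Assumes GNU date and the IANA tzdata database are installed.
TZ=America/New_York date -d '2026-03-23 10:20 UTC' '+%H:%M %Z'
# prints: 06:20 EDT  (Eastern was on daylight time, UTC-4)
```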

Journal Persistence Confirmed

  • /var/log/journal exists and journals survive reboots
  • journalctl --list-boots shows 5 boot sessions back to May 2025
  • For large time ranges, use --since/--until flags to avoid hanging
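A bounded query that surfaces these kill events, sketched below (the live journalctl commands are shown as comments; the trailing grep runs against a sample line in the kernel's OOM format):

```shell
# On the server: search kernel messages in a bounded window (avoids the
# hang that unbounded queries can cause on large journals):
#   journalctl -k --since "2026-03-23 09:00" --until "2026-03-23 11:00" | grep -i "out of memory"
#   journalctl --list-boots   # enumerate boots the persistent journal covers

# The pattern matches kernel OOM lines shaped like the ones quoted above:
echo "Out of memory: Killed process 13015 (mysqld) total-vm:1828060kB" \
  | grep -oE 'Killed process [0-9]+ \([a-z_]+\)'
# prints: Killed process 13015 (mysqld)
```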

Investigation Results

  • WooCommerce Action Scheduler: Ruled out — all tasks showed completed status
  • Wordfence Scans: Scan log showed ~1 minute scan on 3/19 at 10PM ET — doesn't align with crash window; scan schedule is automatic on free tier (no manual control)
  • htop threads: Multiple MySQL rows in htop are threads, not processes — press H to toggle thread view

MySQL Memory Cap Applied

Added to mysql service in /opt/docker/wordpress/compose.yml:

mysql:
  image: mysql:8.0
  container_name: wordpress_mysql
  restart: unless-stopped
  deploy:
    resources:
      limits:
        memory: 768M
      reservations:
        memory: 256M
  command: >-
    --default-authentication-plugin=mysql_native_password
    --innodb-buffer-pool-size=256M
    --innodb-log-buffer-size=16M
    --max-connections=50
    --key-buffer-size=16M
    --tmp-table-size=32M
    --max-heap-table-size=32M
    --table-open-cache=256
    --performance-schema=OFF    

Key tuning notes:

  • performance-schema=OFF saves ~200-400MB alone
  • max-connections=50 reduced from default 151
  • innodb-buffer-pool-size=256M caps InnoDB's biggest memory consumer

Result: MySQL dropped from 474MB (uncapped) to ~225MB (capped at 768MB, using 29% of cap)

Memory Monitoring Script Deployed

Created /usr/local/bin/docker-mem-log.sh — logs per-container memory every 5 minutes:

#!/bin/bash
# Append one line per run: UTC timestamp, then name:mem-usage pairs for every
# running container (newlines from docker stats folded into spaces).
LOG_FILE="/var/log/pbs-monitoring/container-memory.log"
echo "$(date -u '+%Y-%m-%d %H:%M:%S UTC') | $(docker stats --no-stream --format '{{.Name}}:{{.MemUsage}}' | tr '\n' ' ')" >> "$LOG_FILE"

Cron: /etc/cron.d/docker-mem-monitor

*/5 * * * * root /usr/local/bin/docker-mem-log.sh

Check with: tail -20 /var/log/pbs-monitoring/container-memory.log
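This log currently grows without bound (rotation is listed under Still Open); a minimal logrotate drop-in sketch — path and retention are assumptions, nothing here is deployed yet:

```
# /etc/logrotate.d/pbs-monitoring  (hypothetical drop-in, not yet deployed)
/var/log/pbs-monitoring/container-memory.log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
}
```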


Part 2: Production — WordPress Memory Spike & Bot Traffic Discovery

Memory Monitoring Pays Off

The monitoring script caught a WordPress memory spike in real time:

| Time (UTC) | WordPress | MySQL |
|------------|-----------|-------|
| 02:15 | 1.12 GB | 245 MB |
| 02:20 | 2.34 GB | 178 MB |
| 02:30 | 2.91 GB | 141 MB |
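Spikes like this can be flagged directly from the monitoring log; a sketch assuming the `name:MemUsage` format the logging script emits (the sample lines are fabricated stand-ins for real entries):

```shell
# Flag samples where the wordpress container exceeds a threshold (in GiB).
# Sample lines mimic docker stats' "name:used / limit" MemUsage format.
awk -F'wordpress:' '/wordpress:/ {
    split($2, a, "GiB")                  # a[1] holds the number before "GiB"
    if (a[1] + 0 > 2.0) print "SPIKE:", $0
}' <<'EOF'
2026-03-24 02:15:00 UTC | wordpress:1.12GiB / 3.8GiB wordpress_mysql:245MiB / 768MiB
2026-03-24 02:20:00 UTC | wordpress:2.34GiB / 3.8GiB wordpress_mysql:178MiB / 768MiB
2026-03-24 02:30:00 UTC | wordpress:2.91GiB / 3.8GiB wordpress_mysql:141MiB / 768MiB
EOF
# prints the 02:20 and 02:30 lines prefixed with "SPIKE:"
```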

Root Cause: Bot Traffic Burst

WordPress access logs starting at 02:16:59 UTC showed more than ten near-simultaneous requests within three seconds:

  • Multiple IPs hitting homepage simultaneously via Cloudflare
  • Requests for random .flac and .webm files (classic bot probing)
  • All using http:// referrer (not https://) — not legitimate traffic
  • Mix of spoofed user agents designed to look like different browsers
  • Each uncached request spawned a PHP process, causing WordPress to spike to 2.9GB
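The probe traffic is easy to isolate in the access log by combining two of the tell-tale signs above; a sketch with fabricated sample lines (real IPs, paths, and referrers differ):

```shell
# Match requests for media files that also carry an http:// (not https://)
# referrer. Sample lines below are illustrative, in combined log format.
grep -E '"GET [^"]+\.(flac|webm)[^"]*" .* "http://' <<'EOF'
203.0.113.7 - - [24/Mar/2026:02:16:59 +0000] "GET /x9f2.flac HTTP/1.1" 404 512 "http://example.com/" "Mozilla/5.0"
203.0.113.9 - - [24/Mar/2026:02:17:00 +0000] "GET / HTTP/1.1" 200 10240 "https://example.com/" "Mozilla/5.0"
198.51.100.4 - - [24/Mar/2026:02:17:01 +0000] "GET /a1b2.webm HTTP/1.1" 404 512 "http://example.com/" "Mozilla/5.0"
EOF
# prints only the .flac and .webm probe lines
```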

WordPress Memory Cap Applied

Added to wordpress service in /opt/docker/wordpress/compose.yml:

deploy:
  resources:
    limits:
      memory: 2000M

Result: WordPress now capped at ~2GB, currently running at ~866MB (43% of cap)

Cloudflare Traffic Analysis

24-hour stats showed 11.72k total requests with 10.4k uncached (89%). Two visible traffic spikes aligned with crash events.


Part 3: Cloudflare Security & Caching Hardening

Security Changes

  1. Bot Fight Mode — Enabled (Security → Settings)
  2. WAF Rule: Block suspicious file probes — Blocks requests ending in .flac, .webm, .exe, .dll
  3. Rate Limiting Rule: Homepage spam — 30 requests per 10 seconds per IP, blocks for 10 seconds
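The file-probe block can be written in Cloudflare's rule expression language roughly as follows (a sketch; confirm field and function names against Cloudflare's rules-language docs before relying on it):

```
(ends_with(http.request.uri.path, ".flac")) or
(ends_with(http.request.uri.path, ".webm")) or
(ends_with(http.request.uri.path, ".exe")) or
(ends_with(http.request.uri.path, ".dll"))
```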

Caching Changes

  1. Browser Cache TTL — Increased from 4 hours to 1 day
  2. Always Online — Enabled (serves cached pages when server is down)
  3. Cache Rule — Applied Cloudflare "Cache Everything" template:
    • Cache eligibility: Eligible for cache
    • Edge TTL: Overrides origin cache-control headers
    • Browser TTL: Set
    • Serve stale while revalidating: Enabled

Important: After publishing new content, purge cache via Caching → Configuration → Purge Cache


Part 4: Staging — Traefik Upgrade & Debugging

Docker API Version Mismatch

apt-get upgrade on staging updated Docker Engine to v29.2.1 (API v1.53, minimum client API v1.44). Traefik v3.5's built-in Docker client only spoke API v1.24 → Docker rejected all Traefik requests → entire site down.

Fix: Updated Traefik from v3.5 to v3.6.11

  • v3.6.11 includes Docker API auto-negotiation fix
  • Also patches 3 CVEs (CVE-2026-32595, CVE-2026-32305, CVE-2026-32695)

Production impact: Must update Traefik on production before running apt-get upgrade, or the same break will occur. Update Traefik first, then Docker.
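Before touching production, the negotiated versions can be checked by hand; a sketch (the docker commands are for the server and use template field names to verify locally, the container name is assumed; the sort demo shows why these compare as version strings, not decimals):

```shell
# On the server:
#   docker version --format 'server API {{.Server.APIVersion}} (min {{.Server.MinAPIVersion}})'
#   docker inspect --format '{{.Config.Image}}' traefik   # container name assumed

# API versions are version strings: 1.9 < 1.24 < 1.44 (not decimal order)
printf '1.24\n1.9\n1.44\n' | sort -V | tail -1
# prints: 1.44
```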

WordPress Unhealthy Container Issue

After Traefik upgrade, WordPress showed as "unhealthy" → Traefik v3.6 respects Docker health status and skips unhealthy containers → site returned 404.

Root cause: the MySQL password in the .env file contained a $ character, which Docker Compose interprets as the start of a variable substitution. The password was silently corrupted → WordPress couldn't connect to MySQL → healthcheck failed → Traefik wouldn't route.

Fix: Escaped $ characters in .env file. For future reference: $ must be doubled ($$) in Docker .env files.
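A minimal before/after sketch of the .env fix (the password value is a made-up placeholder):

```
# .env — BROKEN: Compose expands "$Xy9" as a variable reference
MYSQL_PASSWORD=abc$Xy9secret

# .env — FIXED: double the dollar sign so a literal "$" survives
MYSQL_PASSWORD=abc$$Xy9secret
```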

Lesson: Traefik v3.6+ skips unhealthy containers entirely — they won't show up as routers in the dashboard.

PBS Manager Web App (Staging)

  • Healthcheck using curl fails on python:3.13-slim (curl not installed)
  • Fix: Use Python-based healthcheck instead:
healthcheck:
  test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:5000/api/health')"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 30s
  • Code changes require docker compose up -d --build (not just --force-recreate)
  • SQLAlchemy models must stay in sync with database schema changes
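For debugging the probe interactively, the same stdlib check can be run by hand inside the container; a sketch that prints status instead of exiting nonzero (the URL is the app's health endpoint from the healthcheck; the compose healthcheck is what actually runs):

```shell
# The image lacks curl, so probe with the interpreter that is guaranteed
# to be present. Prints "healthy" on any 2xx/3xx response.
python3 - <<'EOF'
import urllib.request, urllib.error
try:
    urllib.request.urlopen("http://localhost:5000/api/health", timeout=5)
    print("healthy")
except (urllib.error.URLError, OSError):
    print("unhealthy")   # non-success status or connection failure
EOF
```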

Layered Defense Summary

| Layer | What It Does | Status |
|-------|--------------|--------|
| Cloudflare Bot Fight Mode | Auto-blocks known bots | Enabled |
| Cloudflare WAF rules | Blocks file probes (.flac, .webm, .exe, .dll) | Deployed |
| Cloudflare Rate Limiting | 30 req/10s per IP on homepage | Deployed |
| Cloudflare Caching | Cache everything, serve stale while revalidating | Deployed |
| Cloudflare Always Online | Serves cached site during outages | Enabled |
| WordPress memory cap | 2GB limit prevents runaway PHP | Applied |
| MySQL memory cap | 768MB limit with tuned buffers | Applied |
| Memory monitoring | Logs per-container stats every 5 min | Running |
| Journal persistence | OOM kill logs survive reboots | Confirmed |

Current Production Memory Snapshot (post-fixes)

| Container | Memory | Limit | % of Limit |
|-----------|--------|-------|------------|
| wordpress | 866 MB | 2,000 MB | 43% |
| n8n | 341 MB | System | 9% |
| wordpress_mysql | 190 MB | 768 MB | 25% |
| uptime-kuma | 124 MB | System | 3% |
| traefik | 56 MB | System | 1% |
| redis | 17 MB | 640 MB | 3% |
| wpcron | 16 MB | System | <1% |
| pbs-api | 14 MB | System | <1% |
| Total | ~1.62 GB | | |

Still Open

  • Monitor overnight stability — check memory logs tomorrow AM
  • Monitor Cloudflare cache hit rate over next 24 hours (should improve dramatically)
  • Add log rotation for /var/log/pbs-monitoring/container-memory.log
  • Update Traefik on production to v3.6.11 before running apt-get upgrade
  • Disable apt-daily.service on production (automatic unattended updates)
  • Investigate Cloudflare cache hit rate for wp-admin bypass if admin pages serve stale content
  • Server sizing discussion still open — 4GB may be tight for Gitea + Authelia
  • PBS Manager web app healthcheck and basicauth fixes on staging
  • Consider Watchtower on staging only as a canary (discussed and decided against for production)

Key Learnings

  • Docker .env files treat $ as variable substitution — double it ($$) or avoid $ in passwords entirely
  • Traefik v3.6+ skips unhealthy containers — if a container's healthcheck fails, Traefik won't route to it (no error, just missing from dashboard)
  • docker compose up -d --force-recreate only recreates from existing image; use --build for code changes
  • Docker API versions ≠ Docker product versions — API v1.24 vs v1.44 are protocol versions, not Docker Engine versions
  • performance-schema=OFF in MySQL saves ~200-400MB with no downside for WordPress
  • 89% uncached Cloudflare traffic was caused by WordPress sending no-cache headers — override with Edge TTL rule
  • Bot traffic patterns: simultaneous requests from multiple IPs, random file probes, http:// referrers, mixed user agents
  • Memory monitoring script proved its value immediately — caught WordPress spike in real time
  • Watchtower not recommended for production — prefer deliberate manual updates tested on staging first
  • Always update Traefik before Docker Engine — newer Docker can require minimum API versions that old Traefik can't speak