---
project: server-stability-mysql-oom
type: session-notes
status: active
tags:
  - pbs
  - docker
  - production
  - wordpress
created: 2026-03-23
updated: 2026-03-23
path: PBS/Tech/Sessions/
---

# Server Stability - MySQL OOM Fix & Memory Monitoring

## Summary

Two server crashes in 48 hours (Saturday March 22 ~6AM ET, Monday March 23 ~6:20AM ET) were traced to MySQL being OOM-killed by the Linux kernel. Root cause: MySQL had no memory limits and was consuming ~1.8GB before the OOM killer intervened, triggering a cascading failure that made the server completely unresponsive.

## Investigation Findings

### OOM Kill Evidence (from systemd journal)
- **Saturday crash:** `Out of memory: Killed process 4138817 (mysqld) total-vm:1841380kB`
- **Monday crash:** `Out of memory: Killed process 13015 (mysqld) total-vm:1828060kB`
- Both crashes followed the same pattern: MySQL OOM-killed → Docker restarts MySQL → system still memory-starved → swapoff killed → complete server lockup → manual Linode reboot required

### Crash Timeline
- Both crashes occurred around 6:00-6:20 AM Eastern (10:00-10:20 UTC — server runs in UTC)
- WooCommerce installed Saturday — first crash Saturday night, second Monday morning
- WooCommerce Action Scheduler showed no failed/stuck tasks — likely not the direct trigger
- Wordfence scan logs showed a ~1 minute scan on March 19 at ~10PM ET, which does not align with the crash window
- Wordfence scan scheduling is automatic on the free tier (no manual schedule control)

### Ruled Out
- WooCommerce Action Scheduler runaway tasks (all showed completed status)
- Wordfence scan timing (didn't align with crash window)
- Multiple MySQL instances (htop showed threads, not separate processes — press
`H` in htop to toggle thread view)

### Not Yet Determined
- Exact trigger causing MySQL to balloon to 1.8GB overnight
- Whether WooCommerce's added baseline DB load is the tipping point
- `apt-daily.service` was running during Monday's crash — may have contributed to memory pressure

## Changes Made

### MySQL Memory Cap & Tuning (compose.yml)
Added to the `mysql` service in `/opt/docker/wordpress/compose.yml`:

```yaml
  mysql:
    image: mysql:8.0
    container_name: wordpress_mysql
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 768M
        reservations:
          memory: 256M
    environment:
      MYSQL_DATABASE: wordpress
      MYSQL_USER: wordpress
      MYSQL_PASSWORD: ${MYSQL_PASSWORD}
      MYSQL_ROOT_PASSWORD: ${MYSQL_ROOT_PASSWORD}
    volumes:
      - mysql_data:/var/lib/mysql
    networks:
      - internal
    command: >-
      --default-authentication-plugin=mysql_native_password
      --innodb-buffer-pool-size=256M
      --innodb-log-buffer-size=16M
      --max-connections=50
      --key-buffer-size=16M
      --tmp-table-size=32M
      --max-heap-table-size=32M
      --table-open-cache=256
      --performance-schema=OFF
```

**What each setting does:**
- `limits: memory: 768M` — Hard cap: if MySQL exceeds 768MB, the OOM kill is scoped to the container's cgroup and `restart: unless-stopped` brings it back (a contained restart vs a host-wide kernel OOM)
- `reservations: memory: 256M` — Guarantees MySQL gets at least 256MB
- `innodb-buffer-pool-size=256M` — Caps the InnoDB cache (MySQL's biggest memory consumer)
- `max-connections=50` — Reduced from the default 151 (less memory reserved per connection)
- `performance-schema=OFF` — Saves ~200-400MB (internal MySQL instrumentation, not needed here)

**Result:**
| Metric | Before | After |
|--------|--------|-------|
| MySQL memory usage | 474MB (uncapped, spiked to 1.8GB) | 225MB (capped at 768MB) |
| MySQL % of cap | N/A | 29% |
| Total stack memory | ~2.05GB | ~2.0GB |

### Memory Monitoring Script
Created `/usr/local/bin/docker-mem-log.sh` — logs per-container memory usage every 5 minutes via cron:

```bash
#!/bin/bash
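# Logs one timestamped line of per-container memory usage; invoked by cron every 5 minutes.
# Added hardening (not in the original script): ensure the log directory exists
# so the append below cannot fail on a fresh host.
mkdir -p /var/log/pbs-monitoring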
LOG_FILE="/var/log/pbs-monitoring/container-memory.log"
echo "$(date -u '+%Y-%m-%d %H:%M:%S UTC') | $(docker stats --no-stream --format '{{.Name}}:{{.MemUsage}}' | tr '\n' ' ')" >> "$LOG_FILE"
```

Cron entry at `/etc/cron.d/docker-mem-monitor`:
```
*/5 * * * * root /usr/local/bin/docker-mem-log.sh
```

**Check logs with:** `tail -20 /var/log/pbs-monitoring/container-memory.log`

### Journal Persistence Confirmed
- `/var/log/journal` exists and is retaining logs across reboots
- `journalctl --list-boots` shows 5 boot sessions dating back to May 2025
- OOM kill evidence was successfully retrieved from previous boots

## Current Server Memory Snapshot (post-fix)
| Container | Memory | % of Limit |
|-----------|--------|------------|
| wordpress | 1.11 GB | 29% (of system) |
| wordpress_mysql | 225 MB | 29% (of 768MB cap) |
| n8n | 200 MB | 5% |
| uptime-kuma | 100 MB | 3% |
| traefik | 37 MB | 1% |
| pbs-api | 28 MB | 1% |
| redis | 13 MB | 2% (of 640MB cap) |
| wpcron | 8 MB | <1% |

## Still Open

- [ ] Monitor overnight stability — check memory logs tomorrow AM
- [ ] Add log rotation for `/var/log/pbs-monitoring/container-memory.log`
- [ ] Investigate `apt-daily.service` — consider disabling automatic apt updates
- [ ] Server sizing discussion: 4GB may be tight for adding Gitea + Authelia
- [ ] Determine if Wordfence free-tier scan is contributing to memory pressure
- [ ] Consider setting server timezone to Eastern for easier log reading
- [ ] Investigate root cause of MySQL memory bloat (WooCommerce correlation still strong)

## Key Learnings

- **htop shows threads, not processes** — press `H` to toggle thread visibility; one MySQL process can show as dozens of rows
- **systemd journal persists across reboots** if `/var/log/journal` exists and `Storage=auto` or `Storage=persistent` is set
- **`journalctl -b -1`** shows the previous boot's logs; use `--since`/`--until` to bound large time ranges so queries don't hang
- 
**`performance-schema=OFF`** in MySQL saves ~200-400MB; the tradeoff is losing MySQL's internal instrumentation, which a small production WordPress site rarely needs
- **Docker `deploy.resources.limits.memory`** provides a controlled cap — the OOM kill stays contained to the container (which the restart policy brings back) instead of cascading across the host
- **Server timezone is UTC** — subtract 4 hours for Eastern time when reading logs
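
For the open log-rotation item above, a minimal sketch of an `/etc/logrotate.d/pbs-monitoring` entry (the filename and retention values are assumptions, nothing is configured on the server yet):

```
/var/log/pbs-monitoring/container-memory.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
}
```

At one line every 5 minutes (~288 lines/day), weekly rotation with four compressed archives keeps roughly a month of history in a negligible amount of disk.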