---
project: server-stability-mysql-oom
type: session-notes
status: active
tags:
- pbs
- docker
- production
- wordpress
created: 2026-03-23
updated: 2026-03-23
path: PBS/Tech/Sessions/
---
# Server Stability - MySQL OOM Fix & Memory Monitoring
## Summary
Two server crashes in 48 hours (Saturday March 22 ~6AM ET, Monday March 23 ~6:20AM ET) traced to MySQL being OOM-killed by the Linux kernel. Root cause: MySQL had no memory limits and was consuming ~1.8GB before the OOM killer intervened, triggering a cascading failure that made the server completely unresponsive.
## Investigation Findings
### OOM Kill Evidence (from systemd journal)
- **Saturday crash:** `Out of memory: Killed process 4138817 (mysqld) total-vm:1841380kB`
- **Monday crash:** `Out of memory: Killed process 13015 (mysqld) total-vm:1828060kB`
- Both crashes followed the same pattern: MySQL OOM-killed → Docker restarts MySQL → system still memory-starved → swapoff killed → complete server lockup → manual Linode reboot required
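The journal queries behind the evidence above can be reproduced with a short sketch (assumes persistent journald storage, as confirmed later in these notes):

```bash
# List retained boots; the previous boot is index -1
journalctl --list-boots 2>/dev/null || true

# Kernel-ring messages from the previous boot, filtered to OOM-killer lines
journalctl -k -b -1 --grep 'Out of memory' --no-pager 2>/dev/null || true
```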
### Crash Timeline
- Both crashes occurred around 6:00-6:20 AM Eastern (10:00-10:20 UTC — server runs in UTC)
- WooCommerce installed Saturday — first crash Saturday night, second Monday morning
- WooCommerce Action Scheduler showed no failed/stuck tasks — likely not the direct trigger
- Wordfence scan logs showed a ~1 minute scan on March 19 at ~10PM ET — does not align with crash window
- Wordfence scan scheduling is automatic on free tier (no manual schedule control)
### Ruled Out
- WooCommerce Action Scheduler runaway tasks (all showed completed status)
- Wordfence scan timing (didn't align with crash window)
- Multiple MySQL instances (htop showed threads, not separate processes — press `H` in htop to toggle thread view)
### Not Yet Determined
- Exact trigger causing MySQL to balloon to 1.8GB overnight
- Whether WooCommerce's added baseline DB load is the tipping point
- `apt-daily.service` was running during Monday's crash — may be contributing to memory pressure
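Whether `apt-daily.service` overlaps a future crash window can be checked from its timers and unit logs; a sketch (timestamps bracket the Monday window, in server UTC):

```bash
# Last/next runs of the apt maintenance timers
systemctl list-timers 'apt-daily*' --all --no-pager 2>/dev/null || true

# What apt-daily logged around the Monday crash window (server time is UTC)
journalctl -u apt-daily.service \
  --since '2026-03-23 09:30:00' --until '2026-03-23 10:45:00' \
  --no-pager 2>/dev/null || true
```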
## Changes Made
### MySQL Memory Cap & Tuning (compose.yml)
Added to the `mysql` service in `/opt/docker/wordpress/compose.yml`:
```yaml
mysql:
  image: mysql:8.0
  container_name: wordpress_mysql
  restart: unless-stopped
  deploy:
    resources:
      limits:
        memory: 768M
      reservations:
        memory: 256M
  environment:
    MYSQL_DATABASE: wordpress
    MYSQL_USER: wordpress
    MYSQL_PASSWORD: ${MYSQL_PASSWORD}
    MYSQL_ROOT_PASSWORD: ${MYSQL_ROOT_PASSWORD}
  volumes:
    - mysql_data:/var/lib/mysql
  networks:
    - internal
  command: >-
    --default-authentication-plugin=mysql_native_password
    --innodb-buffer-pool-size=256M
    --innodb-log-buffer-size=16M
    --max-connections=50
    --key-buffer-size=16M
    --tmp-table-size=32M
    --max-heap-table-size=32M
    --table-open-cache=256
    --performance-schema=OFF
```
**What each setting does:**
- `limits: memory: 768M` — the container is killed at 768MB and restarted by Docker per `restart: unless-stopped` (a controlled restart instead of a host-wide kernel OOM cascade)
- `reservations: memory: 256M` — Guarantees MySQL gets at least 256MB
- `innodb-buffer-pool-size=256M` — Caps InnoDB cache (MySQL's biggest memory consumer)
- `max-connections=50` — Reduced from default 151 (less memory per connection)
- `performance-schema=OFF` — Saves ~200-400MB (internal MySQL monitoring not needed)
**Result:**
| Metric | Before | After |
|--------|--------|-------|
| MySQL memory usage | 474MB (uncapped, spiked to 1.8GB) | 225MB (capped at 768MB) |
| MySQL % of cap | N/A | 29% |
| Total stack memory | ~2.05GB | ~2.0GB |
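After `docker compose up -d`, the cap can be spot-checked from the running container; a sketch (`wordpress_mysql` is the container name from the compose file):

```bash
# Cgroup memory limit in bytes; 768M should read back as 805306368 (768 * 1024 * 1024)
docker inspect --format '{{.HostConfig.Memory}}' wordpress_mysql 2>/dev/null || true

# Live usage against the cap
docker stats --no-stream --format '{{.Name}}: {{.MemUsage}} ({{.MemPerc}})' wordpress_mysql 2>/dev/null || true
```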
### Memory Monitoring Script
Created `/usr/local/bin/docker-mem-log.sh` — logs per-container memory usage every 5 minutes via cron:
```bash
#!/bin/bash
LOG_FILE="/var/log/pbs-monitoring/container-memory.log"
mkdir -p "$(dirname "$LOG_FILE")"
echo "$(date -u '+%Y-%m-%d %H:%M:%S UTC') | $(docker stats --no-stream --format '{{.Name}}:{{.MemUsage}}' | tr '\n' ' ')" >> "$LOG_FILE"
```
Cron entry at `/etc/cron.d/docker-mem-monitor`:
```
*/5 * * * * root /usr/local/bin/docker-mem-log.sh
```
**Check logs with:** `tail -20 /var/log/pbs-monitoring/container-memory.log`
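The log-rotation item in Still Open could be closed with a small logrotate fragment (a sketch, not yet applied), e.g. at `/etc/logrotate.d/pbs-monitoring`:

```
/var/log/pbs-monitoring/container-memory.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
}
```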
### Journal Persistence Confirmed
- `/var/log/journal` exists and is retaining logs across reboots
- `journalctl --list-boots` shows 5 boot sessions dating back to May 2025
- OOM kill evidence was successfully retrieved from previous boots
## Current Server Memory Snapshot (post-fix)
| Container | Memory | % of Limit |
|-----------|--------|------------|
| wordpress | 1.11 GB | 29% (of system) |
| wordpress_mysql | 225 MB | 29% (of 768MB cap) |
| n8n | 200 MB | 5% |
| uptime-kuma | 100 MB | 3% |
| traefik | 37 MB | 1% |
| pbs-api | 28 MB | 1% |
| redis | 13 MB | 2% (of 640MB cap) |
| wpcron | 8 MB | <1% |
## Still Open
- [ ] Monitor overnight stability; check memory logs tomorrow AM
- [ ] Add log rotation for `/var/log/pbs-monitoring/container-memory.log`
- [ ] Investigate `apt-daily.service`; consider disabling automatic apt updates
- [ ] Server sizing discussion: 4GB may be tight for adding Gitea + Authelia
- [ ] Determine if Wordfence free-tier scan is contributing to memory pressure
- [ ] Consider setting server timezone to Eastern for easier log reading
- [ ] Investigate root cause of MySQL memory bloat (WooCommerce correlation still strong)
## Key Learnings
- **htop shows threads, not processes:** press `H` to toggle thread visibility; one MySQL process can show as dozens of rows
- **systemd journal persists across reboots** if `/var/log/journal` exists and `Storage=auto` or `Storage=persistent` is set
- **`journalctl -b -1`** shows previous boot logs; use `--since`/`--until` for large time ranges to avoid hanging
- **`performance-schema=OFF`** in MySQL saves ~200-400MB with no downside for production WordPress
- **Docker `deploy.resources.limits.memory`** provides a controlled cap: Docker restarts the container instead of the kernel OOM-killing it and cascading
- **Server timezone is UTC:** subtract 4 hours for Eastern when reading logs (5 hours when DST is not in effect)
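Rather than subtracting by hand, GNU `date` can convert a journal timestamp to Eastern (and handles DST automatically):

```bash
# Monday's crash window start, converted from UTC to Eastern
TZ=America/New_York date -d '2026-03-23 10:20 UTC' '+%Y-%m-%d %H:%M %Z'
# → 2026-03-23 06:20 EDT
```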