Create server-stability-and-security-hardening.md via n8n
Commit 8d5c746578
PBS/Tech/Sessions/server-stability-and-security-hardening.md (new file, 248 lines)

---
project: server-stability-and-security-hardening
type: session-notes
status: active
tags:
  - pbs
  - docker
  - production
  - staging
  - wordpress
  - traefik
  - cloudflare
  - security
created: 2026-03-23
updated: 2026-03-23
path: PBS/Tech/Sessions/
---

# Server Stability, Security Hardening & Staging Fixes - March 23, 2026

## Session Summary

Marathon session covering three major areas: (1) production server crash investigation and MySQL/WordPress memory capping, (2) staging Traefik upgrade and debugging, and (3) Cloudflare security and caching improvements. Two server crashes in 48 hours traced to MySQL OOM kills, with a third event tonight traced to WordPress memory bloat caused by bot traffic bursts. All three issues now mitigated with layered defenses.

---

## Part 1: Production - MySQL OOM Investigation & Fix

### Root Cause Confirmed
Both crashes (Saturday 3/22 ~6AM ET, Monday 3/23 ~6:20AM ET) were caused by the Linux kernel OOM-killing MySQL. Confirmed via `journalctl`:
- Saturday: `Out of memory: Killed process 4138817 (mysqld) total-vm:1841380kB`
- Monday: `Out of memory: Killed process 13015 (mysqld) total-vm:1828060kB`
- Both followed the same pattern: MySQL OOM-killed → Docker restarts → system still starved → swapoff killed → cascading failure → manual Linode reboot

### Server Timezone Note
Production server runs in **UTC**. Subtract 4 hours for Eastern time (EDT on these dates). Both crashes appeared as ~10AM UTC in logs but were ~6AM Eastern.

### Journal Persistence Confirmed
- `/var/log/journal` exists and journals survive reboots
- `journalctl --list-boots` shows 5 boot sessions back to May 2025
- For large time ranges, bound the query with `--since`/`--until` so `journalctl` doesn't hang scanning the full history

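To pull the same OOM evidence out of the journal quickly, the kill lines can be reduced to PID, process name, and virtual size. This is a minimal sketch: `parse_oom` is a hypothetical helper name, and it assumes the kernel message format shown in the two log lines above.

```bash
#!/bin/bash
# Read journal lines on stdin and print "<pid> <process> <total-vm kB>"
# for every kernel OOM kill message found.
parse_oom() {
  sed -nE 's/.*Out of memory: Killed process ([0-9]+) \(([^)]+)\).*total-vm:([0-9]+)kB.*/\1 \2 \3/p'
}

# Usage (times are UTC on this server; the window is an example):
#   journalctl _TRANSPORT=kernel --since "2026-03-23 09:30" --until "2026-03-23 11:00" | parse_oom
```

Matching on `_TRANSPORT=kernel` rather than `-k` is deliberate here: `-k` implies the current boot, while the crashes being investigated happened in earlier boots.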
### Investigation Results
- **WooCommerce Action Scheduler:** Cleared - all tasks showed completed status
- **Wordfence Scans:** Scan log showed ~1 minute scan on 3/19 at 10PM ET - doesn't align with crash window; scan schedule is automatic on free tier (no manual control)
- **htop threads:** Multiple MySQL rows in htop are threads, not processes - press `H` to toggle thread view

### MySQL Memory Cap Applied
Added to `mysql` service in `/opt/docker/wordpress/compose.yml`:

```yaml
mysql:
  image: mysql:8.0
  container_name: wordpress_mysql
  restart: unless-stopped
  deploy:
    resources:
      limits:
        # Hard cap: the container is OOM-killed if it exceeds this
        memory: 768M
      reservations:
        memory: 256M
  command: >-
    --default-authentication-plugin=mysql_native_password
    --innodb-buffer-pool-size=256M
    --innodb-log-buffer-size=16M
    --max-connections=50
    --key-buffer-size=16M
    --tmp-table-size=32M
    --max-heap-table-size=32M
    --table-open-cache=256
    --performance-schema=OFF
```

**Key tuning notes:**
- `performance-schema=OFF` saves ~200-400MB alone
- `max-connections=50` reduced from default 151
- `innodb-buffer-pool-size=256M` caps InnoDB's biggest memory consumer

**Result:** MySQL dropped from 474MB (uncapped) to ~225MB (capped at 768MB, using 29% of cap)

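To confirm the limit actually reached the container runtime, the configured cap can be read back with `docker inspect`, which reports it in bytes. A minimal sketch, assuming the container name `wordpress_mysql` from the compose file above; `bytes_to_mib` is a hypothetical helper, not part of the deployed setup.

```bash
#!/bin/bash
# Convert a byte count (as reported in HostConfig.Memory) to whole MiB.
bytes_to_mib() {
  echo $(( $1 / 1024 / 1024 ))
}

# Usage (requires a running Docker daemon):
#   limit=$(docker inspect --format '{{.HostConfig.Memory}}' wordpress_mysql)
#   echo "wordpress_mysql cap: $(bytes_to_mib "$limit") MiB"
```

A reported value of `0` means no limit is set on the container, which would indicate the `deploy.resources` block was not picked up.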
### Memory Monitoring Script Deployed
Created `/usr/local/bin/docker-mem-log.sh`, run from cron to log per-container memory every 5 minutes:

```bash
#!/bin/bash
# Append one timestamped line of per-container memory usage to the log.
LOG_FILE="/var/log/pbs-monitoring/container-memory.log"
# Make sure the log directory exists before the first write.
mkdir -p "$(dirname "$LOG_FILE")"
echo "$(date -u '+%Y-%m-%d %H:%M:%S UTC') | $(docker stats --no-stream --format '{{.Name}}:{{.MemUsage}}' | tr '\n' ' ')" >> "$LOG_FILE"
```

Cron: `/etc/cron.d/docker-mem-monitor`
```
*/5 * * * * root /usr/local/bin/docker-mem-log.sh
```

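Each log line packs every container into a single row, so reading one container's history takes a little parsing. The following is a rough sketch, not part of the deployed script: `container_mem` and `to_mib` are hypothetical helper names, and it assumes `docker stats` reports usage in the form `2.34GiB / 3.8GiB`.

```bash
#!/bin/bash
# Print the usage token (e.g. "2.34GiB") for one container from a log line.
# $1 = log line, $2 = container name.
container_mem() {
  printf '%s\n' "$1" | sed -nE "s/.*[| ]$2:([0-9.]+[A-Za-z]+) .*/\1/p"
}

# Convert a docker-stats size such as "2.34GiB" or "178MiB" to whole MiB.
to_mib() {
  awk -v v="$1" 'BEGIN {
    n = v + 0                        # leading numeric part
    u = v; sub(/^[0-9.]+/, "", u)    # unit suffix
    if (u == "GiB") n *= 1024
    else if (u == "KiB") n /= 1024
    else if (u == "B") n /= 1048576
    printf "%d\n", n + 0.5           # round to nearest MiB
  }'
}

# Usage: WordPress usage in MiB for the most recent log entry
#   line=$(tail -1 /var/log/pbs-monitoring/container-memory.log)
#   to_mib "$(container_mem "$line" wordpress)"
```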
Check with: `tail -20 /var/log/pbs-monitoring/container-memory.log`

---

## Part 2: Production - WordPress Memory Spike & Bot Traffic Discovery

### Memory Monitoring Pays Off
The monitoring script caught a WordPress memory spike in real time:

| Time (UTC) | WordPress | MySQL |
|---|---|---|
| 02:15 | 1.12 GB | 245 MB |
| 02:20 | **2.34 GB** | 178 MB |
| 02:30 | **2.91 GB** | 141 MB |

### Root Cause: Bot Traffic Burst
WordPress access logs at 02:16:59 UTC showed 10+ near-simultaneous requests within 3 seconds:
- Multiple IPs hitting the homepage simultaneously via Cloudflare
- Requests for random `.flac` and `.webm` files (classic bot probing)
- All using an `http://` referrer (not `https://`), so not legitimate traffic
- Mix of spoofed user agents designed to look like different browsers
- Each uncached request spawned a PHP process, causing WordPress to spike to 2.9GB

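Bursts like this can be spotted by counting requests per second in the access log. A minimal sketch, assuming Apache combined log format (timestamp in the fourth whitespace-separated field); `burst_seconds` is a hypothetical helper name, and the `wordpress` container name in the usage line is taken from the memory snapshot table.

```bash
#!/bin/bash
# Print "<count> <timestamp>" for every second that saw at least $1 requests.
# Reads access-log lines on stdin; field 4 looks like "[24/Mar/2026:02:16:59".
burst_seconds() {
  awk -v min="$1" '
    { ts = substr($4, 2); hits[ts]++ }                     # drop the leading "["
    END { for (t in hits) if (hits[t] >= min) print hits[t], t }
  ' | sort -rn
}

# Usage: seconds in the last hour with 5 or more requests
#   docker logs --since 1h wordpress 2>&1 | burst_seconds 5
```

Per-second counting is crude, but it matches how the burst showed up here: many requests stamped with the same second.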
### WordPress Memory Cap Applied
Added to `wordpress` service in `/opt/docker/wordpress/compose.yml`:

```yaml
deploy:
  resources:
    limits:
      memory: 2000M
```

**Result:** WordPress now capped at ~2GB, currently running at ~866MB (43% of cap)

### Cloudflare Traffic Analysis
24-hour stats showed 11.72k total requests with **10.4k uncached (89%)**. Two visible traffic spikes aligned with crash events.

---

## Part 3: Cloudflare Security & Caching Hardening

### Security Changes
1. **Bot Fight Mode** - Enabled (Security → Settings)
2. **WAF Rule: Block suspicious file probes** - Blocks requests ending in `.flac`, `.webm`, `.exe`, `.dll`
3. **Rate Limiting Rule: Homepage spam** - 30 requests per 10 seconds per IP, blocks for 10 seconds

### Caching Changes
1. **Browser Cache TTL** - Increased from 4 hours to 1 day
2. **Always Online** - Enabled (serves cached pages when server is down)
3. **Cache Rule** - Applied Cloudflare "Cache Everything" template:
   - Cache eligibility: Eligible for cache
   - Edge TTL: Overrides origin cache-control headers
   - Browser TTL: Set
   - Serve stale while revalidating: Enabled

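One way to check that the cache rule is taking effect is the `cf-cache-status` response header that Cloudflare adds (values such as `HIT`, `MISS`, `DYNAMIC`, or `BYPASS`). A minimal sketch: `cache_status` is a hypothetical helper, and `https://example.com/` stands in for the real site URL. It would be run as `curl -sI https://example.com/ | cache_status`; a second request to the same URL should report `HIT` once the page is cached at the edge.

```bash
#!/bin/bash
# Read HTTP response headers on stdin and print the cf-cache-status value.
cache_status() {
  tr -d '\r' | awk -F': *' 'tolower($1) == "cf-cache-status" { print $2 }'
}
```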
**Important:** After publishing new content, purge cache via Caching → Configuration → Purge Cache

---

## Part 4: Staging - Traefik Upgrade & Debugging

### Docker API Version Mismatch
`apt-get upgrade` on staging updated Docker Engine to v29.2.1 (API v1.53, minimum client API v1.44). Traefik v3.5's built-in Docker client only spoke API v1.24 → Docker rejected all Traefik requests → entire site down.

**Fix:** Updated Traefik from `v3.5` to `v3.6.11`
- v3.6.11 includes Docker API auto-negotiation fix
- Also patches 3 CVEs (CVE-2026-32595, CVE-2026-32305, CVE-2026-32695)

**Production impact:** Must update Traefik on production **before** running `apt-get upgrade`, or the same break will occur. Update Traefik first, then Docker.

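Before upgrading Docker on production, the daemon's minimum accepted API version can be compared against what a client speaks. A minimal sketch under stated assumptions: `api_older_than` is a hypothetical helper, and the `1.24` in the usage line is the Traefik v3.5 client version noted above.

```bash
#!/bin/bash
# Succeed (exit 0) when API version $1 is older than $2, e.g. "1.24" vs "1.44".
# Compares major and minor parts numerically, so 1.9 sorts before 1.10.
api_older_than() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, x, "."); split(b, y, ".")
    older = (x[1] + 0 < y[1] + 0) || (x[1] + 0 == y[1] + 0 && x[2] + 0 < y[2] + 0)
    exit !older
  }'
}

# Usage (requires a running Docker daemon):
#   min=$(docker version --format '{{.Server.MinAPIVersion}}')
#   api_older_than 1.24 "$min" && echo "a client pinned to API 1.24 would be rejected"
```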
### WordPress Unhealthy Container Issue
After the Traefik upgrade, WordPress showed as "unhealthy" → Traefik v3.6 respects Docker health status and skips unhealthy containers → site returned 404.

**Root cause:** The MySQL password in `.env` contained a `$` character, which Docker Compose interprets as variable substitution. Password was silently corrupted → WordPress couldn't connect to MySQL → healthcheck failed → Traefik wouldn't route.

**Fix:** Escaped `$` characters in `.env` file. For future reference: `$` must be doubled (`$$`) in Docker `.env` files.

**Lesson:** Traefik v3.6+ skips unhealthy containers entirely, so they won't show up as routers in the dashboard.

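A quick pre-deploy check can flag `.env` lines that contain a lone `$`. This is only a sketch: `lone_dollar_lines` is a hypothetical helper, and it will also flag intentional references such as `${VAR}`, so each reported line still needs a manual look.

```bash
#!/bin/bash
# Print the line numbers in an env file that contain a lone "$"
# (one that is not part of a doubled "$$").
lone_dollar_lines() {
  grep -nE '(^|[^$])[$]([^$]|$)' "$1" | cut -d: -f1
}

# Usage:
#   lone_dollar_lines .env
#   docker compose config   # prints the resolved config, so a mangled value is visible
```

`docker compose config` is the more direct check, since it shows the values exactly as Compose resolved them.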
### PBS Manager Web App (Staging)
- Healthcheck using `curl` fails on `python:3.13-slim` (curl not installed)
- Fix: Use Python-based healthcheck instead:
```yaml
healthcheck:
  test: ["CMD", "python", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:5000/api/health')"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 30s
```
- Code changes require `docker compose up -d --build` (not just `--force-recreate`)
- SQLAlchemy models must stay in sync with database schema changes

---

## Layered Defense Summary

| Layer | What It Does | Status |
|---|---|---|
| Cloudflare Bot Fight Mode | Auto-blocks known bots | ✅ Enabled |
| Cloudflare WAF rules | Blocks file probes (.flac, .webm, .exe, .dll) | ✅ Deployed |
| Cloudflare Rate Limiting | 30 req/10s per IP on homepage | ✅ Deployed |
| Cloudflare Caching | Cache everything, serve stale while revalidating | ✅ Deployed |
| Cloudflare Always Online | Serves cached site during outages | ✅ Enabled |
| WordPress memory cap | 2GB limit prevents runaway PHP | ✅ Applied |
| MySQL memory cap | 768MB limit with tuned buffers | ✅ Applied |
| Memory monitoring | Logs per-container stats every 5 min | ✅ Running |
| Journal persistence | OOM kill logs survive reboots | ✅ Confirmed |

---

## Current Production Memory Snapshot (post-fixes)

| Container | Memory | Limit | % of Limit |
|---|---|---|---|
| wordpress | 866 MB | 2,000 MB | 43% |
| n8n | 341 MB | System | 9% |
| wordpress_mysql | 190 MB | 768 MB | 25% |
| uptime-kuma | 124 MB | System | 3% |
| traefik | 56 MB | System | 1% |
| redis | 17 MB | 640 MB | 3% |
| wpcron | 16 MB | System | <1% |
| pbs-api | 14 MB | System | <1% |
| **Total** | **~1.62 GB** | | |

---

## Still Open

- [ ] Monitor overnight stability - check memory logs tomorrow AM
- [ ] Monitor Cloudflare cache hit rate over next 24 hours (should improve dramatically)
- [ ] Add log rotation for `/var/log/pbs-monitoring/container-memory.log`
- [ ] Update Traefik on production to v3.6.11 **before** running `apt-get upgrade`
- [ ] Disable `apt-daily.service` on production (automatic unattended updates)
- [ ] Investigate a Cloudflare cache bypass for wp-admin if admin pages serve stale content
- [ ] Server sizing discussion still open - 4GB may be tight for Gitea + Authelia
- [ ] PBS Manager web app healthcheck and basicauth fixes on staging
- [ ] Consider Watchtower on staging only as a canary (discussed and decided against for production)

---

## Key Learnings

- **Docker `.env` files treat `$` as variable substitution** - double it (`$$`) or avoid `$` in passwords entirely
- **Traefik v3.6+ skips unhealthy containers** - if a container's healthcheck fails, Traefik won't route to it (no error, just missing from dashboard)
- **`docker compose up -d --force-recreate`** only recreates from existing image; use `--build` for code changes
- **Docker API versions ≠ Docker product versions** - API v1.24 vs v1.44 are protocol versions, not Docker Engine versions
- **`performance-schema=OFF`** in MySQL saves ~200-400MB with no downside for WordPress itself; the trade-off is losing Performance Schema diagnostics
- **89% uncached Cloudflare traffic** was caused by WordPress sending `no-cache` headers - override with Edge TTL rule
- **Bot traffic patterns:** simultaneous requests from multiple IPs, random file probes, `http://` referrers, mixed user agents
- **Memory monitoring script** proved its value immediately - caught WordPress spike in real time
- **Watchtower not recommended for production** - prefer deliberate manual updates tested on staging first
- **Always update Traefik before Docker Engine** - newer Docker can require minimum API versions that old Traefik can't speak