---
project: pbs-youtube-analytics
type: project-plan
status: active
tags:
  - pbs
  - youtube
  - python
  - automation
  - n8n
  - flask
  - streamlit
  - analytics
created: 2026-03-23
updated: 2026-03-23
path: PBS/Tech/Projects/
---

# PBS YouTube Analytics Pipeline

## Project Goal

Build a self-hosted YouTube analytics pipeline for the PBS channel that collects video performance data (with a focus on audience retention), stores it in SQLite, automates collection via n8n, sends alerts to Google Chat, and visualizes insights through a Streamlit dashboard.

## Why This Matters

YouTube Studio's built-in analytics are limited and don't let us slice data the way we need. By owning the raw data, Travis can do proper analysis in Python/R, and Jenny gets a clean dashboard showing what's actually working in our content — especially where viewers drop off or rewatch.

---

## Architecture Overview

```
YouTube Analytics API
          |
Python Collector Script (PyCharm + UV)
          |
SQLite Database (self-contained file)
          |
     ┌────┴────┐
     │         │
    n8n     Streamlit
(schedule   (dashboard
 + alerts)  via Traefik)
```

- **Data Collection:** Python script using `google-api-python-client` + `google-auth-oauthlib`
- **Storage:** SQLite database file (lightweight, portable, perfect for read-heavy analytics)
- **Automation:** n8n triggers collection on schedule, sends Google Chat alerts
- **Visualization:** Streamlit app served as Docker container behind Traefik

---

## Phase 1: Google Cloud + API Setup

**Estimated Time:** 1-2 hours
**Goal:** Get API credentials and verify access to PBS YouTube data

### Tasks

- [ ] Create Google Cloud project (or use existing PBS project)
- [ ] Enable YouTube Data API v3
- [ ] Enable YouTube Analytics API v2
- [ ] Configure OAuth consent screen (Internal if using Workspace, External otherwise)
- [ ] Create OAuth 2.0 Desktop App credentials
- [ ] Download `client_secret.json`
- [ ] Test OAuth flow — authorize and confirm access to PBS channel data

### Key Details

- **Required OAuth scope:** `https://www.googleapis.com/auth/yt-analytics.readonly`
- **Additional scope for video metadata:** `https://www.googleapis.com/auth/youtube.readonly`
- OAuth tokens will be stored securely and refreshed automatically
- First auth requires browser interaction; subsequent runs use the refresh token

### Deliverable

Working OAuth credentials that can query the PBS channel's analytics data

---

## Phase 2: Python Data Collector

**Estimated Time:** 3-4 hours
**Goal:** Python script that pulls video stats and retention data into SQLite
**Tools:** PyCharm Professional, UV package manager

### Tasks

- [ ] Initialize project with UV (`uv init pbs-youtube-analytics`)
- [ ] Install dependencies: `google-api-python-client`, `google-auth-oauthlib`, `google-auth-httplib2`
- [ ] Build OAuth2 auth module with token persistence (refresh token stored in JSON)
- [ ] Build video list collector (pulls all PBS videos/shorts with metadata)
- [ ] Build retention data collector (audience retention curves per video)
- [ ] Build general metrics collector (views, watch time, likes, traffic sources, etc.)
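As a sketch of the ingestion pattern this phase calls for, here's a minimal idempotent upsert using SQLite's `INSERT ... ON CONFLICT`. The `upsert_video` helper and the trimmed-down table are hypothetical illustrations; column names follow the schema in this plan:

```python
import sqlite3

# Trimmed-down videos table for illustration; the full schema in this plan
# adds thumbnail_url, description, and timestamp columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS videos (
    video_id TEXT PRIMARY KEY,
    title TEXT NOT NULL,
    published_at TEXT NOT NULL,
    duration_seconds INTEGER
);
"""

def upsert_video(conn: sqlite3.Connection, video: dict) -> None:
    """Insert a video row, or update it in place if video_id already exists.

    Re-running the collector over the same videos is therefore safe.
    """
    conn.execute(
        """
        INSERT INTO videos (video_id, title, published_at, duration_seconds)
        VALUES (:video_id, :title, :published_at, :duration_seconds)
        ON CONFLICT(video_id) DO UPDATE SET
            title = excluded.title,
            published_at = excluded.published_at,
            duration_seconds = excluded.duration_seconds
        """,
        video,
    )

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
row = {"video_id": "abc123", "title": "Draft title",
       "published_at": "2026-03-01T12:00:00Z", "duration_seconds": 58}
upsert_video(conn, row)
row["title"] = "Final title"  # second run: same key, updated metadata
upsert_video(conn, row)
count, title = conn.execute("SELECT COUNT(*), MAX(title) FROM videos").fetchone()
```

Inside the `DO UPDATE SET` clause, `excluded` refers to the row the insert attempted, so a rerun overwrites metadata without creating duplicates.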
- [ ] Design and create SQLite schema
- [ ] Implement data ingestion with upsert logic (idempotent runs)
- [ ] Add CLI interface for manual runs and backfill
- [ ] Test with real PBS channel data

### SQLite Schema (Initial Design)

```sql
-- Video metadata from Data API
CREATE TABLE videos (
    video_id TEXT PRIMARY KEY,
    title TEXT NOT NULL,
    published_at TEXT NOT NULL,
    duration_seconds INTEGER,
    video_type TEXT CHECK(video_type IN ('video', 'short')),
    thumbnail_url TEXT,
    description TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP
);

-- Daily aggregate metrics from Analytics API
CREATE TABLE video_daily_metrics (
    video_id TEXT NOT NULL,
    date TEXT NOT NULL,
    views INTEGER DEFAULT 0,
    estimated_minutes_watched REAL DEFAULT 0,
    average_view_duration_seconds REAL DEFAULT 0,
    average_view_percentage REAL DEFAULT 0,
    likes INTEGER DEFAULT 0,
    dislikes INTEGER DEFAULT 0,
    comments INTEGER DEFAULT 0,
    shares INTEGER DEFAULT 0,
    subscribers_gained INTEGER DEFAULT 0,
    subscribers_lost INTEGER DEFAULT 0,
    impressions INTEGER DEFAULT 0,
    impressions_ctr REAL DEFAULT 0,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (video_id, date),
    FOREIGN KEY (video_id) REFERENCES videos(video_id)
);

-- Audience retention curve (100 data points per video)
CREATE TABLE video_retention (
    video_id TEXT NOT NULL,
    elapsed_ratio REAL NOT NULL,
    audience_watch_ratio REAL NOT NULL,
    relative_retention_performance REAL,
    fetched_at TEXT DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (video_id, elapsed_ratio),
    FOREIGN KEY (video_id) REFERENCES videos(video_id)
);

-- Traffic source breakdown per video per day
CREATE TABLE video_traffic_sources (
    video_id TEXT NOT NULL,
    date TEXT NOT NULL,
    traffic_source TEXT NOT NULL,
    views INTEGER DEFAULT 0,
    estimated_minutes_watched REAL DEFAULT 0,
    PRIMARY KEY (video_id, date, traffic_source),
    FOREIGN KEY (video_id) REFERENCES videos(video_id)
);

-- Channel-level daily summary
CREATE TABLE channel_daily_metrics (
    date TEXT PRIMARY KEY,
    total_views INTEGER DEFAULT 0,
    total_estimated_minutes_watched REAL DEFAULT 0,
    subscribers_gained INTEGER DEFAULT 0,
    subscribers_lost INTEGER DEFAULT 0,
    net_subscribers INTEGER DEFAULT 0,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
```

### API Details — Retention Data

- **Endpoint:** YouTube Analytics API v2 `reports.query`
- **Dimension:** `elapsedVideoTimeRatio` (100 data points, values 0.01 to 1.0)
- **Metrics available:**
  - `audienceWatchRatio` — absolute retention (can exceed 1.0 for rewatched segments)
  - `relativeRetentionPerformance` — compared to similar-length YouTube videos (0 to 1 scale)
  - `startedWatching` — how often viewers started watching at this point
  - `stoppedWatching` — how often viewers stopped watching at this point
- **Limitation:** Retention data is per-video only (one video per API call, no further dimension splits)
- **Note:** For a 60-second Short, each data point ≈ 0.6 seconds. For a 10-minute video, each ≈ 6 seconds.

### Deliverable

Python CLI tool that pulls all PBS video data + retention curves into a local SQLite database

---

## Phase 3: n8n Automation + Alerts

**Estimated Time:** 2-3 hours
**Goal:** Automate daily data collection and send performance alerts to Google Chat

### Tasks

- [ ] Deploy collector script to Linode server (alongside n8n)
- [ ] Create n8n workflow: daily scheduled trigger → Execute Command node → runs Python collector
- [ ] Add error handling: notify Google Chat on collection failures
- [ ] Create weekly digest alert: top performing videos, notable retention patterns
- [ ] Create threshold alerts: video crosses view milestones, unusual engagement spikes
- [ ] Test scheduled execution end-to-end

### Alert Ideas

- **Weekly Digest (for Jenny):** Top 5 videos this week by views, best retention video, shorts vs long-form comparison
- **Spike Alert:** Video gets 2x+ its average daily views
- **Milestone Alert:** Video crosses 1K, 5K, 10K views
- **New Video Check-in:** 48-hour performance report for newly published
content

### Deliverable

Automated daily collection with Google Chat alerts for notable events

---

## Phase 4: Streamlit Dashboard

**Estimated Time:** 4-6 hours
**Goal:** Interactive web dashboard for Jenny and Travis to explore PBS YouTube performance

### Tasks

- [ ] Initialize Streamlit project with UV
- [ ] Build retention heatmap view (the star feature)
- [ ] Build video comparison view (side-by-side retention curves)
- [ ] Build channel overview page (trends over time)
- [ ] Build shorts vs long-form comparison view
- [ ] Build traffic source analysis view
- [ ] Dockerize Streamlit app
- [ ] Add to docker-compose with Traefik labels
- [ ] Deploy to staging first, then production
- [ ] Secure with Authelia (when SSO rollout happens) or basic auth initially

### Dashboard Pages (Initial Concept)

1. **Channel Overview** — subscriber trend, total views/watch time over time, publishing cadence
2. **Video Deep Dive** — select a video, see retention curve, daily metrics, traffic sources
3. **Retention Heatmap** — all videos on one view, color-coded by retention quality at each time segment
4. **Shorts Lab** — Shorts-specific view comparing hook effectiveness (first 3 seconds), rewatch rates
5. **What's Working** — auto-surfaced insights: best retention patterns, top traffic sources, optimal video length

### Deployment

- Streamlit container behind Traefik at `analytics.plantbasedsoutherner.com` (or similar subdomain)
- Reads from same SQLite file populated by the collector
- Protected by basic auth initially, Authelia later

### Deliverable

Live dashboard accessible to Jenny and Travis showing PBS YouTube performance with retention analysis

---

## Phase 5: Advanced Analysis & Iteration

**Estimated Time:** Ongoing
**Goal:** Leverage the data for deeper content strategy insights

### Future Ideas

- [ ] Correlate retention patterns with recipe categories (link to `pbs_recipes` table)
- [ ] A/B analysis: compare thumbnail styles, intro approaches, video lengths
- [ ] Optimal posting time analysis using traffic source timing data
- [ ] Export data to R for statistical modeling
- [ ] Instagram vs YouTube cross-platform performance comparison
- [ ] Automated content recommendations based on what's performing

---

## Prerequisites & Dependencies

| Requirement | Status | Notes |
|---|---|---|
| Google Cloud project | Needed | May already exist for Google Workspace |
| YouTube Analytics API enabled | Needed | Free, quota-based |
| OAuth 2.0 credentials | Needed | Desktop app type |
| Python + UV | Ready | Travis's local dev setup |
| Linode server access | Ready | Same server running n8n |
| n8n operational | Ready | Already running PBS automation |
| Traefik reverse proxy | Ready | For Streamlit subdomain |
| SQLite | Ready | Ships with Python, no setup needed |

---

## API Quotas & Limits

- YouTube Analytics API: 200 queries/day default (can request increase)
- YouTube Data API v3: 10,000 units/day (listing videos costs ~1-3 units each)
- Retention data: one video per API call (plan batch collection accordingly)
- Data availability: typically 2-3 day delay from YouTube

---

## Key Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Database | SQLite | Self-contained, portable, perfect for read-heavy analytics workload. No server process needed. |
| Dashboard | Streamlit | Python-native, fast to build, interactive. Travis can leverage data analyst skills directly. |
| API approach | YouTube Analytics API (targeted queries) | Real-time, flexible dimensions/metrics. Better than Reporting API for our scale. |
| Hosting | Linode (same server) | Keeps everything centralized with existing PBS infrastructure. |

---

## Sequencing & Priority

1. **Phase 1** (API Setup) → unblocks everything
2. **Phase 2** (Python Collector) → gets data flowing, enables ad-hoc analysis immediately
3. **Phase 3** (n8n Automation) → removes manual collection burden
4. **Phase 4** (Streamlit Dashboard) → gives Jenny self-service access to insights
5. **Phase 5** (Advanced Analysis) → ongoing value extraction

---

## Relationship to Other PBS Projects

- **PBS Content Hub (Phase 5):** Dashboard could eventually be a tab within the Content Hub
- **Authelia SSO:** Will protect the Streamlit dashboard once rolled out
- **WordPress-to-MySQL sync:** Could correlate website recipe traffic with YouTube performance
- **Instagram automation:** Cross-platform analysis potential (YouTube + Instagram data in one place)

---

*Next Step: Phase 1 — Set up Google Cloud project and enable YouTube APIs*
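As a taste of the advanced analysis this pipeline enables, here's a minimal sketch of surfacing drop-off points from a retention curve like the rows stored in `video_retention`. The `find_dropoffs` helper and the toy curve are hypothetical, not part of the plan:

```python
# Hypothetical helper: given (elapsed_ratio, audience_watch_ratio) pairs
# like the rows in video_retention, return the positions where retention
# falls fastest -- candidate spots to inspect in the actual footage.

def find_dropoffs(curve: list[tuple[float, float]], top_n: int = 3) -> list[float]:
    """Return the elapsed_ratio positions with the steepest retention drops."""
    ordered = sorted(curve)  # sort by elapsed_ratio
    drops = []
    for (r0, w0), (r1, w1) in zip(ordered, ordered[1:]):
        drops.append((w0 - w1, r1))  # positive = viewers lost in this segment
    drops.sort(reverse=True)
    return [ratio for _, ratio in drops[:top_n]]

# Toy curve: a sharp drop just after the intro (0.10 -> 0.15), then gradual decay.
curve = [(0.05, 0.95), (0.10, 0.90), (0.15, 0.60),
         (0.20, 0.58), (0.25, 0.55), (0.30, 0.54)]
worst = find_dropoffs(curve, top_n=1)  # -> [0.15]
```

The same idea, run across every video in the database, feeds directly into the "What's Working" dashboard page and the Phase 5 retention-pattern work.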