Fighting VRAM collisions and API drift

What we shipped on 2026-06-21

The “GPU metrics STALE” alarm in PR #1796 was a wake-up call: we weren’t just missing data, we were blind. After our deploy-clone cutover, the poindexter-gpu-exporter container couldn’t reach the NVIDIA driver on Windows Docker Desktop, leaving us with “Driver Not Loaded” and zero metrics while the actual hardware was humming along perfectly at 40 °C. To fix it, we sourced nvidia_gpu directly from the host exporter and repaired the gpu-scraper.py endpoints (PR #1796).

While monitoring was blind, workloads were colliding in VRAM. We share a single 32 GB GPU across Ollama, SDXL, and wan video renders. Media renders already took a lock, but scheduled worker jobs for SEO and newsletter research were hitting the LLM via dispatcher.dispatch_complete without acquiring one (PR #1794). The result was a classic crash: gemma-4-31B pinned 19 GB of VRAM while a render tried to claim its own 14.8 GB, and 19 + 14.8 = 33.8 GB does not fit in 32 GB. We solved this by implementing a reentrant gpu.lock in gpu_scheduler.py, tracked with a module-level ContextVar. Now, any async call chain already holding the GPU can pass through without triggering a redundant evict-and-reacquire cycle (PR #1794).

We also spent today cleaning up our API contracts as part of the ongoing response-contract sweep (PR #745). We’ve been moving toward a canonical offset envelope, {items, total, limit, offset}, to stop the proliferation of bespoke shapes. This meant converting GET /api/topics/proposals (PR #1797), the data-plane surface list (PR #1795), and video episodes (PR #1793) to this standard format. For the data-plane, we typed items as dict[str, Any], because forcing a named model across five different table schemas would have been over-engineering (PR #1795).
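
For readers who want to picture the envelope, here is a rough sketch using only the standard library. The four field names come from the envelope above; OffsetPage and paginate are hypothetical names, and the real endpoints may well use a different modelling library.

```python
from dataclasses import asdict, dataclass
from typing import Any, Generic, Sequence, TypeVar

T = TypeVar("T")


@dataclass
class OffsetPage(Generic[T]):
    """Hypothetical model for the {items, total, limit, offset} envelope."""
    items: list[T]
    total: int   # size of the full result set, not of this page
    limit: int
    offset: int


def paginate(rows: Sequence[T], limit: int = 50, offset: int = 0) -> OffsetPage[T]:
    """Slice an in-memory sequence into one envelope page."""
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset must be >= 0")
    return OffsetPage(
        items=list(rows[offset:offset + limit]),
        total=len(rows),
        limit=limit,
        offset=offset,
    )


# Heterogeneous rows can stay as plain dicts, i.e. OffsetPage[dict[str, Any]].
rows: list[dict[str, Any]] = [{"id": i} for i in range(5)]
page = asdict(paginate(rows, limit=2, offset=4))
```

Keeping total separate from len(items) is what lets a client work out whether another page exists without a second request.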

Scaling issues also crept into our video endpoint. list_video_episodes was globbing every *.mp4 in VIDEO_DIR and returning all of them, so the response grew linearly with the library, a problem we’d already solved for podcasts (PR #1798). We mirrored the podcast logic exactly, introducing limit and offset query parameters to keep the response size predictable (PR #1798).
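
A bounded directory listing might look roughly like the sketch below. The function name matches the one mentioned above, but the signature, the newest-name-first sort order, and the per-item fields are assumptions made for illustration, not the shipped handler.

```python
from pathlib import Path


def list_video_episodes(video_dir: Path, limit: int = 20, offset: int = 0) -> dict:
    """Return one page of *.mp4 files in the offset envelope shape."""
    # Assumed ordering: descending by filename. A stable order matters so
    # that consecutive pages neither overlap nor skip files.
    files = sorted(video_dir.glob("*.mp4"), key=lambda p: p.name, reverse=True)
    page = files[offset:offset + limit]
    return {
        # Only the files on this page are stat()ed and serialized.
        "items": [{"filename": p.name, "size_bytes": p.stat().st_size} for p in page],
        "total": len(files),
        "limit": limit,
        "offset": offset,
    }
```

The glob itself still walks the whole directory, so this bounds the response size and the per-file work rather than the directory scan.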

On the infra side, we had to retire port 15432 because it fell into a Windows Hyper-V reserved range, causing Docker to silently drop the publish. We’ve shifted the local Postgres host port to 5433 and updated setup.py to ensure fresh installs don’t default to a dead port (PR #1792). To prevent this drift from happening again, we added TestLocalDbPortInvariant, which reads docker-compose.local.yml at test time to assert the constants match (PR #1792).
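
The invariant check can be approximated as follows. This is a sketch under stated assumptions: LOCAL_DB_HOST_PORT, published_host_ports, and check_port_invariant are hypothetical names, the compose snippet is a made-up sample, and a regex over short-syntax port entries stands in for whatever parsing the real TestLocalDbPortInvariant does. The real test reads docker-compose.local.yml from disk; the sketch takes the text as an argument so it runs without the file.

```python
import re

LOCAL_DB_HOST_PORT = 5433  # hypothetical constant name; value from the post

# Made-up sample standing in for docker-compose.local.yml.
SAMPLE_COMPOSE = """\
services:
  postgres:
    image: postgres:16
    ports:
      - "5433:5432"
"""

# Matches short-syntax entries such as - "5433:5432" or - 127.0.0.1:5433:5432
_PORT_LINE = re.compile(
    r"""^\s*-\s*["']?(?:[\d.]+:)?(\d+):(\d+)(?:/\w+)?["']?\s*$""",
    re.MULTILINE,
)


def published_host_ports(compose_text: str, container_port: int = 5432) -> set[int]:
    """Host ports that the compose text publishes to the given container port."""
    return {
        int(host)
        for host, container in _PORT_LINE.findall(compose_text)
        if int(container) == container_port
    }


def check_port_invariant(compose_text: str, expected: int = LOCAL_DB_HOST_PORT) -> None:
    """Fail loudly if the compose file and the code constant disagree."""
    ports = published_host_ports(compose_text)
    assert ports == {expected}, (
        f"compose publishes {sorted(ports)}, constants expect {expected}"
    )
```

Calling check_port_invariant(SAMPLE_COMPOSE) passes, while the same text with the host port changed back to 15432 raises an AssertionError naming both values.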

Finally, we closed the gap between detection and action for our compose stack. The brain could detect drift, but it couldn’t fix it because running docker compose up inside a Linux container mangles Windows bind sources. We now route these self-heal requests to the host Recovery Agent on port 9841, which executes start-stack.sh up -d --no-build where the binds actually resolve (PR #1791).

The API is finally starting to feel like a cohesive product rather than a collection of endpoints. Now that GPU access is serialized and the monitoring is restored, we can actually trust the hardware under load.

Auto-compiled by Poindexter from today’s commits and PRs. See the work: github.com/Glad-Labs/poindexter.