Taming the cadvisor leak and cleaning up LLM garbage

What we shipped on 2026-07-01

We spent most of today fighting an OOM cascade that nearly took down the WSL2 VM, starting with our attempt to cap cadvisor memory so it couldn’t starve the system (PR #2019). The proximate cause was a slow leak: cadvisor climbed to ~4 GB over 19 hours, eroding headroom until routine workload spikes triggered [Errno 12] Cannot allocate memory across the board. We initially capped it at 512m, but live measurement proved that too tight; a fresh restart climbed to ~490 MB almost at once for the initial cgroup scan and metric registry, and the container went straight into OOM restarts. We’ve since raised the cap to 1g (PR #2021), which is roughly 2x the measured working set while still 4x below the runaway peak that crashed the pipeline.
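
The sizing argument is plain arithmetic, and a few lines of Python make it easy to re-check if the measurements change. This is only a sketch: it treats the ~490 MB and ~4 GB figures from the post as MiB and GiB, and the helper name headroom_ratios is ours, not something in the repo.

```python
MIB_PER_GIB = 1024


def headroom_ratios(cap_mib, working_set_mib, peak_mib):
    """Return (cap / working set, peak / cap) for a proposed memory cap."""
    return cap_mib / working_set_mib, peak_mib / cap_mib


# Figures taken from the post; the unit interpretation is an assumption.
startup_mib = 490             # measured shortly after a fresh restart
peak_mib = 4 * MIB_PER_GIB    # runaway peak after ~19 hours

for label, cap_mib in (("512m", 512), ("1g", 1 * MIB_PER_GIB)):
    over_ws, under_peak = headroom_ratios(cap_mib, startup_mib, peak_mib)
    print(f"cap {label}: {over_ws:.2f}x working set, {under_peak:.0f}x below peak")
```

At 512m the cap sits only about 4% above the startup working set, which matches the restart loop we saw; at 1g it is about 2x.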

While the infra was stabilizing, we caught a disturbing trend in our content generation: two tasks produced complete garbage, one essentially writing a Wikipedia article on punctuation marks and the other leaking audit logs (PR #2016). The culprit wasn’t RAG pollution in the way we suspected, but a latent defect in two_pass_writer._embed_and_fetch_snippets. It was querying the entire embeddings table without a source_table filter, so session transcripts and brain ops-logs were being pulled in as context for content drafts. We’ve now scoped those snippets strictly to content sources and stripped out prompt-echo preambles.
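
To make the shape of that fix concrete, here is a minimal sketch of a snippet query scoped by source_table. It is not the real _embed_and_fetch_snippets: it uses an in-memory SQLite table as a stand-in for the embeddings store, skips similarity ranking entirely, and the allowlist values and the helper names fetch_snippets and demo_connection are hypothetical.

```python
import sqlite3

# Hypothetical allowlist; the real source_table values are not listed in this post.
CONTENT_SOURCES = ("posts", "research_notes")


def fetch_snippets(conn, allowed_sources=CONTENT_SOURCES, limit=5):
    """Fetch snippet rows, restricted to an allowlist of source tables."""
    placeholders = ", ".join("?" for _ in allowed_sources)
    sql = (
        "SELECT source_table, text FROM embeddings "
        f"WHERE source_table IN ({placeholders}) "
        "ORDER BY id LIMIT ?"
    )
    # Values are bound as parameters; only the placeholder count is interpolated.
    return conn.execute(sql, (*allowed_sources, limit)).fetchall()


def demo_connection():
    """Build a tiny stand-in embeddings table with mixed sources."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE embeddings (id INTEGER PRIMARY KEY, source_table TEXT, text TEXT)"
    )
    conn.executemany(
        "INSERT INTO embeddings (source_table, text) VALUES (?, ?)",
        [
            ("posts", "draft about container memory limits"),
            ("session_transcripts", "chat log that should never reach a draft"),
            ("research_notes", "notes on cgroup accounting"),
            ("ops_logs", "audit entry"),
        ],
    )
    return conn


for source, text in fetch_snippets(demo_connection()):
    print(source, "|", text)
```

An explicit allowlist fails closed: a new source table stays out of drafts until someone adds it on purpose.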

We also tightened up the topic sweep pipeline after a run_niche_topic_sweep job failed due to truncated LLM JSON (PR #2020). The triage model hallucinated an explanation about podcast scripts, but the reality was simpler: the model hit a length cap and stopped mid-object. We’ve updated llm_final_score to handle JSONDecodeError by logging a warning and degrading gracefully to embedding pre-rank rather than bubbling the error up and killing the entire job.
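
The degrade-gracefully pattern looks roughly like the sketch below. The function name llm_final_score_sketch, its signature, and the (topic, pre-rank score) candidate shape are assumptions for illustration; the real llm_final_score may differ.

```python
import json
import logging

logger = logging.getLogger("topic_sweep")


def llm_final_score_sketch(candidates, raw_llm_response):
    """Order topics by LLM score, falling back to the embedding pre-rank.

    candidates: list of (topic, prerank_score) tuples (hypothetical shape).
    raw_llm_response: text expected to be a JSON object of topic -> score.
    """
    prerank = [t for t, _ in sorted(candidates, key=lambda c: c[1], reverse=True)]
    try:
        scores = json.loads(raw_llm_response)
    except json.JSONDecodeError as exc:
        # Truncated output (for example a length cap hit mid-object) lands here.
        logger.warning("LLM score JSON unparseable (%s); using embedding pre-rank", exc)
        return prerank
    if not isinstance(scores, dict):
        logger.warning("LLM score JSON was not an object; using embedding pre-rank")
        return prerank
    return sorted(prerank, key=lambda topic: scores.get(topic, 0.0), reverse=True)


candidates = [("wsl2 memory", 0.2), ("cgroup basics", 0.8)]
print(llm_final_score_sketch(candidates, '{"wsl2 memory": 0.9, "cgroup basics": 0.1}'))
print(llm_final_score_sketch(candidates, '{"wsl2 memory": 0.9, "cgroup bas'))
```

The second call simulates a response cut off mid-object and returns the pre-rank order instead of raising.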

On the CLI side, we chased a phantom OAuth credential error that was actually a network fluke (PR #2017). The “No CLI OAuth credentials configured” message was masking a ConnectionResetError: [WinError 64] caused by the WSL2 host port-proxy resetting SSL handshakes with Postgres. We’ve separated these reports so we aren’t hunting for missing keys when the network is just flaky. We also fixed a ModuleNotFoundError where pipeline resume and regen crashed outside the repo root because the editable install didn’t include the brain/ directory in its .pth file (PR #2011).
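
For the first of those fixes, the idea is to stop a transport failure from being reported as missing configuration. A rough sketch follows, with hypothetical exception classes and a hypothetical fetch callable standing in for the Postgres lookup; a real database driver may wrap the reset in its own error type, which would need catching as well.

```python
class CredentialsMissing(Exception):
    """The lookup succeeded but no credentials are stored."""


class BackendUnreachable(Exception):
    """The lookup itself failed at the network layer."""


def load_oauth_credentials(fetch):
    """Call fetch() and keep network failures separate from missing credentials."""
    try:
        row = fetch()
    except (ConnectionError, TimeoutError) as exc:
        # ConnectionResetError is a subclass of ConnectionError.
        raise BackendUnreachable(f"could not reach the credentials store: {exc!r}") from exc
    if row is None:
        raise CredentialsMissing("No CLI OAuth credentials configured")
    return row


def flaky_fetch():
    # Simulates the host port-proxy resetting the connection mid-handshake.
    raise ConnectionResetError(64, "The specified network name is no longer available")


try:
    load_oauth_credentials(flaky_fetch)
except BackendUnreachable as exc:
    print("network problem:", exc)
```

With the two cases split, only a genuinely empty lookup produces the credentials message.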

Finally, we addressed some embarrassing drift in our public onboarding docs (PR #2018). A fresh poindexter setup was hitting broken commands and un-pullable model defaults. To stop this from happening again, we added test_package_metadata, which parses the README’s fenced blocks and asserts that every command resolves against the Click command tree.
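
The check can be sketched in a few lines. This is not the actual test_package_metadata: a nested dict stands in for the Click command tree so the example runs without Click installed, the tree contents are illustrative, and the helper names readme_commands and resolves are ours.

```python
import re

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fence

# Hypothetical stand-in for the CLI's command tree (group -> subcommands).
COMMAND_TREE = {"setup": {}, "pipeline": {"resume": {}, "regen": {}}}


def readme_commands(readme_text, prog="poindexter"):
    """Return the argument lists of every prog invocation in fenced blocks."""
    pattern = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)
    commands = []
    for block in pattern.findall(readme_text):
        for line in block.splitlines():
            words = line.strip().lstrip("$ ").split()
            if words and words[0] == prog:
                commands.append(words[1:])
    return commands


def resolves(words, tree=COMMAND_TREE):
    """True if the leading words name a command path that exists in the tree."""
    node, depth = tree, 0
    for word in words:
        if word.startswith("-") or not node:
            break  # options, or positional arguments to a leaf command
        if word not in node:
            return False
        node, depth = node[word], depth + 1
    return depth > 0  # must name at least one command


readme = "\n".join([
    "Getting started", FENCE + "bash",
    "$ poindexter setup", "poindexter pipline regen", FENCE,
])
for words in readme_commands(readme):
    print(" ".join(words), "->", "ok" if resolves(words) else "BROKEN")
```

Here the misspelled pipline line is flagged as broken while setup passes.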

The VM is stable now, and the writer is finally staying on topic. We’re still tuning the balance between LLM autonomy and strict filtering, but the system is no longer eating itself from the inside out.

Auto-compiled by Poindexter from today’s commits and PRs. See the work: github.com/Glad-Labs/poindexter.