If you have tried running LLMs locally, you know that VRAM is the only currency that matters. Whether you are using an RTX 3090 or a newer RTX 5090, the goal is always to fit the largest, smartest model possible into your available memory. We’ve discussed this before when we looked at choosing quantization formats, but there is a hidden tax that hits you the moment you start using long contexts: the KV cache.
The VRAM Gap: Why Your Model “Grows” During Inference

You might notice a frustrating discrepancy when checking your GPU usage. In our own testing, we saw a Llama-3.3-70B-Instruct model that occupied 26 GB on disk but ballooned to 37 GB during inference. That ~11 GB gap isn’t a leak; it is the KV cache pre-allocated for the context window.
The KV (Key-Value) cache stores the key and value vectors that the attention layers compute for every previous token, so the GPU doesn’t have to recompute them each time it generates a new token. While this speeds up generation, the cache grows linearly with the length of the conversation and can consume massive amounts of VRAM. For indie developers and tinkerers, this is where “long context” becomes a hardware wall. Even with an RTX 5090’s 32 GB pool, you can quickly run out of memory if your cache is stored in high-precision formats.
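To make that growth concrete, here is a minimal back-of-the-envelope sketch of the usual sizing rule: two vectors (one key, one value) per layer, per KV head, per token. The function name `kv_cache_bytes` is ours, and the layer count, KV head count, and head dimension are assumed Llama-3-70B-style values; check your model’s config before relying on them. Real runtimes also add their own allocation overhead, which this ignores.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_element=2):
    """Estimate KV cache size for one sequence.

    Each token stores one key vector and one value vector (hence the 2)
    for every layer and every KV head.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return per_token * num_tokens


# Assumed Llama-3-70B-style shape (illustrative, not read from a real config).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

for context in (8_192, 32_768, 131_072):
    size_gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, context) / 2**30
    print(f"{context:>7} tokens -> {size_gib:.1f} GiB at FP16")
```

Under these assumptions each token costs 320 KiB at FP16, so a 32K-token window alone needs about 10 GiB before the model weights are counted.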
Breaking the Memory Wall with Quantization
Quantizing the model weights (e.g., moving from FP16 to INT4) saves space for the model itself, but KV cache quantization targets the memory used during the conversation.
Recent research highlights how critical this is for efficiency. A review on KV cache compression notes that optimizing this cache is crucial for enhancing performance during inference. By reducing the precision of the keys and values stored in memory, we can fit significantly more tokens into the same amount of VRAM.
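As a rough illustration of what “reducing the precision” means, the sketch below applies plain symmetric 8-bit quantization to a single made-up key vector and then restores it. This is a generic textbook scheme chosen for clarity, not the method used by any particular runtime or paper mentioned here; the helper names and the sample values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric quantization: map floats to integer codes in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # one scale per vector
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Made-up key vector standing in for one cached entry.
key_vector = [0.12, -1.50, 0.73, 0.02, -0.41, 2.54, -0.09, 1.10]

codes, scale = quantize_int8(key_vector)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(key_vector, restored))

print("codes:", codes)
print(f"scale={scale:.4f}, max abs error={max_error:.4f}")
```

Each value now needs one byte instead of two (plus a single shared scale), and the rounding error per element is bounded by half the scale. Production formats use finer-grained scaling to keep that error small on real activations.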
The impact is tangible. For those using NVIDIA Blackwell GPUs, NVFP4 KV cache quantization can reduce the memory footprint by 50% compared to FP8. This allows developers to double their context length or batch size while maintaining high accuracy, with a reported loss of less than 1% on benchmarks like MMLU-Pro and LiveCodeBench.
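The arithmetic behind “double your context” is simple enough to sketch. The snippet below counts how many tokens fit in a fixed cache budget at different element widths. The 8 GiB budget is a hypothetical figure, the model shape is an assumed Llama-3-8B-style configuration, and the calculation deliberately ignores the small per-block scaling factors that 4-bit formats store alongside the data.

```python
def tokens_that_fit(cache_budget_bytes, num_layers, num_kv_heads, head_dim, bits_per_element):
    """How many tokens of KV cache fit in the budget at a given element width."""
    elements_per_token = 2 * num_layers * num_kv_heads * head_dim  # keys + values
    bits_per_token = elements_per_token * bits_per_element
    return (cache_budget_bytes * 8) // bits_per_token


BUDGET = 8 * 2**30                        # hypothetical 8 GiB reserved for the cache
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128   # assumed 8B-class model shape

for label, bits in (("FP16", 16), ("FP8", 8), ("4-bit", 4)):
    tokens = tokens_that_fit(BUDGET, LAYERS, KV_HEADS, HEAD_DIM, bits)
    print(f"{label:>5}: {tokens:,} tokens")
```

Halving the bits per element doubles the token count each time, which is where the “double the context or the batch size” framing comes from.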
Scaling Toward “Infinite” Context

For those pushing toward extreme contexts, the industry is moving toward even more aggressive compression. Research into KVQuant explores the possibility of reaching context lengths of 10 million tokens through advanced quantization techniques. Other approaches, such as KVC-Q, focus on high-fidelity dynamic quantization to ensure that the model doesn’t lose its “train of thought” as the cache shrinks.
In our work, we’ve found that the jump to 32 GB of VRAM on the RTX 5090 is a critical threshold. It shifts the experience from compromising model quality just to get it to load, to actually having room to breathe. However, without KV cache quantization, even 32 GB can be swallowed by a single long-document analysis task.
The Local Developer’s Trade-off
The right strategy depends on your specific workload. As noted in recent inference research, no single approach wins across every GPU and batch size.
If you are running a 7B or 8B model (the current “sweet spot” for local inference), you have more breathing room. But if you are pushing 70B models on consumer gear, KV cache quantization is no longer optional; it is the only way to handle long contexts without triggering an out-of-memory (OOM) error.
By combining weight quantization with KV cache optimization, we can move high-performance AI out of expensive cloud APIs and into the home lab. It transforms the RTX 5090 from a tool that just “runs” a model into a workstation capable of processing entire codebases locally.



