If you are running LLMs locally, you quickly realize that VRAM is the only currency that matters. Whether you are using an RTX 3090 or the newer RTX 5090, the goal is always the same: fit the largest, smartest model possible into your memory pool without the output turning into gibberish.
This is where quantization comes in. In simple terms, quantization stores a model’s weights at lower precision, which shrinks the memory footprint and usually speeds up inference, because less data has to be read from memory for each token. For those of us using GGUF files via tools like Ollama or LM Studio, the choice usually boils down to Q4_K_M, Q5_K_M, or Q8_0.
The VRAM Math

The most critical rule for local inference is that a model can only run entirely on your GPU if its quantized weights fit within your available VRAM. For example, we found that the RTX 5090’s 32GB pool represents a major threshold; it allows us to move away from aggressive quantization and run mid-sized models with higher fidelity.
However, you cannot just look at the model file size. When we ran Llama-3.3-70B-Instruct on our hardware, we noticed that a 26GB model file actually consumed 37GB during inference. That gap is mostly the KV cache, the memory pre-allocated to hold attention keys and values for the whole context window, plus smaller runtime buffers. If you don’t leave room for the cache, part of the model will spill into system RAM and your tokens per second will crater.
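The size of that cache can be estimated before loading anything. The sketch below uses the standard per-token formula (one key and one value vector per layer and per KV head) and assumes an FP16 cache. The shape numbers (80 layers, 8 KV heads, head dimension 128) are our assumption for a Llama-3-class 70B model, not something measured in the run above, and real runtimes add compute buffers on top.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_element=2):
    """Estimate KV cache size: keys + values for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_element

# Assumed shape for a Llama-3-class 70B model: 80 layers, 8 KV heads, head dim 128.
for ctx in (8_192, 32_768, 131_072):
    size = kv_cache_bytes(80, 8, 128, ctx)
    print(f"{ctx:>7} tokens -> {size / GIB:.1f} GiB")
# ->    8192 tokens -> 2.5 GiB
# ->   32768 tokens -> 10.0 GiB
# ->  131072 tokens -> 40.0 GiB
```

Under these assumptions the cache grows linearly with context length, so a gap of around 11GB is in the range you would expect from a context window of roughly 32k tokens. Shrinking the context window is often the cheapest way to get a model back onto the GPU.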
Breaking Down the Formats

Q4_K_M (The Practical Standard)
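Q4_K_M is generally the “goldilocks” zone for most tinkerers. It is a nominally 4-bit format (slightly more on average, since scales and a few higher-precision tensors are stored alongside the weights), which drastically reduces memory requirements. As noted by ML Journey, a 7B parameter model that would normally require 28GB of RAM at full precision (FP32, 4 bytes per weight) can run in about 6GB when quantized to 4-bit.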
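The arithmetic behind those numbers is just parameters times bits per weight. The sketch below is a rough estimator, not an exact file-size calculator: the Q8_0 figure of 8.5 bits follows from its layout (32 eight-bit weights plus one 16-bit scale per block), while the Q4_K_M and Q5_K_M values are approximate averages we are assuming here, and they vary somewhat by architecture.

```python
# Approximate average bits per weight. The K-quant values are assumed averages.
BITS_PER_WEIGHT = {"FP32": 32.0, "FP16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.85}

def weight_size_gb(n_params, fmt):
    """Estimated size of the weights alone, in decimal gigabytes."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9

for fmt in BITS_PER_WEIGHT:
    print(f"7B @ {fmt:<7} ~ {weight_size_gb(7e9, fmt):.1f} GB")
```

For a 7B model this gives 28GB at FP32 and a little over 4GB at Q4_K_M for the weights alone; the rest of the roughly 6GB working figure is KV cache and runtime overhead.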
In our own pipeline, we use Q4_K_M for models like DeepSeek-R1-32B (~20GB), which fits comfortably on a 32GB card while leaving plenty of room for the KV cache and system overhead.
Q5_K_M (The Quality Bump)
If you have VRAM to spare, Q5_K_M is where you start seeing a noticeable improvement in reasoning and nuance. It provides a tighter approximation of the original weights than Q4. We typically lean toward this or higher for models in the 8B parameter range (the current “sweet spot” for local inference), because at that size the extra memory cost is small compared to the gain in stability.
Q8_0 (The Near-Lossless Option)
Q8_0 is essentially a reference point. It preserves almost all the quality of the original FP16 model but takes up significantly more space. While it’s great for critical tasks, it often pushes mid-sized models out of consumer VRAM. For most developers, the jump from Q5 to Q8 yields diminishing quality returns in exchange for a large increase in memory pressure.
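To see why 8-bit is close to lossless, it helps to look at what block quantization does. The following is a simplified sketch in the style of Q8_0, assuming blocks of 32 weights that share one scale and are rounded to the nearest signed 8-bit integer. It is an illustration only: the real format stores the scale as a 16-bit float, and the function names here are made up for this example.

```python
import random

BLOCK_SIZE = 32  # Q8_0-style: one shared scale per block of 32 weights

def quantize_block(block):
    """Return (scale, integer codes) using symmetric round-to-nearest."""
    amax = max(abs(x) for x in block)
    scale = amax / 127 if amax > 0 else 1.0  # map the largest weight to +/-127
    return scale, [round(x / scale) for x in block]

def dequantize_block(scale, codes):
    """Reconstruct approximate weights from the scale and integer codes."""
    return [scale * q for q in codes]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.6f}  max abs error={max_err:.6f}")
```

The reconstruction error per weight is bounded by half the scale, which at 8 bits is under half a percent of the largest weight in the block. Lower-bit formats have far fewer levels per block, so the same rounding step costs more accuracy.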
How to Choose Based on Your Hardware

Your choice depends entirely on which “threshold” you are trying to cross.
If you are limited to 8GB or 12GB of VRAM, Q4_K_M is your only realistic path for most modern models. You sacrifice some precision, but the alternative is a model that won’t load or runs at a snail’s pace on your CPU.
If you are running an RTX 5090 with 32GB of VRAM, you have more flexibility. We adopted a strategy of using Q4_K_M for models like Qwen2.5-32B (~20GB) to ensure strong writing quality while staying within the GPU pool. For us, the priority is keeping the weights on the GPU; we would rather run a slightly smaller model at Q8 than a 70B model that either has to be squeezed to an unusably low bit depth or spills over into CPU offloading.
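That decision rule can be written down as a small helper. This is a sketch of the reasoning rather than a replacement for a proper VRAM calculator: the bits-per-weight values are approximate averages, the 1.5GB overhead default is a placeholder we picked for illustration, and the KV cache size has to be estimated separately for your model and context length.

```python
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.85}  # approximate averages

def pick_quant(n_params, vram_gb, kv_cache_gb, overhead_gb=1.5):
    """Return the highest-precision format whose weights fit next to the cache, or None."""
    budget_gb = vram_gb - kv_cache_gb - overhead_gb
    for fmt in ("Q8_0", "Q5_K_M", "Q4_K_M"):  # try highest precision first
        weights_gb = n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9
        if weights_gb <= budget_gb:
            return fmt
    return None  # nothing fits fully on the GPU: pick a smaller model

print(pick_quant(8e9, vram_gb=12, kv_cache_gb=1.0))    # -> Q8_0
print(pick_quant(32e9, vram_gb=32, kv_cache_gb=8.0))   # -> Q4_K_M
print(pick_quant(70e9, vram_gb=32, kv_cache_gb=4.0))   # -> None
```

With a long-context cache budget, a 32B model on a 32GB card lands on Q4_K_M, and a 70B model does not fit at any of the three formats, which is the point where dropping to a smaller model beats offloading.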
Final Verdict
For most use cases, stick with Q4_K_M. It provides the best balance of speed and quality on consumer hardware. Only move to Q5_K_M or Q8_0 if you have verified your VRAM headroom using a VRAM calculator and confirmed that the KV cache won’t push you into system RAM. The goal isn’t to have the highest bit depth; it’s to fit the most capable model your hardware can handle without sacrificing inference speed.



