
If you have tried running LLMs locally, you know that VRAM is the only currency that matters. Whether you are using an RTX 3090 or a newer RTX 5090, the goal is always to fit the largest, smartest model possible into your available memory pool.
The math for native models is brutal. A 70 billion parameter model typically requires 140 GB of memory when stored in 16-bit floating point (FP16). That is far beyond the reach of any single consumer GPU. Even smaller models are heavy: a 7B parameter model in FP32 takes 28 GB, and 14 GB in FP16.
Quantization is the technical workaround that makes local AI viable. It allows us to trade a small amount of precision for a massive reduction in memory footprint.
How Quantization Works

At its simplest, quantization converts weights from high precision (like FP16) to low precision (like INT4). Instead of using 16 bits to represent a number, we use four. This immediately reduces memory by 75%, often with an accuracy loss below 1%.
This shift changes the hardware requirements entirely. A 7B model that took 14 GB in FP16 drops to 3.5 GB in INT4–small enough to run on a basic laptop. This is why we can now fit massive models, such as LLaMA-30B, on an RTX 3090 GPU.
Beyond Naïve Quantization: GPTQ, AWQ, and QLoRA
Early quantization was “naïve,” meaning it applied the same reduction across all weights regardless of their importance. Modern techniques are smarter.
GPTQ and GGML have become standards for shrinking models while maintaining performance. Another breakthrough is AWQ, which found that protecting just 1% of critical weights enables nearly lossless 4-bit quantization, allowing LLaMA-70B to fit on a single RTX 4090 (24GB).
For those who want to do more than just run models, QLoRA combines 4-bit quantization with LoRA fine-tuning. This allows you to fine-tune a 65B model on a single 48GB GPU with quality comparable to full-precision tuning.
The RTX 5090 and the 32GB Threshold

In our own testing and deployments, we’ve found that parameter count is a proxy, not a rule. A model only runs entirely on the GPU if its quantized VRAM weight size fits the available pool. For the RTX 5090, that pool is approximately 32GB.
The jump from 24GB to 32GB isn’t just a linear upgrade; it’s a threshold. It moves us away from aggressive quantization–where quality suffers–toward running mid-sized models with higher fidelity. For example, on a 5090: - Qwen2.5-32B at Q6_K (~26GB) fits comfortably. - DeepSeek-R1-32B at Q4_K_M (~20GB) fits and provides strong reasoning.
However, 70B models remain marginal. While INT4 quantization lets you run 70B models on consumer hardware, a Q4_K_M 70B still struggles to fit within the 32GB limit of a single card without further compromise. If you are weighing different formats, we’ve broken down the trade-offs between GGUF Q4_K_M vs Q5_K_M vs Q8_0.
The New Local Standard
The industry is shifting away from chasing raw parameter counts and toward inference efficiency. We are seeing a “sweet spot” emerge around the 7B and 8B parameter range–like Llama 3.1 (8B)–which balances reasoning capability with hardware requirements.
By combining quantization with efficient memory management, such as PagedAttention in engines like vLLM, we can move high-performance AI out of the cloud and onto local workstations. For most developers building coding assistants or internal tools, a quantized local model is now a viable, private alternative to expensive APIs.



