Quantization

Simon Budziak, CTO
Quantization is a model compression technique that reduces the numerical precision of a neural network's weights and activations, dramatically shrinking model size and accelerating inference with minimal impact on quality. It is one of the most practical optimizations for deploying large language models in production, especially on consumer hardware or at scale.
In standard training, model parameters are stored as 32-bit floating-point numbers (FP32), providing high precision but consuming significant memory. Quantization converts these to lower-precision formats:
- FP16 (Half Precision): 16-bit floating point, cutting memory usage in half with negligible quality loss. Standard for modern GPU inference.
- INT8 (8-bit Integer): Reduces size by 4x compared to FP32. Requires careful calibration but achieves excellent quality-speed trade-offs on specialized hardware.
- INT4 (4-bit Integer): Reduces size by 8x with more noticeable quality degradation but enables running massive models (70B+) on consumer GPUs.
- Binary/Ternary: Extreme quantization (1-2 bits) for edge devices, trading significant quality for maximum compression.
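To make the precision trade-off concrete, here is a minimal sketch of symmetric per-tensor quantization in NumPy: FP32 weights are mapped to INT8 through a single scale factor, then dequantized to measure the reconstruction error. Real libraries add per-channel scales, zero points, and calibration data, so treat this as illustrative only.

```python
import numpy as np

def quantize_symmetric(weights: np.ndarray, num_bits: int = 8):
    """Map FP32 weights to signed integers using one per-tensor scale."""
    qmax = 2 ** (num_bits - 1) - 1           # 127 for INT8
    scale = np.abs(weights).max() / qmax      # single scale for the whole tensor
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation of the original weights."""
    return q.astype(np.float32) * scale

# Toy example: a random weight matrix standing in for one layer.
w = np.random.randn(256, 256).astype(np.float32)
q8, scale = quantize_symmetric(w, num_bits=8)
w_hat = dequantize(q8, scale)

print("mean abs reconstruction error (INT8):", np.abs(w - w_hat).mean())
print("bytes FP32:", w.nbytes, "-> bytes INT8:", q8.nbytes)  # 4x smaller
```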
- Memory Reduction: A 70B-parameter model at FP32 requires ~280GB of memory. At INT8 it fits in ~70GB, and at INT4 just ~35GB, small enough for high-end consumer hardware (see the arithmetic sketched after this list).
- Speed Improvement: Lower precision arithmetic is computationally cheaper. INT8 inference can be 2-4x faster than FP32, and specialized hardware (like Tensor Cores) accelerates quantized operations further.
- Cost Efficiency: Smaller models mean fewer servers, less memory, and lower cloud bills. For API providers, quantization directly impacts profitability.
- Democratization: Enables running powerful models locally without expensive infrastructure, crucial for privacy-sensitive applications and offline use cases.
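The memory figures above follow directly from bytes per parameter. A quick back-of-the-envelope sketch, counting weights only (the KV cache and activations add more in practice):

```python
# Weights-only memory footprint: parameter count x bytes per parameter.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(num_params: float, precision: str) -> float:
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for precision in BYTES_PER_PARAM:
    print(f"70B model at {precision}: ~{weight_memory_gb(70e9, precision):.0f} GB")
# 70B model at FP32: ~280 GB
# 70B model at FP16: ~140 GB
# 70B model at INT8: ~70 GB
# 70B model at INT4: ~35 GB
```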
- Post-Training Quantization (PTQ): Quantize a trained model without retraining. Fast and simple, but may degrade quality at very low precision (e.g., INT4).
- Quantization-Aware Training (QAT): Train the model with quantization in mind, simulating low-precision during training to improve robustness. Produces higher-quality INT8/INT4 models.
- GPTQ: Advanced post-training quantization specifically designed for LLMs, using calibration data to minimize quality loss. Widely used for 4-bit quantized models.
- AWQ (Activation-aware Weight Quantization): Protects important weights from aggressive quantization based on activation patterns, preserving quality better than uniform quantization.
- GGUF/GGML: Formats optimized for CPU inference with quantization, enabling LLM deployment on machines without GPUs.
- QLoRA: Combines 4-bit quantization with LoRA fine-tuning, allowing customization of huge models on modest hardware (e.g., fine-tuning a 65B-70B model on a single 48GB GPU; a minimal loading sketch follows this list).
- Mixed Precision: Use different precision levels for different layers (e.g., FP16 for attention layers, INT8 for feed-forward) to balance quality and performance.
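As an illustration of how these techniques look in practice, here is a hedged sketch of a QLoRA-style setup with Hugging Face transformers, bitsandbytes, and peft: the frozen base model is loaded in 4-bit NF4 and a small LoRA adapter is attached for fine-tuning. The model name, target module names, and hyperparameters are placeholders, and exact argument names can vary across library versions.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works

# 4-bit NF4 quantization of the frozen base weights (QLoRA recipe).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,         # second quantization of the scales
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in 16-bit
)

model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Small trainable LoRA adapter on top of the quantized base model.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # assumed attention projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```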
- FP16: Default choice for production GPU deployments. Negligible quality loss, 2x memory savings, widely supported.
- INT8: Excellent for high-throughput API services on modern GPUs (A100, H100). ~4x memory savings with <1% quality degradation when done properly.
- INT4: Enables running frontier models locally or fitting more models per GPU in production. Expect 5-10% quality degradation but evaluate on your specific use case.
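For the FP16 and INT8 recommendations above, the loading step is often a one-line change. A hedged sketch using transformers with bitsandbytes (the model name is a placeholder, and 8-bit loading assumes the bitsandbytes package is installed):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder

# FP16: default production choice on modern GPUs.
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# INT8: ~4x smaller weights via bitsandbytes 8-bit quantization.
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```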
Quantization is essential for sustainable AI deployment—it reduces the environmental impact of AI infrastructure, lowers barriers to entry for developers, and enables privacy-preserving local deployment. As models continue growing, quantization will remain a critical technique for making cutting-edge AI accessible and economically viable.