
Quantization

Simon Budziak, CTO
Quantization is a model compression technique that reduces the precision of numerical weights and activations in neural networks, dramatically decreasing model size and accelerating inference speed with minimal impact on quality. It is one of the most practical optimizations for deploying large language models in production environments, especially on consumer hardware or at scale.

In standard training, model parameters are stored as 32-bit floating-point numbers (FP32), providing high precision but consuming significant memory. Quantization converts these to lower-precision formats; a minimal conversion sketch follows the list:
  • FP16 (Half Precision): 16-bit floating point, cutting memory usage in half with negligible quality loss. Standard for modern GPU inference.
  • INT8 (8-bit Integer): Reduces size by 4x compared to FP32. Requires careful calibration but achieves excellent quality-speed trade-offs on specialized hardware.
  • INT4 (4-bit Integer): Reduces size by 8x with more noticeable quality degradation but enables running massive models (70B+) on consumer GPUs.
  • Binary/Ternary: Extreme quantization (1-2 bits) for edge devices, trading significant quality for maximum compression.
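To make the conversion concrete, here is a minimal sketch of symmetric, per-tensor INT8 quantization in NumPy. The function names, the single per-tensor scale, and the random example matrix are illustrative assumptions; production libraries typically use per-channel scales and calibration data to reduce error.

```python
# Minimal sketch of symmetric per-tensor INT8 quantization (illustrative only).
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map FP32 weights to INT8 using one symmetric scale for the whole tensor."""
    scale = np.abs(weights).max() / 127.0                    # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation of the original weights."""
    return q.astype(np.float32) * scale

weights = np.random.randn(4096, 4096).astype(np.float32)    # dummy weight matrix
q, scale = quantize_int8(weights)
recovered = dequantize_int8(q, scale)

print(f"memory: {weights.nbytes / 1e6:.0f} MB -> {q.nbytes / 1e6:.0f} MB")
print(f"mean absolute error: {np.abs(weights - recovered).mean():.6f}")
```

The round trip shrinks the tensor by 4x while the reconstruction error stays small relative to typical weight magnitudes, which is why INT8 works so well in practice.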
The benefits of quantization for LLM deployment are substantial:
  • Memory Reduction: A 70B-parameter model at FP32 requires ~280GB of memory for weights alone. At INT8, it fits in ~70GB; at INT4, just ~35GB, putting it within reach of high-end consumer hardware (see the back-of-envelope calculation after this list).
  • Speed Improvement: Lower precision arithmetic is computationally cheaper. INT8 inference can be 2-4x faster than FP32, and specialized hardware (like Tensor Cores) accelerates quantized operations further.
  • Cost Efficiency: Smaller models mean fewer servers, less memory, and lower cloud bills. For API providers, quantization directly impacts profitability.
  • Democratization: Enables running powerful models locally without expensive infrastructure, crucial for privacy-sensitive applications and offline use cases.
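The memory figures above follow directly from bytes per parameter. A quick back-of-envelope calculation, counting weights only and ignoring KV cache, activations, and runtime overhead:

```python
# Weights-only memory footprint of a 70B-parameter model at different precisions.
PARAMS = 70e9  # illustrative parameter count

bytes_per_param = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    print(f"{fmt}: {PARAMS * nbytes / 1e9:.0f} GB")
# FP32: 280 GB, FP16: 140 GB, INT8: 70 GB, INT4: 35 GB
```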
Modern quantization techniques go beyond simple number format conversion (a loading example follows the list):
  • Post-Training Quantization (PTQ): Quantize a trained model without retraining. Fast and simple but may degrade quality on very low precision (INT4).
  • Quantization-Aware Training (QAT): Train the model with quantization in mind, simulating low-precision during training to improve robustness. Produces higher-quality INT8/INT4 models.
  • GPTQ: Advanced post-training quantization specifically designed for LLMs, using calibration data to minimize quality loss. Widely used for 4-bit quantized models.
  • AWQ (Activation-aware Weight Quantization): Protects important weights from aggressive quantization based on activation patterns, preserving quality better than uniform quantization.
  • GGUF/GGML: Formats optimized for CPU inference with quantization, enabling LLM deployment on machines without GPUs.
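In practice, much of this is a configuration flag at load time. Below is a sketch of on-the-fly 4-bit quantization with Hugging Face transformers and bitsandbytes; the checkpoint name is a placeholder, and the NF4 settings shown are common defaults rather than universal recommendations.

```python
# Sketch: load a causal LM with 4-bit (NF4) weights via bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"       # placeholder; use any checkpoint you have access to

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4 bits while loading
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the data type used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in higher precision
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                      # spread layers across available devices
)
```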
Quantization is particularly impactful when combined with other techniques:
  • QLoRA: Combines 4-bit quantization with LoRA fine-tuning, allowing customization of huge models on a single GPU (e.g., fine-tuning a 65B-parameter model on one 48GB GPU); see the sketch after this list.
  • Mixed Precision: Use different precision levels for different layers (e.g., FP16 for attention layers, INT8 for feed-forward) to balance quality and performance.
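A rough QLoRA sketch, continuing from the 4-bit model loaded in the previous example: small LoRA adapters are attached on top of the frozen quantized weights, so only the adapters are trained. The rank, dropout, and target module names are assumptions that vary by architecture.

```python
# Sketch: attach LoRA adapters to the 4-bit model from the previous snippet (QLoRA-style).
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)  # freeze base weights, cast norms for stability

lora_config = LoraConfig(
    r=16,                                   # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # typical attention projections in Llama-style models
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only a tiny fraction of parameters is trainable
```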
Practical deployment considerations:
  • FP16: Default choice for production GPU deployments. Negligible quality loss, 2x memory savings, widely supported.
  • INT8: Excellent for high-throughput API services on modern GPUs (A100, H100). ~4x memory savings with <1% quality degradation when done properly.
  • INT4: Enables running frontier models locally or fitting more models per GPU in production. Expect 5-10% quality degradation but evaluate on your specific use case.
Tools like llama.cpp, vLLM, TensorRT-LLM, and bitsandbytes have made quantization accessible to developers without deep ML expertise. Many model hubs (Hugging Face, Ollama) provide pre-quantized versions of popular models, making deployment as simple as downloading a different file.
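For example, serving a pre-quantized checkpoint with vLLM is a few lines; the AWQ repository name below is a placeholder for any 4-bit AWQ model published on a model hub.

```python
# Sketch: run inference on a pre-quantized AWQ checkpoint with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")  # placeholder repo

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```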

Quantization is essential for sustainable AI deployment—it reduces the environmental impact of AI infrastructure, lowers barriers to entry for developers, and enables privacy-preserving local deployment. As models continue growing, quantization will remain a critical technique for making cutting-edge AI accessible and economically viable.

Ready to Build with AI?

Lubu Labs specializes in building advanced AI solutions for businesses. Let's discuss how we can help you leverage AI technology to drive growth and efficiency.