The Curse and Blessing of Mean Bias in FP4-Quantized LLM Training

arXiv AI 15 Jun 2026 2 min read

AI Insight

This study addresses instability in FP4 (4-bit floating point) quantized training of large language models. It identifies that dominant activation outliers stem largely from a coherent rank-one mean bias rather than from arbitrary sparse events. The researchers propose Averis, a mean-residual splitting quantization method that isolates this mean component before quantization, which makes low-precision training more stable. On Qwen3 0.6B Dense and Qwen3 7B A1.5B MoE, Averis reduced the loss gap relative to BF16 to 1.19%/0.81%, versus 2.05%/1.10% for NVIDIA's Hadamard-based outlier-smoothing method, with only 2.20% end-to-end overhead over vanilla NVFP4.

Why it matters

This work could reduce the memory and compute required to train large language models, making advanced AI development more accessible and energy-efficient. The method's low overhead and its complementarity with Hadamard-based smoothing provide a practical path toward 4-bit precision training in production environments.

Confidence

6/10 | Peer-reviewed | AI & Computational Science

arXiv:2603.10444v2
Abstract: FP4 training promises substantial memory and compute savings for large language models, but remains fragile because blockwise quantization is dictated by extreme activation magnitudes, which inflate dynamic range and compress long-tail signals. We identify a counterintuitive source of this failure: dominant activation outliers are not merely arbitrary sparse events, but are largely induced by a coherent rank-one mean bias, whose direction aligns with the leading anisotropic spectral component. This mean component strengthens during training, is amplified and reshaped by attention and FFN operators, and increasingly dominates top activation magnitudes. Crucially, this discovery reveals that a seemingly complex outlier-suppression problem admits a truly simple solution: isolate the coherent mean before quantization. We therefore propose Averis, a mean-residual splitting quantization method that separates the mean component using only reductions and elementwise subtractions before FP4 quantization. Across Qwen3 0.6B Dense trained on 100B tokens and Qwen3 7B A1.5B MoE trained on 50B tokens, Averis enables robust W4A4G4 FP4 training, reducing BF16 loss gaps to 1.19%/0.81% versus 2.05%/1.10% for NVIDIA’s recently released Hadamard-based outlier-smoothing method, while limiting downstream gaps to 0.89/0.71 points. With only 2.20% end-to-end overhead over vanilla NVFP4, about 30% of NVIDIA’s Hadamard-based design, Averis provides a hardware-efficient path to stable low-bit LLM training. Complementary to Hadamard, Averis further reduces the Qwen3-0.6B loss and downstream gaps to 0.94% and 0.73 points when combined. Code is available at: https://anonymous.4open.science/r/averis-504D.
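
The mean-residual idea described in the abstract can be illustrated with a small NumPy simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`fake_quant_fp4`, `mean_residual_quant`) are hypothetical, the mean is assumed to be taken across tokens for each hidden channel, block scales are kept in full precision rather than in a low-bit format, and the synthetic activations are invented purely to show the effect.

```python
import numpy as np

# Magnitudes representable by an E2M1 (FP4) value; the sign is handled separately.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_fp4(x, block=16):
    """Quantize then dequantize x blockwise along the last axis (simulation only).

    Assumes the last dimension is divisible by `block`.
    """
    shape = x.shape
    blocks = x.reshape(-1, block)
    # One scale per block, chosen so the block's absmax maps to the largest FP4 value.
    scale = np.abs(blocks).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale = np.where(scale == 0.0, 1.0, scale)  # avoid dividing by zero on all-zero blocks
    scaled = blocks / scale
    # Round each magnitude to the nearest representable grid point.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    dequant = np.sign(scaled) * FP4_GRID[idx] * scale
    return dequant.reshape(shape)


def mean_residual_quant(x, block=16):
    """Split off the per-channel mean, quantize only the residual, then add the mean back."""
    mu = x.mean(axis=0, keepdims=True)  # reduction over tokens (assumed axis)
    residual = x - mu                   # elementwise subtraction
    return fake_quant_fp4(residual, block) + mu  # mean stays in high precision


def rel_error(x, x_hat):
    """Relative Frobenius-norm reconstruction error."""
    return np.linalg.norm(x - x_hat) / np.linalg.norm(x)


# Synthetic activations: unit Gaussian noise plus a shared offset on a few channels,
# so that every 16-element block contains one mean-induced "outlier".
rng = np.random.default_rng(0)
tokens, hidden = 256, 64
offset = np.zeros(hidden)
offset[::16] = 20.0
x = rng.standard_normal((tokens, hidden)) + offset

err_plain = rel_error(x, fake_quant_fp4(x))
err_split = rel_error(x, mean_residual_quant(x))
print(f"plain FP4 error:         {err_plain:.4f}")
print(f"mean-residual FP4 error: {err_split:.4f}")
```

In this toy setting the shared offset sets each block's scale, so the remaining values are squeezed into the lowest FP4 levels; removing the mean first lets the scale follow the residual instead. For a matrix product the same split gives X W = 1 (mu^T W) + R W, where the first term is rank one and could be computed in higher precision; whether Averis arranges the computation this way is not stated in the abstract.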

Source: The Curse and Blessing of Mean Bias in FP4-Quantized LLM Training