DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs

NeurIPS 2024 Oral



*Equal Contribution ^Corresponding Authors
1School of Artificial Intelligence, University of Chinese Academy of Sciences
2Tsinghua University 3NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences
4City University of Hong Kong 5Zhejiang University

Abstract

Quantization of large language models (LLMs) faces significant challenges, particularly due to the presence of outlier activations that impede efficient low-bit representation. Traditional approaches predominantly address Normal Outliers, which are activations across all tokens with relatively large magnitudes.

However, these methods struggle to smooth Massive Outliers, which display far larger values and cause severe performance degradation in low-bit quantization.

In this paper, we introduce DuQuant, a novel approach that employs rotation and permutation transformations to more effectively mitigate both massive and normal outliers. First, DuQuant constructs rotation matrices, using specific outlier dimensions as prior knowledge, to redistribute outliers to adjacent channels via block-wise rotation. Second, we employ a zigzag permutation to balance the distribution of outliers across blocks, thereby reducing block-wise variance. A subsequent rotation further smooths the activation landscape, enhancing model performance. DuQuant establishes new state-of-the-art baselines for 4-bit weight-activation quantization across various model types and downstream tasks.

Motivation

We first discover that massive outliers exist at the input of the down-projection layer within the FFN module. These outliers exhibit extremely large magnitudes and are confined to a limited number of tokens. We further observe that these massive outliers are not effectively managed by existing quantization methods (e.g., SmoothQuant), leading to significant performance degradation in low-bit weight-activation quantization. This motivates us to develop a novel approach that can better eliminate both massive and normal outliers.
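
To make this observation concrete, the sketch below records the largest activation magnitude entering each down-projection layer with a PyTorch forward pre-hook. It is a minimal illustration, not part of DuQuant: the checkpoint name and the model.model.layers[i].mlp.down_proj module path are assumptions about a LLaMA-style Hugging Face model.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # assumed checkpoint; any LLaMA-style model works
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

max_abs = {}  # per-layer maximum |activation| observed at the down_proj input

def make_hook(layer_idx):
    def hook(module, inputs):
        x = inputs[0].detach()  # (batch, seq_len, intermediate_size)
        max_abs[layer_idx] = max(max_abs.get(layer_idx, 0.0), x.abs().max().item())
    return hook

handles = [
    layer.mlp.down_proj.register_forward_pre_hook(make_hook(i))  # assumed module path
    for i, layer in enumerate(model.model.layers)
]

with torch.no_grad():
    batch = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
    model(**batch)

for h in handles:
    h.remove()

# Layers hit by massive outliers report maxima orders of magnitude above the typical scale.
print(sorted(max_abs.items(), key=lambda kv: -kv[1])[:5])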

Dual Transformation

We propose the Rotation and Permutation Transformation to redistribute outliers among adjacent channels and across different blocks. Figure (a) shows the sequential transformations on Normal Outliers: ① an initial rotation to reduce outliers within blocks, ② a permutation to evenly distribute outliers across blocks, and ③ a second rotation for further smoothing. Figure (b) presents the activation changes for Massive Outliers after applying DuQuant. Figure (c) gives a sample matrix that highlights the continual reduction of outliers through rotation and permutation, with outliers marked in dark blue.
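
The sketch below illustrates this pipeline on a single activation tensor. It is a simplified stand-in rather than the official implementation: the block-diagonal rotations are drawn at random via QR decomposition instead of being built from the outlier-dimension prior, and the helper names (block_rotation, zigzag_permutation, dual_transform) are ours, not DuQuant's.

import torch

def block_rotation(hidden_dim: int, block_size: int) -> torch.Tensor:
    # Block-diagonal orthogonal matrix: each block mixes its channels so that an
    # outlier channel spreads its magnitude over the whole block.
    blocks = []
    for _ in range(hidden_dim // block_size):
        q, _ = torch.linalg.qr(torch.randn(block_size, block_size))
        blocks.append(q)
    return torch.block_diag(*blocks)

def zigzag_permutation(act: torch.Tensor, block_size: int) -> torch.Tensor:
    # Assign channels to blocks in a zigzag (snake) order of descending mean
    # magnitude, so every block receives a similar share of large channels.
    hidden_dim = act.shape[-1]
    n_blocks = hidden_dim // block_size
    order = act.abs().mean(dim=tuple(range(act.dim() - 1))).argsort(descending=True)
    buckets = [[] for _ in range(n_blocks)]
    for rank, ch in enumerate(order.tolist()):
        cycle, pos = divmod(rank, n_blocks)
        b = pos if cycle % 2 == 0 else n_blocks - 1 - pos  # forward, then backward
        buckets[b].append(ch)
    return torch.tensor([ch for bucket in buckets for ch in bucket])

def dual_transform(act: torch.Tensor, block_size: int = 128) -> torch.Tensor:
    d = act.shape[-1]
    act = act @ block_rotation(d, block_size)      # step 1: rotate within blocks
    perm = zigzag_permutation(act, block_size)
    act = act[..., perm]                           # step 2: balance blocks via zigzag permutation
    return act @ block_rotation(d, block_size)     # step 3: second rotation for further smoothing

x = torch.randn(4, 16, 1024)
x[..., 7] *= 80.0                                  # synthetic outlier channel
print(x.abs().max().item(), dual_transform(x).abs().max().item())

In DuQuant itself, the rotations are constructed from the known outlier channels, and the inverse rotations and permutation are folded into the adjacent weight matrices so that the transformed network remains numerically equivalent; this sketch only shows the activation-side smoothing effect.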

Results

  • DuQuant achieves SoTA performance in PPL evaluation under W4A4 quantization.

  • DuQuant demonstrates robustness on LLaMA3-8B quantization.

  • DuQuant maintains performance comparable to FP16 models on the LongBench benchmark.

  • Visualization

    The above figure shows the activation distribution of the LLaMA2-70B model before and after applying DuQuant. More detailed examples of activation changes for other LLMs are presented in the Appendix of the paper. These visualizations demonstrate that DuQuant effectively redistributes outliers and smooths the activation landscape.

BibTeX

@article{lin2024duquant,
  title={DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs},
  author={Lin, Haokun and Xu, Haobo and Wu, Yichen and Cui, Jingzhi and Zhang, Yingtao and Mou, Linzhan and Song, Linqi and Sun, Zhenan and Wei, Ying},
  journal={arXiv preprint arXiv:2406.01721},
  year={2024}
}