Efficient Large Language Model Compression via Post-Training Quantization and Knowledge Distillation

Authors

  • Zhi Ci, Peking University, Beijing 100871, China
  • Jianggu Xi, Peking University, Beijing 100871, China
  • Yongkan Zhou, Peking University, Beijing 100871, China

DOI:

https://doi.org/10.71465/fapm339

Keywords:

Large Language Models, Model Compression, Post-Training Quantization, Knowledge Distillation, Efficient NLP

Abstract

The proliferation of Large Language Models (LLMs) has revolutionized natural language processing, yet their colossal size and computational demands pose significant barriers to deployment, particularly in resource-constrained environments. Model compression has emerged as a critical field for mitigating these challenges. This paper investigates a hybrid compression strategy that synergistically combines Post-Training Quantization (PTQ) and Knowledge Distillation (KD). The primary objective is to develop a framework that significantly reduces the memory footprint and inference latency of LLMs while preserving their task performance to the greatest extent possible. We propose a sequential methodology termed Distillation-Quantization Fusion (DQF), wherein a smaller "student" model is first trained to mimic the output distributions of a larger "teacher" LLM through knowledge distillation. The distilled student model then undergoes aggressive 4-bit post-training quantization. This study presents an empirical analysis based on a simulated framework, evaluating the compressed models on a suite of standard natural language understanding benchmarks. Our findings indicate that the DQF approach achieves a superior trade-off between model size and performance compared to standalone PTQ or KD. The distilled-then-quantized model exhibits only marginal performance degradation relative to the original teacher model while offering a compression ratio exceeding 15x. This research underscores the efficacy of combining distinct compression paradigms to create highly efficient and deployable LLMs, thereby contributing to the democratization of advanced AI capabilities.
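
To make the two-stage pipeline described above concrete, the following is a minimal sketch in plain PyTorch of what distillation followed by 4-bit post-training quantization might look like. The function names, the temperature value, and the symmetric per-channel quantization scheme are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of the DQF pipeline as described in the abstract:
# (1) train a smaller student to match the teacher's output distribution via KD,
# (2) apply simple 4-bit post-training quantization to the student's weights.
# All names and hyperparameters here are assumptions for exposition.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * (temperature ** 2)


@torch.no_grad()
def quantize_weight_4bit(weight):
    """Symmetric per-output-channel 4-bit quantization (integer levels in [-8, 7])."""
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(weight / scale), -8, 7)
    return q * scale  # dequantized weights, for simulated (fake-quant) evaluation


@torch.no_grad()
def apply_ptq(model):
    """Replace every Linear layer's weights with their 4-bit quantized version."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            module.weight.copy_(quantize_weight_4bit(module.weight))
    return model


# Usage sketch (teacher and student are any compatible language models):
# loss = distillation_loss(student(input_ids), teacher(input_ids).detach())
# loss.backward(); optimizer.step()
# student = apply_ptq(student)   # quantize only after distillation converges
```

Applying quantization after distillation, as in this sketch, means the student's weights are already adapted to the teacher's output distribution before precision is reduced, which is the ordering the DQF methodology prescribes.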

Downloads

Download data is not yet available.

Published

2025-09-01