Efficient Large Language Model Compression via Post-Training Quantization and Knowledge Distillation
DOI:
https://doi.org/10.71465/fapm339
Keywords:
Large Language Models, Model Compression, Post-Training Quantization, Knowledge Distillation, Efficient NLP
Abstract
The proliferation of Large Language Models (LLMs) has revolutionized natural language processing, yet their colossal size and computational demands pose significant barriers to deployment, particularly in resource-constrained environments. Model compression has emerged as a critical field to mitigate these challenges. This paper investigates a hybrid compression strategy that synergistically combines Post-Training Quantization (PTQ) and Knowledge Distillation (KD). The primary objective is to develop a framework that significantly reduces the memory footprint and inference latency of LLMs while preserving their task performance to the greatest extent possible. We propose a sequential methodology termed Distillation-Quantization Fusion (DQF), wherein a smaller "student" model is first trained to mimic the output distributions of a larger "teacher" LLM through knowledge distillation. Subsequently, the distilled student model undergoes an aggressive 4-bit post-training quantization. This study presents an empirical analysis based on a simulated framework, evaluating the compressed models on a suite of standard natural language understanding benchmarks. Our findings indicate that the DQF approach achieves a superior trade-off between model size and performance compared to standalone PTQ or KD. The distilled-then-quantized model demonstrates only a marginal performance degradation relative to the original teacher model but offers a compression ratio exceeding 15x. This research underscores the efficacy of combining distinct compression paradigms to create highly efficient and deployable LLMs, thereby contributing to the democratization of advanced AI capabilities.
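To make the two stages of the proposed DQF pipeline concrete, the following is a minimal sketch, not the authors' released code, assuming a standard PyTorch setup: a temperature-scaled distillation loss that trains the student to mimic the teacher's output distribution, followed by naive symmetric per-channel 4-bit post-training weight quantization. Names such as `temperature` and the quantization granularity are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of the two DQF stages (assumed implementation, not the paper's code).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student output distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes are comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

def quantize_weight_4bit(weight: torch.Tensor):
    """Symmetric per-output-channel 4-bit post-training weight quantization."""
    qmax = 7  # symmetric signed 4-bit range [-7, 7]
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(weight / scale), -qmax, qmax)
    return q.to(torch.int8), scale  # stored in int8 here; only 4 bits are used

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

if __name__ == "__main__":
    # Stage 1: distillation loss on a toy batch of logits.
    student_logits = torch.randn(8, 32000, requires_grad=True)
    teacher_logits = torch.randn(8, 32000)
    loss = distillation_loss(student_logits, teacher_logits)
    loss.backward()

    # Stage 2: quantize a (distilled) weight matrix and measure reconstruction error.
    w = torch.randn(1024, 1024)
    q, s = quantize_weight_4bit(w)
    print("mean reconstruction error:", (dequantize(q, s) - w).abs().mean().item())
```

In the DQF ordering described above, the distillation step runs to completion first; the quantizer is then applied post hoc to the student's weights, so no quantization-aware retraining is assumed.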
License
Copyright (c) 2025 Zhi Ci (Author)

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.