Dynamic Structured Pruning for LLMs based on Real-Time Sensitivity Analysis
DOI: https://doi.org/10.71465/fair338

Keywords: Large Language Models, Structured Pruning, Model Compression, Sensitivity Analysis, Dynamic Sparsity

Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks, yet their immense size and computational requirements pose significant barriers to deployment, particularly in resource-constrained environments. Model pruning has emerged as a promising technique for compressing LLMs, but conventional static pruning methods, which apply a fixed sparsity mask, are often suboptimal for the dynamic and varied nature of real-world inputs. This paper introduces a novel framework for Dynamic Structured Pruning for LLMs based on Real-Time Sensitivity Analysis (DSGP). The core objective of this research is to develop a method that dynamically adapts the model's architecture at inference time by identifying and pruning less salient components based on their sensitivity to the specific input query. Our methodology involves a lightweight, real-time sensitivity analysis module that calculates the importance of structured components, such as attention heads and feed-forward network neurons, on a per-inference basis. A pruning mask is then generated and applied dynamically, resulting in a transient, input-specific sub-network. In a series of simulated experiments on benchmark models and datasets, the DSGP framework achieves up to a 40% reduction in floating-point operations (FLOPs) and a 30% decrease in inference latency compared to statically pruned models, while incurring a negligible performance degradation of less than 1% on the GLUE benchmark. This research establishes the viability of input-dependent dynamic pruning and offers a significant contribution towards deploying high-performance, computationally efficient LLMs on edge devices and in latency-sensitive applications.
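To make the idea of a per-inference pruning mask concrete, the sketch below illustrates input-dependent pruning of attention heads in a toy self-attention layer. It is not the authors' released code: the class name `DynamicallyPrunedAttention`, the `keep_ratio` hyperparameter, and the use of each head's output L2 norm as a sensitivity proxy are illustrative assumptions. For clarity the sketch zeroes out low-sensitivity heads after computing them; a real implementation would skip the pruned heads' computation entirely to realize the FLOP and latency savings described in the abstract.

```python
# Minimal sketch (assumptions labeled above): per-input sensitivity scoring of
# attention heads, followed by a transient, input-specific pruning mask.
import torch
import torch.nn as nn

class DynamicallyPrunedAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=8, keep_ratio=0.6):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.keep_ratio = keep_ratio  # fraction of heads kept for each input
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        # x: (batch, seq, d_model)
        b, s, d = x.shape
        qkv = self.qkv(x).view(b, s, 3, self.n_heads, self.d_head)
        q, k, v = qkv.unbind(dim=2)                       # each: (b, s, h, d_head)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (b, h, s, d_head)

        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        head_out = attn @ v                               # (b, h, s, d_head)

        # Sensitivity proxy for this input: mean L2 norm of each head's output.
        scores = head_out.norm(dim=-1).mean(dim=(0, 2))   # (h,)
        k_keep = max(1, int(self.keep_ratio * self.n_heads))
        keep = torch.zeros(self.n_heads, device=x.device)
        keep[scores.topk(k_keep).indices] = 1.0

        # Transient mask: pruned heads contribute nothing for this inference only.
        head_out = head_out * keep.view(1, -1, 1, 1)
        return self.out(head_out.transpose(1, 2).reshape(b, s, d))

# Usage: each forward pass recomputes the mask from the current input.
layer = DynamicallyPrunedAttention()
y = layer(torch.randn(2, 16, 64))
print(y.shape)  # torch.Size([2, 16, 64])
```

The same pattern extends to feed-forward network neurons by scoring intermediate activations and masking (or skipping) low-scoring columns of the projection matrices on a per-input basis.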
License
Copyright (c) 2025 Gouban Zhu, Chou Cui, Yao Zhang (Author)

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.