Dynamic Structured Pruning for LLMs based on Real-Time Sensitivity Analysis

Authors

  • Gouban Zhu, Nanjing University, Nanjing 210093, China
  • Chou Cui, Nanjing University, Nanjing 210093, China
  • Yao Zhang, wylshcn@163.com

DOI:

https://doi.org/10.71465/fair338

Keywords:

Large Language Models, Structured Pruning, Model Compression, Sensitivity Analysis, Dynamic Sparsity

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks, yet their immense size and computational requirements pose significant barriers to deployment, particularly in resource-constrained environments. Model pruning has emerged as a promising technique for compressing LLMs, but conventional static pruning methods, which apply a single fixed sparsity mask to all inputs, are often suboptimal for the dynamic and varied nature of real-world queries. This paper introduces DSGP, a novel framework for dynamic structured pruning of LLMs based on real-time sensitivity analysis. The core objective of this research is to develop a method that adapts the model's architecture at inference time by identifying and pruning less salient components according to their sensitivity to the specific input query. Our methodology employs a lightweight, real-time sensitivity analysis module that scores the importance of structured components, such as attention heads and feed-forward network neurons, on a per-inference basis; a pruning mask is then generated and applied dynamically, yielding a transient, input-specific sub-network. Through a series of simulated experiments on benchmark models and datasets, our findings demonstrate that the DSGP framework can achieve up to a 40% reduction in floating-point operations (FLOPs) and a 30% decrease in inference latency compared to statically pruned models, while incurring a performance degradation of less than 1% on the GLUE benchmark. This research establishes the viability of input-dependent dynamic pruning and contributes toward deploying high-performance, computationally efficient LLMs on edge devices and in latency-sensitive applications.
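To make the per-inference mechanism described above concrete, the sketch below shows one plausible form such a sensitivity module could take in PyTorch. The abstract does not specify the scoring function or interface, so the DynamicHeadPruner class, the activation-norm proxy for sensitivity, and the keep_ratio parameter are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of input-dependent attention-head pruning (illustrative only).
import torch
import torch.nn as nn

class DynamicHeadPruner(nn.Module):
    """Scores attention heads on the current input and masks the least
    salient ones, producing a transient, input-specific sub-network."""

    def __init__(self, num_heads: int, keep_ratio: float = 0.6):
        super().__init__()
        self.num_heads = num_heads
        self.keep_ratio = keep_ratio  # fraction of heads retained per inference

    @torch.no_grad()
    def head_scores(self, head_outputs: torch.Tensor) -> torch.Tensor:
        # head_outputs: (batch, num_heads, seq_len, head_dim).
        # Proxy sensitivity: mean L2 norm of each head's output activations.
        return head_outputs.norm(dim=-1).mean(dim=(0, 2))  # -> (num_heads,)

    def forward(self, head_outputs: torch.Tensor) -> torch.Tensor:
        scores = self.head_scores(head_outputs)
        k = max(1, int(self.keep_ratio * self.num_heads))
        keep = torch.topk(scores, k).indices
        mask = torch.zeros(self.num_heads, device=head_outputs.device)
        mask[keep] = 1.0
        # Zero out pruned heads; a real implementation would skip their
        # computation entirely to realize the claimed FLOPs savings.
        return head_outputs * mask.view(1, -1, 1, 1)

# Usage: score and mask heads for one batch of hidden states.
x = torch.randn(2, 12, 128, 64)           # (batch, heads, seq, head_dim)
pruner = DynamicHeadPruner(num_heads=12, keep_ratio=0.5)
print(pruner(x).shape)                     # torch.Size([2, 12, 128, 64])
```

The same top-k masking scheme would extend to feed-forward neurons by scoring intermediate activations per column; the norm-based score here is just one cheap proxy, and the paper's actual sensitivity measure may differ.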

Published

2025-09-07