Iterative Structured Pruning for Large Language Models with Multi-Domain Calibration
- URL: http://arxiv.org/abs/2601.02674v1
- Date: Tue, 06 Jan 2026 03:09:31 GMT
- Title: Iterative Structured Pruning for Large Language Models with Multi-Domain Calibration
- Authors: Guangxin Wu, Hao Zhang, Zhibin Zhang, Jiafeng Guo, Xueqi Cheng
- Abstract summary: Large Language Models (LLMs) have achieved remarkable success across a wide spectrum of natural language processing tasks. Their ever-growing scale introduces significant barriers to real-world deployment, including substantial computational overhead, memory footprint, and inference latency. In this work, we explore structured pruning, which eliminates entire architectural components and maintains compatibility with standard hardware accelerators.
- Score: 73.40887151631088
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have achieved remarkable success across a wide spectrum of natural language processing tasks. However, their ever-growing scale introduces significant barriers to real-world deployment, including substantial computational overhead, memory footprint, and inference latency. While model pruning presents a viable solution to these challenges, existing unstructured pruning techniques often yield irregular sparsity patterns that necessitate specialized hardware or software support. In this work, we explore structured pruning, which eliminates entire architectural components and maintains compatibility with standard hardware accelerators. We introduce a novel structured pruning framework that leverages a hybrid multi-domain calibration set and an iterative calibration strategy to effectively identify and remove redundant channels. Extensive experiments on various models across diverse downstream tasks show that our approach achieves significant compression with minimal performance degradation.
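The core loop the abstract describes (score channels on a hybrid multi-domain calibration set, remove the weakest, then re-calibrate and repeat) can be sketched in a few lines of PyTorch. The activation-energy importance score, the single-layer setting, and the fixed pruning schedule below are illustrative assumptions, not the paper's exact method.

```python
# Minimal sketch of iterative structured pruning with a hybrid multi-domain
# calibration set. The activation-energy importance score and the fixed
# pruning schedule are illustrative assumptions, not the paper's exact method.
import torch
import torch.nn as nn

def channel_importance(layer, calib_batches):
    """Score each output channel by mean squared activation, pooled over
    calibration batches drawn from all domains."""
    scores = torch.zeros(layer.out_features)
    with torch.no_grad():
        for x in calib_batches:
            scores += layer(x).pow(2).mean(dim=0)
    return scores

def drop_channels(layer, keep):
    """Rebuild the layer with only the kept output channels: true structured
    removal, so the result runs on standard hardware with no sparsity support."""
    pruned = nn.Linear(layer.in_features, len(keep), bias=layer.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(layer.weight[keep])
        if layer.bias is not None:
            pruned.bias.copy_(layer.bias[keep])
    return pruned

torch.manual_seed(0)
# hybrid calibration set: batches from several (here synthetic) domains, so
# importance estimates are not biased toward a single data distribution
calib = [torch.randn(32, 64) for _ in range(3)]

layer, target, rounds = nn.Linear(64, 128), 64, 4
per_round = (layer.out_features - target) // rounds
for _ in range(rounds):
    # iterative calibration: re-score after every round so importance
    # reflects the already-pruned network, not the original one
    scores = channel_importance(layer, calib)
    keep = torch.sort(torch.argsort(scores, descending=True)[:-per_round]).values
    layer = drop_channels(layer, keep)

print(layer)  # Linear(in_features=64, out_features=64, bias=True)
```

In a full model, every layer consuming these outputs would have its input channels sliced by the same indices (see the DepGraph entry below); re-scoring after each round is what distinguishes iterative calibration from one-shot pruning.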
Related papers
- Architecture-Aware Multi-Design Generation for Repository-Level Feature Addition [53.50448142467294]
RAIM is a multi-design and architecture-aware framework for repository-level feature addition. It shifts away from linear patching by generating multiple diverse implementation designs. Experiments on the NoCode-bench Verified dataset demonstrate that RAIM establishes new state-of-the-art performance.
arXiv Detail & Related papers (2026-03-02T12:50:40Z)
- Dynamic Generation of Multi-LLM Agents Communication Topologies with Graph Diffusion Models [99.85131798240808]
We introduce a novel generative framework called Guided Topology Diffusion (GTD). Inspired by conditional discrete graph diffusion models, GTD formulates topology synthesis as an iterative construction process. At each step, the generation is steered by a lightweight proxy model that predicts multi-objective rewards. Experiments show that GTD can generate highly task-adaptive, sparse, and efficient communication topologies.
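As a toy illustration of the proxy-steered construction loop only (GTD's conditional discrete graph-diffusion model is not reproduced here), the sketch below greedily adds inter-agent edges while a hypothetical `proxy_reward` trades task coverage against communication cost, stopping once the predicted reward no longer improves, which is what keeps the topology sparse.

```python
# Toy proxy-steered iterative topology construction. `proxy_reward` is a
# hypothetical stand-in for GTD's lightweight multi-objective proxy model.
import itertools

def proxy_reward(edges):
    """Reward agent coverage, charge a cost per communication edge."""
    coverage = len({v for e in edges for v in e})
    return coverage - 0.5 * len(edges)

agents = range(4)
candidates = set(itertools.combinations(agents, 2))
edges = set()
while True:
    remaining = sorted(candidates - edges)
    if not remaining:
        break
    best = max(remaining, key=lambda e: proxy_reward(edges | {e}))
    if proxy_reward(edges | {best}) <= proxy_reward(edges):
        break  # proxy predicts no further gain, so the topology stays sparse
    edges.add(best)  # one construction step, steered by the proxy
print(sorted(edges))  # e.g. [(0, 1), (2, 3)]
```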
arXiv Detail & Related papers (2025-10-09T05:28:28Z)
- OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation [91.45421429922506]
OneCAT is a unified multimodal model that seamlessly integrates understanding, generation, and editing. Our framework eliminates the need for external components such as Vision Transformers (ViT) or a vision tokenizer during inference.
arXiv Detail & Related papers (2025-09-03T17:29:50Z)
- Application-Specific Component-Aware Structured Pruning of Deep Neural Networks via Soft Coefficient Optimization [1.6874375111244326]
It remains critical to ensure that application-specific performance characteristics are preserved during compression. In structured pruning, where groups of structurally coherent elements are removed, conventional importance metrics frequently fail to maintain these essential performance attributes. We propose an enhanced importance metric framework that not only reduces model size but also explicitly accounts for application-specific performance constraints.
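A minimal sketch of soft coefficient optimization, under stated assumptions: each output channel receives a learnable gate, an application-specific loss (here, matching a stand-in teacher whose last four outputs are irrelevant) is balanced against an L1 term on the gates, and weakly gated channels are pruned afterwards. The gating form, regularizer weight, and threshold are illustrative choices, not the paper's formulation.

```python
# Toy component-aware pruning via soft coefficient optimization.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(256, 16)
teacher = nn.Linear(16, 8)
y = teacher(x).detach()
y[:, 4:] = 0.0  # outputs 4..7 are irrelevant to the application metric

layer = nn.Linear(16, 8)
gates = nn.Parameter(torch.ones(8))  # one soft coefficient per output channel
opt = torch.optim.Adam(list(layer.parameters()) + [gates], lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    # application loss keeps useful gates open; L1 pushes the rest toward zero
    loss = F.mse_loss(layer(x) * gates, y) + 0.05 * gates.abs().sum()
    loss.backward()
    opt.step()

keep = (gates.detach().abs() > 0.1).nonzero().squeeze(1)
print("kept channels:", keep.tolist())  # irrelevant channels typically collapse
```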
arXiv Detail & Related papers (2025-07-20T09:50:04Z)
- Tady: A Neural Disassembler without Structural Constraint Violations [14.794789423601552]
We introduce Tady, a novel neural disassembler featuring an improved model architecture and a dedicated post-processing algorithm. We show that Tady effectively eliminates structural constraint violations and functions with high efficiency, while maintaining instruction-level accuracy.
arXiv Detail & Related papers (2025-06-16T10:11:43Z)
- Model Hemorrhage and the Robustness Limits of Large Language Models [119.46442117681147]
Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment. We define this phenomenon as model hemorrhage: performance decline caused by parameter alterations and architectural changes.
arXiv Detail & Related papers (2025-03-31T10:16:03Z)
- Gatekeeper: Improving Model Cascades Through Confidence Tuning [45.46791873454989]
We introduce a novel loss function called Gatekeeper for calibrating smaller models in cascade setups. Our approach fine-tunes the smaller model to confidently handle tasks it can perform correctly while deferring complex tasks to the larger model.
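One plausible shape for such a confidence-tuning objective (a sketch under assumptions, not the paper's exact Gatekeeper loss): apply cross-entropy where the small model is already correct, and penalize softmax confidence where it is wrong, so that a simple confidence threshold at inference time defers exactly those inputs to the larger model.

```python
# Sketch of a confidence-tuning loss for a two-model cascade. The exact
# Gatekeeper formulation is not reproduced; this shows the intended effect:
# high confidence on solvable inputs, low confidence (deferral) elsewhere.
import torch
import torch.nn.functional as F

def confidence_tuning_loss(logits, targets, alpha=1.0):
    probs = F.softmax(logits, dim=-1)
    correct = probs.argmax(dim=-1).eq(targets)       # inputs the small model solves
    ce = F.cross_entropy(logits, targets, reduction="none")
    confidence = probs.max(dim=-1).values            # the deferral signal
    # be confident where correct; be unconfident (defer) where wrong
    return (ce * correct + alpha * confidence * ~correct).mean()

logits = torch.randn(8, 5, requires_grad=True)
targets = torch.randint(0, 5, (8,))
confidence_tuning_loss(logits, targets).backward()
# at inference: answer if confidence >= tau, otherwise route to the large model
```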
arXiv Detail & Related papers (2025-02-26T17:29:08Z)
- DepGraph: Towards Any Structural Pruning [68.40343338847664]
We study general structural pruning of arbitrary architectures such as CNNs, RNNs, GNNs, and Transformers.
We propose a general and fully automatic method, Dependency Graph (DepGraph), to explicitly model the dependency between layers and comprehensively group parameters for pruning.
In this work, we extensively evaluate our method on several architectures and tasks, including ResNe(X)t, DenseNet, MobileNet, and Vision Transformer for images, GAT for graphs, DGCNN for 3D point clouds, alongside LSTM for language, and demonstrate that, even with a simple norm-based criterion, the method consistently yields gratifying performance.
arXiv Detail & Related papers (2023-01-30T14:02:33Z)
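The grouping idea is easy to show on a two-layer MLP: removing hidden channels from the first layer forces matching column removals in the second, so both must be pruned as one group. The hand-written dependency table below stands in for the automatic dependency modeling that DepGraph performs.

```python
# Toy dependency-aware group pruning in the spirit of DepGraph: all layers
# coupled through the same hidden channels are pruned together, consistently.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
# hand-written dependency group: these (layer, axis) pairs share the 16
# hidden channels, so any channel removal must touch both of them
group = [(net[0], "out"), (net[2], "in")]

def prune_group(group, drop):
    """Remove hidden channels `drop` consistently from every coupled layer."""
    keep = [i for i in range(group[0][0].out_features) if i not in set(drop)]
    for layer, axis in group:
        if axis == "out":  # slice output rows and the bias
            layer.weight = nn.Parameter(layer.weight.data[keep])
            layer.bias = nn.Parameter(layer.bias.data[keep])
            layer.out_features = len(keep)
        else:              # axis == "in": slice input columns
            layer.weight = nn.Parameter(layer.weight.data[:, keep])
            layer.in_features = len(keep)

prune_group(group, drop=[0, 5, 9])
print(net)                           # hidden width 16 -> 13 in both layers
print(net(torch.randn(2, 8)).shape)  # forward still works: torch.Size([2, 4])
```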
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.