Related papers: STADE: Standard Deviation as a Pruning Metric

STADE: Standard Deviation as a Pruning Metric

URL: http://arxiv.org/abs/2503.22451v2
Date: Fri, 05 Sep 2025 14:22:32 GMT
Title: STADE: Standard Deviation as a Pruning Metric
Authors: Diego Coello de Portugal Mecke, Haya Alyoussef, Maximilian Stubbemann, Ilia Koloiarov, Tom Hanika, Lars Schmidt-Thieme,
Abstract summary: State-of-the-art pruning methods, such as Wanda, prune the model without retraining, making the pruning process faster and more efficient.<n>A theoretical analysis of the pruning problem reveals a common scenario in Machine Learning where Wanda is the optimal pruning method.<n>Experiments on Llama and Open Pre-trained Transformers (OPT) models validate these theoretical findings.
Score: 7.026746633056497
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, Large Language Models (LLMs) have become very widespread and are used to solve a wide variety of tasks. To successfully handle these tasks, LLMs require longer training times and larger model sizes. This makes LLMs ideal candidates for pruning methods that reduce computational demands while maintaining performance. Previous methods require a retraining phase after pruning to maintain the original model's performance. However, state-of-the-art pruning methods, such as Wanda, prune the model without retraining, making the pruning process faster and more efficient. Building upon Wanda's work, this study provides a theoretical explanation of why the method is effective and leverages these insights to enhance the pruning process. Specifically, a theoretical analysis of the pruning problem reveals a common scenario in Machine Learning where Wanda is the optimal pruning method. Furthermore, this analysis is extended to cases where Wanda is no longer optimal, leading to the development of a new method, STADE, based on the standard deviation of the input. From a theoretical standpoint, STADE demonstrates better generality across different scenarios. Finally, extensive experiments on Llama and Open Pre-trained Transformers (OPT) models validate these theoretical findings, showing that depending on the training conditions, Wanda's optimal performance varies as predicted by the theoretical framework. These insights contribute to a more robust understanding of pruning strategies and their practical implications. Code is available at: https://github.com/Coello-dev/STADE/

Related papers

Gradually Compacting Large Language Models for Reasoning Like a Boiling Frog [72.4168434368873]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, but their substantial size often demands significant computational resources.<n>We propose a gradual compacting method that divides the compression process into multiple fine-grained iterations.<n>This iterative approach-reminiscent of the "boiling frog" effect-enables the model to be progressively compressed without abrupt performance loss.
arXiv Detail & Related papers (2026-02-04T06:56:52Z)
From LLMs to LRMs: Rethinking Pruning for Reasoning-Centric Models [17.998434546981738]
Large language models (LLMs) are increasingly costly to deploy, motivating extensive research on model pruning.<n>We conduct a controlled study of pruning for both instruction-following ($textbfLLM-instruct$) and reasoning-augmented ($textbfLLM-think$) models.<n>We evaluate static depth pruning, static width pruning, and dynamic pruning across 17 tasks spanning classification, generation, and reasoning.
arXiv Detail & Related papers (2026-01-26T03:01:39Z)
How to Set the Learning Rate for Large-Scale Pre-training? [73.03133634525635]
We formalize this investigation into two distinct research paradigms: Fitting and Transfer.<n>Within the Fitting Paradigm, we introduce a Scaling Law for search factor, effectively reducing the search complexity from O(n3) to O(n*C_D*C_) via predictive modeling.<n>We extend the principles of $$Transfer to the Mixture of Experts (MoE) architecture, broadening its applicability to encompass model depth, weight decay, and token horizons.
arXiv Detail & Related papers (2026-01-08T15:55:13Z)
Breaking Expert Knowledge Limits: Self-Pruning for Large Language Models [21.22854931342453]
Large language models (LLMs) have achieved remarkable performance on a wide range of tasks, hindering real-world deployment due to their massive size.<n>Existing pruning methods rely heavily on manual design pruning algorithms, thereby leading to textithuge labor costs and textitrequires expert knowledge.<n>We propose a novel pruning method called textbfAutoPrune, which first overcomes expert knowledge limits by leveraging LLMs to design optimal pruning algorithms for themselves automatically without any expert knowledge.
arXiv Detail & Related papers (2025-11-19T12:38:21Z)
Z-Pruner: Post-Training Pruning of Large Language Models for Efficiency without Retraining [6.578456055730258]
Post-training pruning is a promising approach for reducing model size and inference latency without the need for retraining.<n>We introduce Z-Pruner, a novel post-training pruning method designed to induce sparsity in pretrained large language models without retraining.<n>Z-Pruner surpasses state-of-the-art pruning methods that require intensive weight updates.
arXiv Detail & Related papers (2025-08-18T16:19:22Z)
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning [75.31797502976802]
We evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks.<n>We find that most models that succeed in math fail to transfer their gains to other domains.<n>Our results suggest a need to rethink standard post-training recipes.
arXiv Detail & Related papers (2025-07-01T05:23:05Z)
LOP: Learning Optimal Pruning for Efficient On-Demand MLLMs Scaling [52.1366057696919]
LOP is an efficient neural pruning framework that learns optimal pruning strategies from the target pruning constraint.<n>LOP approach trains autoregressive neural networks (NNs) to directly predict layer-wise pruning strategies adaptive to the target pruning constraint.<n> Experimental results show that LOP outperforms state-of-the-art pruning methods in various metrics while achieving up to three orders of magnitude speedup.
arXiv Detail & Related papers (2025-06-15T12:14:16Z)
Improved Methods for Model Pruning and Knowledge Distillation [3.8993503758122663]
MAMA Pruning is a performance optimization technique for large language models like R1 or o3-mini.<n>It effectively reduces model size and computational complexity while maintaining performance comparable to the original unpruned model even at extreme pruned levels.<n>Preliminary experimental results show that our method outperforms and be comparable to state-of-the-art methods across various pruning levels and different downstream computational linguistics tasks.
arXiv Detail & Related papers (2025-05-20T07:53:40Z)
The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models [69.798277882245]
We introduce Unsupervised Prefix Fine-Tuning (UPFT) to enhance large language models' reasoning efficiency.<n>UPFT removes the need for labeled data or exhaustive sampling.<n> Experiments show that UPFT matches the performance of supervised methods.
arXiv Detail & Related papers (2025-03-04T18:56:03Z)
LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.<n>Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.<n>We propose textbfLESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
Beyond One-Size-Fits-All Pruning via Evolutionary Metric Search for Large Language Models [18.57876883968734]
We introduce textbftextscOptiShear, an efficient evolutionary optimization framework for adaptive LLM pruning.<n>Our framework features two key innovations: an effective search space built on our Meta pruning metric, and a model-wise reconstruction error for rapid evaluation.
arXiv Detail & Related papers (2025-02-15T09:17:38Z)
The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws [51.608402959163925]
We present the first systematic exploration of optimal sparse pre-training configurations for large language models. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. We propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training.
arXiv Detail & Related papers (2025-01-21T20:23:22Z)
Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models [42.124670377223175]
We propose a novel framework for inference acceleration called the Pruning All-Rounder (PAR)<n>With a self-supervised learning manner, our method achieves a superior balance between performance and efficiency. Notably, PAR is highly flexible, offering multiple pruning versions to address a range of pruning scenarios.
arXiv Detail & Related papers (2024-12-09T13:02:35Z)
P$^2$ Law: Scaling Law for Post-Training After Model Pruning [25.07013858614455]
Pruning has become a widely adopted technique for reducing the hardware requirements of large language models (LLMs)<n>To recover model performance after pruning, post-training is commonly employed to mitigate the resulting performance degradation.<n>To balance post-training cost and model performance, it is necessary to explore the optimal amount of post-training data.
arXiv Detail & Related papers (2024-11-15T15:28:42Z)
Pruning Foundation Models for High Accuracy without Retraining [48.256389781305415]
It is challenging to deploy foundation models or large language models (LLMs) due to their massive parameters and computations. Post-training pruning methods are proposed to prune LLMs in one-shot without retraining. Our experiments demonstrate the superior performance of the proposed methods in comparison to SOTA baselines.
arXiv Detail & Related papers (2024-10-21T01:23:34Z)
TSVD: Bridging Theory and Practice in Continual Learning with Pre-trained Models [103.45785408116146]
Continual learning (CL) aims to train a model that can solve multiple tasks presented sequentially.<n>Recent CL approaches have achieved strong performance by leveraging large pre-trained models that generalize well to downstream tasks.<n>However, such methods lack theoretical guarantees, making them prone to unexpected failures.<n>We aim to bridge this gap by designing a simple CL method that is theoretically sound and highly performant.
arXiv Detail & Related papers (2024-10-01T12:58:37Z)
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters [27.656263126925815]
We study the scaling of inference-time computation in LLMs. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt.
arXiv Detail & Related papers (2024-08-06T17:35:05Z)
Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning on Large-Language Models. We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis [27.310894780313618]
This paper undertakes a comprehensive comparison of model capabilities at various pretraining intermediate checkpoints. We confirm that specific downstream metrics exhibit similar training dynamics across models of different sizes. In addition to our core findings, we've reproduced Amber and OpenLLaMA, releasing their intermediate checkpoints.
arXiv Detail & Related papers (2024-04-01T16:00:01Z)
A Simple and Effective Pruning Approach for Large Language Models [58.716255689941896]
Large Languages Models (LLMs) are natural candidates for network pruning methods. Existing methods, however, require either retraining, or solving a weight reconstruction problem reliant on second-order information. We introduce a novel, straightforward yet effective pruning method, termed Wanda (Pruning by Weights and activations), designed to induce sparsity in pretrained LLMs.
arXiv Detail & Related papers (2023-06-20T17:18:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.