Related papers: On the Limits of Layer Pruning for Generative Reasoning in LLMs

On the Limits of Layer Pruning for Generative Reasoning in LLMs

URL: http://arxiv.org/abs/2602.01997v1
Date: Mon, 02 Feb 2026 11:57:22 GMT
Title: On the Limits of Layer Pruning for Generative Reasoning in LLMs
Authors: Safal Shrestha, Anubhav Shrestha, Aadim Nepal, Minwu Kim, Keith Ross,
Abstract summary: Layer pruning can compress large language models (LLMs) while retaining strong performance on classification benchmarks with little or no finetuning.<n>We find that tasks requiring multi-step reasoning are particularly sensitive to depth reduction.<n>Under realistic post-training constraints, we evaluate a simple mitigation strategy based on supervised finetuning.
Score: 0.5437050212139086
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent works have shown that layer pruning can compress large language models (LLMs) while retaining strong performance on classification benchmarks with little or no finetuning. However, existing pruning techniques often suffer severe degradation on generative reasoning tasks. Through a systematic study across multiple model families, we find that tasks requiring multi-step reasoning are particularly sensitive to depth reduction. Beyond surface-level text degeneration, we observe degradation of critical algorithmic capabilities, including arithmetic computation for mathematical reasoning and balanced parenthesis generation for code synthesis. Under realistic post-training constraints, without access to pretraining-scale data or compute, we evaluate a simple mitigation strategy based on supervised finetuning with Self-Generated Responses. This approach achieves strong recovery on classification tasks, retaining up to 90\% of baseline performance, and yields substantial gains of up to 20--30 percentage points on generative benchmarks compared to prior post-pruning techniques. Crucially, despite these gains, recovery for generative reasoning remains fundamentally limited relative to classification tasks and is viable primarily at lower pruning ratios. Overall, we characterize the practical limits of layer pruning for generative reasoning and provide guidance on when depth reduction can be applied effectively under constrained post-training regimes.

Related papers

Gradually Compacting Large Language Models for Reasoning Like a Boiling Frog [72.4168434368873]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, but their substantial size often demands significant computational resources.<n>We propose a gradual compacting method that divides the compression process into multiple fine-grained iterations.<n>This iterative approach-reminiscent of the "boiling frog" effect-enables the model to be progressively compressed without abrupt performance loss.
arXiv Detail & Related papers (2026-02-04T06:56:52Z)
Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training [76.12556589212666]
We show that curriculum post-training avoids the exponential complexity bottleneck.<n>Under outcome-only reward signals, reinforcement learning finetuning achieves high accuracy with sample complexity.<n>We establish guarantees for test-time scaling, where curriculum-aware querying reduces both reward oracle calls and sampling cost from exponential to order.
arXiv Detail & Related papers (2025-11-10T18:29:54Z)
PrunedLoRA: Robust Gradient-Based structured pruning for Low-rank Adaptation in Fine-tuning [18.3077556191671]
Low-rank adaptation (LoRA) has become a widely used paradigm for parameter-efficient fine-tuning of large language models.<n>We propose textitPrunedLoRA, a new framework that leverages structured pruning to obtain highly representative low-rank adapters.
arXiv Detail & Related papers (2025-09-30T19:10:35Z)
Nested-ReFT: Efficient Reinforcement Learning for Large Language Model Fine-Tuning via Off-Policy Rollouts [25.205293698698867]
We introduce Nested-ReFT, where a subset of layers of the target model acts as the behavior model to generate off-policy completions during training.<n>Our theoretical analysis shows that Nested-ReFT yields unbiased gradient estimates with controlled variance.<n>Our empirical analysis demonstrates improved computational efficiency measured as tokens/sec across multiple math reasoning benchmarks and model sizes.
arXiv Detail & Related papers (2025-08-13T18:37:46Z)
LARGO: Low-Rank Regulated Gradient Projection for Robust Parameter Efficient Fine-Tuning [39.56217775141507]
Low-rAnk Regulated Gradient Projection (LARGO) algorithm integrates dynamic constraints into low-rank adaptation methods.<n>LARGO achieves state-of-the-art performance across in-domain and out-of-distribution scenarios.
arXiv Detail & Related papers (2025-06-14T08:19:11Z)
The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws [51.608402959163925]
We present the first systematic exploration of optimal sparse pre-training configurations for large language models.<n>We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss.<n>We propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training.
arXiv Detail & Related papers (2025-01-21T20:23:22Z)
SLCA++: Unleash the Power of Sequential Fine-tuning for Continual Learning with Pre-training [68.7896349660824]
We present an in-depth analysis of the progressive overfitting problem from the lens of Seq FT. Considering that the overly fast representation learning and the biased classification layer constitute this particular problem, we introduce the advanced Slow Learner with Alignment (S++) framework. Our approach involves a Slow Learner to selectively reduce the learning rate of backbone parameters, and a Alignment to align the disjoint classification layers in a post-hoc fashion.
arXiv Detail & Related papers (2024-08-15T17:50:07Z)
Effective Layer Pruning Through Similarity Metric Perspective [0.0]
Deep neural networks have been the predominant paradigm in machine learning for solving cognitive tasks. Pruning structures from these models is a straightforward approach to reducing network complexity. Layer pruning often hurts the network predictive ability (i.e., accuracy) at high compression rates. This work introduces an effective layer-pruning strategy that meets all underlying properties pursued by pruning methods.
arXiv Detail & Related papers (2024-05-27T11:54:51Z)
The Unreasonable Ineffectiveness of the Deeper Layers [5.984361440126354]
We find that removing a certain layer does not affect model performance in common question-answering benchmarks.<n>Surprisingly, with this method we find minimal degradation of performance until after a large fraction of the layers are removed.<n>From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge.
arXiv Detail & Related papers (2024-03-26T17:20:04Z)
On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics. The unhinged loss allows for considering more practical techniques, such as time-vary learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
Fast Hierarchical Learning for Few-Shot Object Detection [57.024072600597464]
Transfer learning approaches have recently achieved promising results on the few-shot detection task. These approaches suffer from catastrophic forgetting'' issue due to finetuning of base detector. We tackle the aforementioned issues in this work.
arXiv Detail & Related papers (2022-10-10T20:31:19Z)
Back to Basics: Efficient Network Compression via IMP [22.586474627159287]
Iterative Magnitude Pruning (IMP) is one of the most established approaches for network pruning. IMP is often argued that it reaches suboptimal states by not incorporating sparsification into the training phase. We find that IMP with SLR for retraining can outperform state-of-the-art pruning-during-training approaches.
arXiv Detail & Related papers (2021-11-01T11:23:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.