When Fewer Layers Break More Chains: Layer Pruning Harms Test-Time Scaling in LLMs
- URL: http://arxiv.org/abs/2510.22228v1
- Date: Sat, 25 Oct 2025 09:22:22 GMT
- Title: When Fewer Layers Break More Chains: Layer Pruning Harms Test-Time Scaling in LLMs
- Authors: Keyu Wang, Tian Lyu, Guinan Su, Jonas Geiping, Lu Yin, Marco Canini, Shiwei Liu
- Abstract summary: We study the impact of layer pruning on long-chain reasoning through the lens of test-time scaling. We demonstrate that pruning even one or two layers can severely impair test-time scaling. These findings call for a rethinking of layer pruning strategies.
- Score: 40.79077285268906
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Layer pruning has emerged as a widely adopted technique for improving the efficiency of large language models (LLMs). Although existing methods demonstrate strong performance retention on general knowledge tasks, their effect on long-chain reasoning, a more brittle yet crucial capability, remains largely unexplored. In this work, we study the impact of layer pruning on long-chain reasoning through the lens of test-time scaling, a key mechanism in modern LLMs that enables strong reasoning capacity by allocating more computation at inference time. With extensive experiments, we demonstrate that pruning even one or two layers can severely impair test-time scaling, with performance collapsing drastically on long reasoning benchmarks even when performance on knowledge-intensive and shallow reasoning tasks remains stable. Furthermore, we find that standard supervised fine-tuning remedies fail to recover test-time scaling once it has deteriorated. Through in-depth analyses, we identify the mechanisms underlying this fragility of test-time scaling and highlight the fundamental risks of applying layer pruning to reasoning-intensive LLMs. These findings call for a rethinking of layer pruning strategies and provide insights for developing methods that preserve the robustness of reasoning. We open-source the codebase at https://github.com/keyu-wang-2002/Layer-Pruning-Harms-Inference-Scaling.
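To make the setting concrete, below is a minimal, illustrative sketch of the two mechanisms the abstract combines: removing decoder layers from a HuggingFace-style causal LM, then allocating more inference compute by sampling multiple reasoning traces. The model name, the contiguous mid-stack pruning heuristic, and the repeated-sampling routine are assumptions for illustration, not the paper's exact procedure.

```python
# Illustrative sketch only (not the paper's exact pipeline): drop decoder
# layers from a HuggingFace-style causal LM, then scale test-time compute
# by sampling multiple reasoning traces.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # hypothetical model choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def prune_layers(model, drop):
    """Remove the decoder layers at the given indices, in place."""
    keep = [l for i, l in enumerate(model.model.layers) if i not in set(drop)]
    for i, layer in enumerate(keep):
        layer.self_attn.layer_idx = i  # reindex so KV-cache slots stay contiguous
    model.model.layers = torch.nn.ModuleList(keep)
    model.config.num_hidden_layers = len(keep)
    return model

# The paper's finding concerns exactly this regime: dropping even one or two
# layers can break test-time scaling. Here we remove two mid-stack layers.
mid = model.config.num_hidden_layers // 2
model = prune_layers(model, drop=[mid, mid + 1])

def sample_traces(prompt, n_samples=8, max_new_tokens=512):
    """Test-time scaling via repeated sampling: more samples = more compute."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.7,
        num_return_sequences=n_samples,
        max_new_tokens=max_new_tokens,
        pad_token_id=tok.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    return [tok.decode(o[prompt_len:], skip_special_tokens=True) for o in out]
```

Sweeping n_samples on a long-reasoning benchmark before and after pruning is one way to reproduce the qualitative effect the abstract describes: accuracy on knowledge-intensive tasks may stay flat while the accuracy-versus-compute curve on reasoning tasks collapses.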
Related papers
- On the Limits of Layer Pruning for Generative Reasoning in LLMs [0.5437050212139086]
Layer pruning can compress large language models (LLMs) while retaining strong performance on classification benchmarks with little or no finetuning. We find that tasks requiring multi-step reasoning are particularly sensitive to depth reduction. Under realistic post-training constraints, we evaluate a simple mitigation strategy based on supervised finetuning.
arXiv Detail & Related papers (2026-02-02T11:57:22Z) - Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling [38.27469349005585]
Test-time scaling is a powerful paradigm for enhancing the reasoning capabilities of large language models, but it is inherently inefficient due to the generation of redundant and repetitive reasoning traces. We introduce the first comprehensive benchmark designed to evaluate speculative decoding methods for accelerating test-time scaling.
arXiv Detail & Related papers (2025-08-30T01:54:55Z) - Does More Inference-Time Compute Really Help Robustness? [50.47666612618054]
We show that small-scale, open-source models can benefit from inference-time scaling. We identify an important security risk, intuitively motivated and empirically verified as an inverse scaling law. We urge practitioners to carefully weigh these subtle trade-offs before applying inference-time scaling in security-sensitive, real-world applications.
arXiv Detail & Related papers (2025-07-21T18:08:38Z) - Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training [121.5858973157225]
We investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains. Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks.
arXiv Detail & Related papers (2025-07-16T17:59:24Z) - Activation Control for Efficiently Eliciting Long Chain-of-thought Ability of Language Models [45.938663388013445]
We show that a small set of high-impact activations in the last few layers governs long-form reasoning attributes. By simply amplifying these activations and inserting "wait" tokens, we can invoke the long CoT ability without any training.
arXiv Detail & Related papers (2025-05-23T10:07:18Z) - Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space [82.75174050101108]
We introduce LatentSeek, a framework that enhances reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024. Results show that LatentSeek consistently outperforms strong baselines.
arXiv Detail & Related papers (2025-05-19T16:26:02Z) - Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory [79.63672515243765]
In this paper, we focus on a standard and realistic scaling setting: majority voting (a minimal sketch follows this list). We show that as the sampling times and computational overhead increase, complicated prompting strategies with superior initial performance gradually fall behind simple Chain-of-Thought. We propose a probabilistic method to efficiently predict scaling performance and identify the best prompting strategy under large sampling times.
arXiv Detail & Related papers (2025-05-16T08:28:57Z) - Adversarial Reasoning at Jailbreaking Time [49.70772424278124]
Large language models (LLMs) are becoming more capable and widespread. Recent advances in standardizing, measuring, and scaling test-time compute suggest new methodologies for optimizing models to achieve high performance on hard tasks. In this paper, we apply these advances to the task of model jailbreaking: eliciting harmful responses from aligned LLMs.
arXiv Detail & Related papers (2025-02-03T18:59:01Z) - Understanding the Effectiveness of Coverage Criteria for Large Language Models: A Special Angle from Jailbreak Attacks [10.909463767558023]
Large language models (LLMs) have revolutionized artificial intelligence, but their deployment across critical domains has raised concerns about their abnormal behaviors when faced with malicious attacks. In this paper, we conduct a comprehensive empirical study to evaluate the effectiveness of traditional coverage criteria in identifying such inadequacies. We develop a real-time jailbreak detection mechanism that achieves high accuracy (93.61% on average) in classifying queries as normal or jailbreak.
arXiv Detail & Related papers (2024-08-27T17:14:21Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations reveal a language model's comprehensive grasp of language through its proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
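As a companion to the majority-voting entry above, here is a minimal self-consistency sketch, assuming a final answer can be parsed from each sampled completion. extract_answer is a hypothetical helper; real pipelines typically parse a \boxed{} span or a designated final-answer line.

```python
# Minimal majority-voting (self-consistency) sketch: sample N reasoning
# chains, extract a final answer from each, and return the most common one.
from collections import Counter

def extract_answer(completion: str) -> str:
    """Hypothetical parser: treat the last non-empty line as the final answer."""
    return completion.strip().splitlines()[-1].strip()

def majority_vote(completions: list[str]) -> str:
    """Return the most frequent extracted answer across sampled chains."""
    votes = Counter(extract_answer(c) for c in completions)
    return votes.most_common(1)[0][0]

# Scaling test-time compute here just means sampling more completions.
samples = ["...reasoning...\n42", "...reasoning...\n41", "...reasoning...\n42"]
print(majority_vote(samples))  # -> 42
```

Increasing the number of sampled completions is the "sampling times" axis that entry analyzes; the main paper's result is that layer pruning can flatten or invert the accuracy gains normally obtained along this axis.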