Understanding and Guiding Layer Placement in Parameter-Efficient Fine-Tuning of Large Language Models
- URL: http://arxiv.org/abs/2602.04019v2
- Date: Sun, 08 Feb 2026 01:19:37 GMT
- Title: Understanding and Guiding Layer Placement in Parameter-Efficient Fine-Tuning of Large Language Models
- Authors: Yichen Xu, Yuyang Liang, Shan Dai, Tianyang Hu, Tsz Nam Chan, Chenhao Ma
- Abstract summary: Large language models (LLMs) continue to grow, making parameter-efficient fine-tuning (PEFT) the default strategy for downstream adaptation. Current practice typically applies PEFT uniformly across all layers, with limited understanding or leverage of layer selection. This paper develops a unified projected residual view of PEFT on top of a frozen base model.
- Score: 19.448467763421707
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) continue to grow, the cost of full-parameter fine-tuning has made parameter-efficient fine-tuning (PEFT) the default strategy for downstream adaptation. Constraints from inference latency in scalable serving and fine-tuning cost in edge or rapid-deployment settings make the choice of which layers to fine-tune unavoidable. Yet current practice typically applies PEFT uniformly across all layers, with limited understanding or leverage of layer selection. This paper develops a unified projected residual view of PEFT on top of a frozen base model. Under a local quadratic approximation, layerwise adaptation is governed by three quantities: (i) the projected residual norm (resnorm), which measures how much correctable bias a layer can capture; (ii) the activation energy, which determines feature conditioning; and (iii) layer coupling, which quantifies how strongly residuals interact across layers. We show that, for squared loss and linear adapters, the resnorm equals a normalized gradient norm, activation energy controls ill-conditioning and noise amplification, and weak coupling yields approximately additive layerwise contributions. Building on these insights, we introduce the Layer Card, a reusable diagnostic that summarizes residual signal strength, compute cost, and performance for each layer of a given model. With an identical model and LoRA configuration, Layer Card-guided placement refines the choice of adapted layers to flexibly prioritize different objectives, such as maximizing performance or reducing fine-tuning cost. Moreover, on Qwen3-8B, we show that selectively adapting a subset of layers can achieve performance close to full-layer LoRA while substantially reducing fine-tuning cost and the number of adapter-augmented layers during inference, offering a more cost-performance-aware alternative to full-layer insertion.
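The abstract's local quadratic view suggests that a layer's suitability for adaptation can be read off its activations and residuals. The sketch below is an illustrative approximation under the paper's squared-loss, linear-adapter setting, not the authors' code: it computes a gradient-based resnorm proxy, the activation energy, and a conditioning ratio for one layer. All function and variable names (`layer_card_stats`, `H`, `R`) are hypothetical.

```python
import numpy as np

def layer_card_stats(H, R):
    """Illustrative per-layer diagnostics under a squared-loss,
    linear-adapter approximation (names and normalizations are our
    own, not the paper's exact estimators).

    H : (n, d) activations entering the layer's adapter
    R : (n, k) residuals (targets minus frozen-model predictions)
    """
    n = H.shape[0]
    # Gradient of the squared loss w.r.t. a linear adapter at this layer.
    G = H.T @ R / n
    # Activation energy: mean squared activation magnitude.
    energy = np.linalg.norm(H) ** 2 / n
    # Normalizing the gradient norm by the activation scale gives a
    # resnorm-style score: how much correctable bias this layer sees.
    resnorm = np.linalg.norm(G) / np.sqrt(energy + 1e-12)
    # Conditioning of the layer's features: singular-value ratio of H.
    s = np.linalg.svd(H, compute_uv=False)
    cond = s[0] / max(s[-1], 1e-12)
    return {"resnorm": resnorm, "energy": energy, "cond": cond}
```

Ranking layers by such a resnorm proxy, while down-weighting layers with poor conditioning, is one plausible way a Layer Card-style diagnostic could prioritize where to place adapters.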
Related papers
- Curvature-Weighted Capacity Allocation: A Minimum Description Length Framework for Layer-Adaptive Large Language Model Optimization [8.029535985033485]
Layer-wise capacity in large language models is non-uniform; some layers contribute disproportionately to loss reduction while others are near-redundant. Existing methods for exploiting this non-uniformity, such as influence-function-based layer scoring, produce sensitivity estimates but offer no principled mechanism for translating them into allocation or pruning decisions. We address this gap with a unified, curvature-aware framework grounded in the Minimum Description Length (MDL) principle.
arXiv Detail & Related papers (2026-03-01T04:14:15Z) - Pruning as a Cooperative Game: Surrogate-Assisted Layer Contribution Estimation for Large Language Models [17.818685759025207]
Layer-wise pruning is a commonly employed strategy to mitigate inference costs. This paper proposes a game-theoretic framework that formulates layer pruning as a cooperative game. It achieves more efficient and effective layer-wise pruning for large language models.
arXiv Detail & Related papers (2026-02-08T03:51:36Z) - Distilling to Hybrid Attention Models via KL-Guided Layer Selection [66.06591032073744]
This paper describes a simple and efficient recipe for layer selection that uses layer importance scores derived from a small amount of training on generic text data. We find that this approach is more effective than existing layer-selection methods, including those that uniformly interleave linear attentions at a fixed ratio.
arXiv Detail & Related papers (2025-12-23T18:12:22Z) - The Structural Scalpel: Automated Contiguous Layer Pruning for Large Language Models [33.90597962418094]
We propose CLP, a novel contiguous layer pruning framework for large language models. CLP uses a differentiable concave gate algorithm that automatically identifies the best contiguous layer segments for pruning. CLP can be seamlessly combined with quantization to further compress the model with only a slight performance loss.
arXiv Detail & Related papers (2025-10-25T16:40:17Z) - Hierarchical LoRA MoE for Efficient CTR Model Scaling [56.608809143548946]
HiLoMoE is a hierarchical LoRA MoE framework that enables holistic scaling in a parameter-efficient manner. Unlike conventional stacking, HiLoMoE routes based on prior-layer scores rather than outputs, allowing all layers to execute in parallel.
arXiv Detail & Related papers (2025-10-12T03:54:11Z) - FLoE: Fisher-Based Layer Selection for Efficient Sparse Adaptation of Low-Rank Experts [47.35092228595656]
FLoE is a novel PEFT framework that introduces two key innovations: (i) a Fisher information-guided importance scoring mechanism to dynamically identify task-critical transformer layers for MoE-based low-rank adaptation, enabling sparse adapter deployment; and (ii) a Bayesian optimization-driven rank allocator that automatically determines optimal LoRA ranks on specific datasets without exhaustive grid search. Experiments across diverse LLMs and benchmarks reveal that FLoE achieves impressive efficiency-accuracy trade-offs, making it particularly advantageous in resource-constrained environments that necessitate rapid adaptation.
arXiv Detail & Related papers (2025-05-31T10:27:08Z) - Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective [55.90119819642064]
We address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) through a theoretical perspective. A key obstacle is the cumulative effect of reconstruction errors throughout the sparsification process. We derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue.
arXiv Detail & Related papers (2025-02-20T17:51:10Z) - Adaptive Layer Selection for Efficient Vision Transformer Fine-Tuning [18.776903525210933]
We introduce an efficient fine-tuning method for ViTs called ALaST (Adaptive Layer Selection Fine-Tuning for Vision Transformers).
Our approach is based on the observation that not all layers are equally critical during fine-tuning, and their importance varies depending on the current mini-batch.
We show that this adaptive compute allocation enables a nearly-optimal schedule for distributing computational resources.
arXiv Detail & Related papers (2024-08-16T11:27:52Z) - The Unreasonable Ineffectiveness of the Deeper Layers [5.984361440126354]
We find that removing individual layers does not affect model performance on common question-answering benchmarks. Surprisingly, with this method we find minimal degradation of performance until after a large fraction of the layers are removed. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge.
arXiv Detail & Related papers (2024-03-26T17:20:04Z) - Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference [12.371152982808914]
We introduce the Adaptive Computation Module (ACM), a generic module that dynamically adapts its computational load to match the estimated difficulty of the input on a per-token basis. An ACM consists of a sequence of learners that progressively refine the output of their preceding counterparts; an additional gating mechanism determines the optimal number of learners to execute for each token. Our evaluation of transformer models in computer vision and speech recognition demonstrates that substituting layers with ACMs significantly reduces inference costs without degrading the downstream accuracy for a wide interval of user-defined budgets.
arXiv Detail & Related papers (2023-12-15T20:39:43Z) - ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models [70.45441031021291]
Large Vision-Language Models (LVLMs) can understand the world comprehensively by integrating rich information from different modalities.
However, LVLMs incur massive computational and energy costs and a large carbon footprint.
We propose Efficient Coarse-to-Fine LayerWise Pruning (ECoFLaP), a two-stage coarse-to-fine weight pruning approach for LVLMs.
arXiv Detail & Related papers (2023-10-04T17:34:00Z) - Layer-adaptive sparsity for the Magnitude-based Pruning [88.37510230946478]
We propose a novel importance score for global pruning, coined layer-adaptive magnitude-based pruning (LAMP) score.
LAMP consistently outperforms popular existing schemes for layerwise sparsity selection.
arXiv Detail & Related papers (2020-10-15T09:14:02Z)
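The LAMP score mentioned above has a simple closed form: each weight's squared magnitude, normalized by the sum of squared magnitudes of all weights in the same layer that are at least as large. A minimal sketch of that formula follows; this is our own illustrative implementation, not the LAMP authors' code.

```python
import numpy as np

def lamp_scores(w):
    """LAMP score for one layer's weights: w_u^2 divided by the sum of
    w_v^2 over all weights v in the layer with |w_v| >= |w_u| (ties
    broken by sort order). Smaller scores are pruned first in global
    magnitude-based pruning."""
    w2 = w.ravel() ** 2
    order = np.argsort(w2)                 # ascending squared magnitude
    sorted_w2 = w2[order]
    # Suffix sums: for the i-th smallest weight, the sum of squared
    # magnitudes over itself and all larger weights in the layer.
    suffix = np.cumsum(sorted_w2[::-1])[::-1]
    scores = np.empty_like(w2)
    scores[order] = sorted_w2 / suffix
    return scores.reshape(w.shape)
```

Because each layer's scores are normalized by its own surviving mass, concatenating scores across layers and pruning the globally smallest ones yields a layer-adaptive sparsity pattern without per-layer tuning.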
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.