Curvature-Weighted Capacity Allocation: A Minimum Description Length Framework for Layer-Adaptive Large Language Model Optimization
- URL: http://arxiv.org/abs/2603.00910v1
- Date: Sun, 01 Mar 2026 04:14:15 GMT
- Title: Curvature-Weighted Capacity Allocation: A Minimum Description Length Framework for Layer-Adaptive Large Language Model Optimization
- Authors: Theophilus Amaefuna, Hitesh Vaidya, Anshuman Chhabra, Ankur Mali
- Abstract summary: Layer-wise capacity in large language models is non-uniform; some layers contribute disproportionately to loss reduction while others are near-redundant. Existing methods for exploiting this non-uniformity, such as influence-function-based layer scoring, produce sensitivity estimates but offer no principled mechanism for translating them into allocation or pruning decisions. We address this gap with a unified, curvature-aware framework grounded in the Minimum Description Length (MDL) principle.
- Score: 8.029535985033485
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Layer-wise capacity in large language models is highly non-uniform: some layers contribute disproportionately to loss reduction while others are near-redundant. Existing methods for exploiting this non-uniformity, such as influence-function-based layer scoring, produce sensitivity estimates but offer no principled mechanism for translating them into allocation or pruning decisions under hardware constraints. We address this gap with a unified, curvature-aware framework grounded in the Minimum Description Length (MDL) principle. Our central quantity is the curvature-adjusted layer gain $ζ_k^2 = g_k^\top \widetilde{H}_{kk}^{-1} g_k$, which we show equals twice the maximal second-order reduction in empirical risk achievable by updating layer $k$ alone, and which strictly dominates gradient-norm-based scores by incorporating local curvature. Normalizing these gains into layer quality scores $q_k$, we formulate two convex MDL programs: a capacity allocation program that distributes expert slots or LoRA rank preferentially to high-curvature layers under diminishing returns, and a pruning program that concentrates sparsity on low-gain layers while protecting high-gain layers from degradation. Both programs admit unique closed-form solutions parameterized by a single dual variable, computable in $O(K \log 1/\varepsilon)$ via bisection. We prove an $O(δ^2)$ transfer regret bound showing that source-domain allocations remain near-optimal on target tasks when curvature scores drift by $δ$, with explicit constants tied to the condition number of the target program. Together, these results elevate layer-wise capacity optimization from an empirical heuristic to a theoretically grounded, computationally efficient framework with provable optimality and generalization guarantees.
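A rough sketch of the two computational steps named in the abstract may help. For the quadratic model m(d) = g^T d + 0.5 d^T H d with positive-definite H, the minimizer is d* = -H^{-1} g and m(d*) = -0.5 g^T H^{-1} g, so g^T H^{-1} g is twice the largest second-order reduction. The Python below checks that identity numerically and then runs a bisection on a single dual variable. The abstract does not state the MDL objective, so the log-utility program in `allocate`, the `DAMPING` constant, the random test layers, and all function names are illustrative assumptions, not the authors' method.

```python
import numpy as np

DAMPING = 1e-6  # assumption: small ridge term so the curvature block is invertible


def layer_gain(g, H):
    """Curvature-adjusted gain zeta^2 = g^T H^{-1} g for a single layer."""
    H_damped = H + DAMPING * np.eye(H.shape[0])
    return float(g @ np.linalg.solve(H_damped, g))


def second_order_reduction(g, H):
    """Largest decrease of the quadratic model m(d) = g^T d + 0.5 d^T H d."""
    H_damped = H + DAMPING * np.eye(H.shape[0])
    d = -np.linalg.solve(H_damped, g)  # the Newton step minimizes m(d)
    return float(-(g @ d + 0.5 * d @ H_damped @ d))


def allocate(q, budget, tol=1e-10):
    """Stand-in allocation program (an assumption, not the paper's objective):

        maximize sum_k q_k * log(1 + r_k)  s.t.  sum_k r_k = budget, r_k >= 0.

    The KKT conditions give r_k(lam) = max(q_k / lam - 1, 0), so the whole
    solution depends on one dual variable lam, found here by bisection.
    """
    q = np.asarray(q, dtype=float)
    if budget <= 0 or q.max() <= 0:
        raise ValueError("budget and at least one score must be positive")
    lo, hi = 1e-12, q.max()  # at lam = q.max() every r_k is zero
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        total = np.maximum(q / lam - 1.0, 0.0).sum()
        if total > budget:
            lo = lam  # allocation too large: raise the dual variable
        else:
            hi = lam  # allocation fits: try a smaller dual variable
    return np.maximum(q / hi - 1.0, 0.0)


# Hypothetical per-layer gradients and positive-definite curvature blocks.
rng = np.random.default_rng(0)
layers = []
for _ in range(3):
    A = rng.normal(size=(4, 4))
    layers.append((rng.normal(size=4), A @ A.T + np.eye(4)))

gains = np.array([layer_gain(g, H) for g, H in layers])
q = gains / gains.sum()          # normalized layer quality scores
alloc = allocate(q, budget=8.0)  # e.g. total LoRA rank to distribute

print("gains:", gains)
print("scores:", q)
print("allocation:", alloc)
```

Under this stand-in objective, layers with larger scores receive more capacity and low-score layers can receive none, which mirrors the diminishing-returns behavior the abstract describes without reproducing its exact program.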
Related papers
- $\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space [71.23672814629448]
$\nabla$-Reasoner is an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop. $\nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark.
arXiv Detail & Related papers (2026-03-05T08:42:54Z)
- Understanding and Guiding Layer Placement in Parameter-Efficient Fine-Tuning of Large Language Models [19.448467763421707]
Large language models (LLMs) continue to grow, making parameter-efficient fine-tuning the default strategy for downstream adaptation. Current practice typically applies PEFT uniformly across all layers, with limited understanding or leverage of layer selection. This paper develops a unified projected residual view of PEFT on top of a frozen base model.
arXiv Detail & Related papers (2026-02-03T21:05:55Z)
- FLRQ: Faster LLM Quantization with Flexible Low-Rank Matrix Sketching [4.01326804806241]
We introduce Rank1-Sketch-based Flexible Rank Selection (R1-FLR) and Best Low-rank Approximation under Clipping (BLC). R1-FLR applies the R1-Sketch with Gaussian projection for fast low-rank approximation, enabling outlier-aware rank extraction for each layer. BLC aims at minimizing the low-rank quantization error under the scaling and clipping strategy.
arXiv Detail & Related papers (2026-01-09T10:06:45Z)
- The Structural Scalpel: Automated Contiguous Layer Pruning for Large Language Models [33.90597962418094]
We propose CLP, a novel continuous layer pruning framework for large language models. CLP uses a differentiable concave gate algorithm that automatically identifies the best continuous layer segments for pruning. CLP can be seamlessly combined with quantization to further compress the model with only a slight performance loss.
arXiv Detail & Related papers (2025-10-25T16:40:17Z)
- Differentiable Sparsity via $D$-Gating: Simple and Versatile Structured Penalization [22.883367233817836]
We show that $D$-Gating is theoretically equivalent to solving the original group sparsity problem. We validate our theory across vision, language, and tasks, where $D$-Gating consistently delivers strong performance-sparsity tradeoffs.
arXiv Detail & Related papers (2025-09-28T14:08:29Z)
- Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective [55.90119819642064]
We address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) through a theoretical perspective. This refers to the cumulative effect of reconstruction errors throughout the sparsification process. We derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue.
arXiv Detail & Related papers (2025-02-20T17:51:10Z)
- Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "linearity theorem" establishing a direct relationship between the layer-wise $\ell_2$ reconstruction error and the model perplexity increase due to quantization.
This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels.
arXiv Detail & Related papers (2024-11-26T15:35:44Z)
- Universal Online Learning with Gradient Variations: A Multi-layer Online Ensemble Approach [57.92727189589498]
We propose an online convex optimization approach with two different levels of adaptivity.
We obtain $\mathcal{O}(\log V_T)$, $\mathcal{O}(d \log V_T)$ and $\hat{\mathcal{O}}(\sqrt{V_T})$ regret bounds for strongly convex, exp-concave and convex loss functions.
arXiv Detail & Related papers (2023-07-17T09:55:35Z)
- Recursive greedy initialization of the quantum approximate optimization algorithm with guaranteed improvement [1.720510639137902]
Quantum approximate optimization algorithm (QAOA) is a variational quantum algorithm, where a quantum computer implements a variational ansatz consisting of $p$ layers of alternating unitary operators.
We present an analytic construction of $2p+1$ transition states for QAOA with $p+1$ layers that use the local minimum of QAOA with $p$ layers.
arXiv Detail & Related papers (2022-09-02T16:40:21Z)
- Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under the sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z)
- Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of OptimisticOA algorithm for nonconcave minmax problems.
Our experiments show that the advantage of adaptive over non-adaptive gradient algorithms in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.