Related papers: Allocation of Parameters in Transformers

URL: http://arxiv.org/abs/2510.03784v1
Date: Sat, 04 Oct 2025 11:22:16 GMT
Title: Allocation of Parameters in Transformers
Authors: Ruoxi Yu, Haotian Jiang, Jingpu Cheng, Penghao Yu, Qianxiao Li, Zhong Li
Abstract summary: We investigate how the model parameters -- mainly attention heads and head dimensions -- should be allocated across layers to balance expressivity and efficiency. We prove the saturation behavior of softmax activations, supported by both theory and experiments. We propose principled strategies for allocating attention heads and dimensions across Transformers' layers.
Score: 31.7433692306049
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Transformers have achieved remarkable successes across a wide range of applications, yet the theoretical foundation of their model efficiency remains underexplored. In this work, we investigate how the model parameters -- mainly attention heads and head dimensions -- should be allocated across layers to balance expressivity and efficiency. We first provide mathematical analysis on the role of early layers in information extraction from an approximation perspective, with a theoretical characterization on the trade-off between the number of heads and head dimension under a fixed parameter budget. In addition, we uncover and prove the saturation behavior of softmax activations: continuously increasing head dimensions can lead to diminishing returns in learning errors, particularly for long sequences. Supported by both theory and experiments, this saturation pattern suggests that later layers can operate more efficiently with reduced parameters. Combining these insights, we propose principled strategies for allocating attention heads and dimensions across Transformers' layers, shedding light on theoretically grounded model efficiency of Transformer-based architectures.

Related papers

Efficient Hyperparameter Tuning via Trajectory Invariance Principle [35.90572735438328]
We identify a phenomenon we call trajectory invariance, where pre-training loss curves, gradient noise, and gradient norm exhibit invariance--closely overlapping--with respect to a quantity that combines learning rate and weight decay. This phenomenon effectively reduces the original two-dimensional hyperparameter space to one dimension, yielding an efficient tuning rule. Overall, our work proposes new principles for efficient tuning and inspires future research on scaling laws.
arXiv Detail & Related papers (2025-09-29T17:01:19Z)
Weight Spectra Induced Efficient Model Adaptation [54.8615621415845]
Fine-tuning large-scale foundation models incurs prohibitive computational costs. We show that fine-tuning predominantly amplifies the top singular values while leaving the remainder largely intact. We propose a novel method that leverages learnable rescaling of top singular directions.
arXiv Detail & Related papers (2025-05-29T05:03:29Z)
Efficient Orthogonal Fine-Tuning with Principal Subspace Adaptation [43.719298075378425]
We propose efficient Orthogonal Fine-Tuning with Principal Subspace adaptation (PSOFT) for parameter-efficient fine-tuning. Experiments on 35 NLP and CV tasks demonstrate that PSOFT offers a practical and scalable solution to simultaneously achieve semantic preservation, expressiveness, and multi-dimensional efficiency in PEFT.
arXiv Detail & Related papers (2025-05-16T13:26:48Z)
Generalized Tensor-based Parameter-Efficient Fine-Tuning via Lie Group Transformations [50.010924231754856]
Adapting pre-trained foundation models for diverse downstream tasks is a core practice in artificial intelligence. To overcome this, parameter-efficient fine-tuning (PEFT) methods like LoRA have emerged and are becoming a growing research focus. We propose a generalization that extends matrix-based PEFT methods to higher-dimensional parameter spaces without compromising their structural properties.
arXiv Detail & Related papers (2025-04-01T14:36:45Z)
Towards Understanding the Optimization Mechanisms in Deep Learning [5.281849820329249]
In this paper, we adopt a distribution estimation perspective to explore the mechanisms of supervised classification using deep neural networks. For the latter, we provide theoretical insights into mechanisms such as over- and probability randomization.
arXiv Detail & Related papers (2025-03-29T08:46:13Z)
How Transformers Utilize Multi-Head Attention in In-Context Learning? A Case Study on Sparse Linear Regression [19.64743851296488]
In this study, we consider a sparse linear regression problem and investigate how a trained multi-head transformer performs in-context learning. We experimentally discover that the utilization of multi-heads exhibits different patterns across layers. We demonstrate that such a preprocess-then-optimize algorithm can significantly outperform naive gradient descent and ridge regression algorithms.
arXiv Detail & Related papers (2024-08-08T15:33:02Z)
See Further for Parameter Efficient Fine-tuning by Standing on the Shoulders of Decomposition [56.87609859444084]
Parameter-efficient fine-tuning (PEFT) focuses on optimizing a select subset of parameters while keeping the rest fixed, significantly lowering computational and storage overheads. We take the first step to unify all approaches by dissecting them from a decomposition perspective. We introduce two novel PEFT methods alongside a simple yet effective framework designed to enhance the performance of PEFT techniques across various applications.
arXiv Detail & Related papers (2024-07-07T15:44:42Z)
Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms as low-rank computation have impressive performance for learning Transformer-based adaption. We analyze how magnitude-based models affect generalization while improving adaption. We conclude that proper magnitude-based pruning has a slight effect on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z)
Neural network analysis of neutron and X-ray reflectivity data: Incorporating prior knowledge for tackling the phase problem [141.5628276096321]
We present an approach that utilizes prior knowledge to regularize the training process over larger parameter spaces. We demonstrate the effectiveness of our method in various scenarios, including multilayer structures with box model parameterization. In contrast to previous methods, our approach scales favorably when increasing the complexity of the inverse problem.
arXiv Detail & Related papers (2023-06-28T11:15:53Z)
Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on matrix product operator (MPO). MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts. Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
Analysis of Catastrophic Forgetting for Random Orthogonal Transformation Tasks in the Overparameterized Regime [9.184987303791292]
We show that in permuted MNIST image classification tasks, the performance of multilayer perceptrons trained by vanilla gradient descent can be improved. We provide a theoretical explanation of this effect by studying a qualitatively similar two-task linear regression problem. We show that when a model is trained on the two tasks in sequence without any additional regularization, the risk gain on the first task is small.
arXiv Detail & Related papers (2022-06-01T18:04:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.