Parameter Reduction Improves Vision Transformers: A Comparative Study of Sharing and Width Reduction
- URL: http://arxiv.org/abs/2512.01059v1
- Date: Sun, 30 Nov 2025 20:04:30 GMT
- Title: Parameter Reduction Improves Vision Transformers: A Comparative Study of Sharing and Width Reduction
- Authors: Anantha Padmanaban Krishna Kumar
- Abstract summary: We study two simple parameter-reduction strategies applied to ViT-B/16 on ImageNet-1K. Our findings suggest that architectural constraints such as parameter sharing and reduced width may act as useful inductive biases.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although scaling laws and many empirical results suggest that increasing the size of Vision Transformers often improves performance, model accuracy and training behavior are not always monotonically increasing with scale. Focusing on ViT-B/16 trained on ImageNet-1K, we study two simple parameter-reduction strategies applied to the MLP blocks, each removing 32.7% of the baseline parameters. Our GroupedMLP variant shares MLP weights between adjacent transformer blocks and achieves 81.47% top-1 accuracy while maintaining the baseline computational cost. Our ShallowMLP variant halves the MLP hidden dimension and reaches 81.25% top-1 accuracy with a 38% increase in inference throughput. Both models outperform the 86.6M-parameter baseline (81.05%) and exhibit substantially improved training stability, reducing peak-to-final accuracy degradation from 0.47% to the range 0.03% to 0.06%. These results suggest that, for ViT-B/16 on ImageNet-1K with a standard training recipe, the model operates in an overparameterized regime in which MLP capacity can be reduced without harming performance and can even slightly improve it. More broadly, our findings suggest that architectural constraints such as parameter sharing and reduced width may act as useful inductive biases, and highlight the importance of how parameters are allocated when designing Vision Transformers. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/parameter-efficient-vit-mlps.
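The two variants can be sketched in a few lines of PyTorch. This is a hypothetical illustration of the parameter accounting, not the authors' implementation (which lives in the linked repository); it assumes the standard ViT-B/16 configuration of 12 blocks with embedding dimension 768 and MLP ratio 4.

```python
import torch.nn as nn

class MLP(nn.Module):
    """Standard ViT MLP block: Linear -> GELU -> Linear."""
    def __init__(self, dim=768, hidden=3072):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))

# ShallowMLP: halve the hidden dimension (3072 -> 1536); fewer parameters
# and FLOPs, hence the reported throughput gain.
shallow_mlps = nn.ModuleList([MLP(dim=768, hidden=1536) for _ in range(12)])

# GroupedMLP: adjacent blocks share one MLP instance, so 12 blocks hold only
# 6 distinct MLPs; per-block compute is unchanged, only parameters are shared.
shared = [MLP(dim=768, hidden=3072) for _ in range(6)]
grouped_mlps = nn.ModuleList([shared[i // 2] for i in range(12)])
```

Either change removes about half of the MLP parameters, which is consistent with the 32.7% overall reduction quoted in the abstract, since the MLP blocks hold roughly two thirds of ViT-B/16's 86.6M parameters.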
Related papers
- Diversity-Guided MLP Reduction for Efficient Large Vision Transformers [62.33249256133204]
Transformer models exhibit excellent scaling properties, with performance improving as model capacity grows. However, large-scale models incur prohibitive compute and memory costs. We propose a Diversity-Guided MLP Reduction (DGMR) method to significantly reduce the parameters of large vision transformers.
arXiv Detail & Related papers (2025-06-10T08:59:27Z) - SpiLiFormer: Enhancing Spiking Transformers with Lateral Inhibition [29.724968607408048]
Spiking Neural Networks (SNNs) based on Transformers have garnered significant attention due to their superior performance and high energy efficiency. We propose a Lateral Inhibition-inspired Spiking Transformer (SpiLiFormer) to address the issue of over-allocating attention to irrelevant contexts. SpiLiFormer emulates the brain's lateral inhibition mechanism, guiding the model to enhance attention to relevant tokens while suppressing attention to irrelevant ones.
arXiv Detail & Related papers (2025-03-20T09:36:31Z) - SVFT: Parameter-Efficient Fine-Tuning with Singular Vectors [80.6043267994434]
We propose SVFT, a simple approach that fundamentally differs from existing methods.
SVFT updates the weight matrix W as a sparse combination of outer products of its singular vectors, training only the coefficients (scales) of these sparse combinations.
Experiments on language and vision benchmarks show that SVFT recovers up to 96% of full fine-tuning performance while training only 0.006% to 0.25% of parameters.
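A rough sketch of this idea in PyTorch, assuming the plain diagonal variant: the pretrained weight is frozen, its singular vectors are precomputed once, and only a vector of per-direction coefficients is trained. Class and attribute names here are illustrative, not the authors' API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SVFTLinear(nn.Module):
    def __init__(self, weight, bias=None):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("W", weight)   # frozen pretrained weight
        self.register_buffer("U", U)        # frozen left singular vectors
        self.register_buffer("Vh", Vh)      # frozen right singular vectors
        self.bias = bias
        # Trainable coefficients, one per singular direction; initialised to 0
        # so training starts exactly from the pretrained weight.
        self.m = nn.Parameter(torch.zeros(S.shape[0]))

    def forward(self, x):
        delta = self.U @ torch.diag(self.m) @ self.Vh  # sum_i m_i * u_i v_i^T
        return F.linear(x, self.W + delta, self.bias)
```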
arXiv Detail & Related papers (2024-05-30T01:27:43Z) - From PEFT to DEFT: Parameter Efficient Finetuning for Reducing Activation Density in Transformers [52.199303258423306]
We propose a novel density loss that encourages higher activation sparsity in pre-trained models.
Our proposed method, DEFT, can consistently reduce activation density by up to 44.94% on RoBERTa-Large and by 53.19% (encoder density) and 90.60% (decoder density) on Flan-T5-XXL.
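The density loss itself is not spelled out in this summary. As a hedged illustration, one simple differentiable proxy penalizes how "dense" the MLP activations are (a Hoyer-style ratio) and is added to the task loss during finetuning; the paper's exact formulation may differ.

```python
import torch

def density_loss(acts: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Hoyer-style density proxy: close to 1 for dense activation vectors,
    smaller for sparse ones, so minimizing it encourages activation sparsity."""
    l1 = acts.abs().sum(dim=-1)
    l2 = acts.norm(dim=-1) + eps
    n = acts.shape[-1]
    return (l1 / (l2 * n ** 0.5)).mean()

# total_loss = task_loss + lam * density_loss(mlp_hidden_activations)
```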
arXiv Detail & Related papers (2024-02-02T21:25:46Z) - SCHEME: Scalable Channel Mixer for Vision Transformers [52.605868919281086]
Vision Transformers have achieved impressive performance in many computer vision tasks. We show that the dense connections can be replaced with a sparse block diagonal structure that supports larger expansion ratios. We also propose the use of a lightweight, parameter-free, channel covariance attention mechanism as a parallel branch during training.
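The block-diagonal replacement can be sketched with grouped 1x1 convolutions: splitting the MLP projections into g independent groups cuts their parameters roughly g-fold, which frees budget for a larger expansion ratio. This is an illustrative sketch, not the paper's code; the channel covariance attention branch is omitted.

```python
import torch.nn as nn

class BlockDiagonalMLP(nn.Module):
    """MLP whose projections are block-diagonal (g independent groups)."""
    def __init__(self, dim=768, expansion=8, groups=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Conv1d(dim, hidden, kernel_size=1, groups=groups)
        self.act = nn.GELU()
        self.fc2 = nn.Conv1d(hidden, dim, kernel_size=1, groups=groups)

    def forward(self, x):            # x: (batch, tokens, dim)
        x = x.transpose(1, 2)        # Conv1d expects (batch, dim, tokens)
        x = self.fc2(self.act(self.fc1(x)))
        return x.transpose(1, 2)
```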
arXiv Detail & Related papers (2023-12-01T08:22:34Z) - Scaling & Shifting Your Features: A New Baseline for Efficient Model Tuning [126.84770886628833]
Existing finetuning methods either tune all parameters of the pretrained model (full finetuning) or only tune the last linear layer (linear probing).
We propose a new parameter-efficient finetuning method termed SSF, where one only needs to Scale and Shift the deep Features extracted by a pre-trained model to match the performance of full finetuning.
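The core operation is tiny: freeze the backbone and learn only a per-channel scale and shift applied to intermediate features. A minimal sketch, with illustrative names:

```python
import torch
import torch.nn as nn

class ScaleShift(nn.Module):
    """Trainable per-channel affine transform inserted after frozen layers."""
    def __init__(self, dim: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))   # scale
        self.beta = nn.Parameter(torch.zeros(dim))   # shift

    def forward(self, x):          # x: (..., dim)
        return x * self.gamma + self.beta
```

Because the transform is linear, it can be merged into the adjacent frozen weights after training, so inference cost matches the original model.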
arXiv Detail & Related papers (2022-10-17T08:14:49Z) - Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model, which uses a uniform dimension across all the stacked blocks in a model stage.
We derive a novel Hessian-based structural pruning criterion that is comparable across all layers and structures, with latency-aware regularization for direct latency reduction.
Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently.
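As a hedged illustration of a Hessian-aware saliency score (the paper's exact criterion and its latency-aware regularization are more involved), a common second-order importance estimate weights each parameter by an approximate diagonal Hessian and aggregates the result per prunable structure:

```python
import torch

def row_saliency(weight: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Per-output-neuron importance for a 2D weight (out_features, in_features).
    Uses 0.5 * H_ii * w_i^2 with the diagonal Hessian approximated by the
    squared gradient (empirical Fisher)."""
    per_param = 0.5 * (grad ** 2) * (weight ** 2)
    return per_param.sum(dim=1)   # aggregate over inputs -> one score per row
```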
arXiv Detail & Related papers (2021-10-10T18:04:59Z) - Sparse MLP for Image Recognition: Is Self-Attention Really Necessary? [65.37917850059017]
We build an attention-free network called sMLPNet.
For 2D image tokens, sMLP applies a 1D MLP along the axial directions, with parameters shared among rows or columns.
When scaling up to 66M parameters, sMLPNet achieves 83.4% top-1 accuracy, which is on par with the state-of-the-art Swin Transformer.
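The axial mixing can be sketched as two small linear layers, one mixing tokens along image rows and one along columns, with weights shared across all rows (respectively columns). This is an illustrative simplification that omits the identity and channel-mixing paths of the full sMLP block.

```python
import torch
import torch.nn as nn

class AxialSparseMLP(nn.Module):
    def __init__(self, h: int, w: int):
        super().__init__()
        self.mix_w = nn.Linear(w, w)   # one weight matrix shared by all rows
        self.mix_h = nn.Linear(h, h)   # one weight matrix shared by all columns

    def forward(self, x):              # x: (batch, channels, h, w)
        rows = self.mix_w(x)                                      # mix along width
        cols = self.mix_h(x.transpose(-1, -2)).transpose(-1, -2)  # mix along height
        return rows + cols
```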
arXiv Detail & Related papers (2021-09-12T04:05:15Z)