Improving Recursive Transformers with Mixture of LoRAs
- URL: http://arxiv.org/abs/2512.12880v2
- Date: Wed, 17 Dec 2025 18:41:37 GMT
- Title: Improving Recursive Transformers with Mixture of LoRAs
- Authors: Mohammadmahdi Nouriborji, Morteza Rohanian, Omid Rohanian
- Abstract summary: Mixture of LoRAs (MoL) inserts Low-Rank Adaptation (LoRA) experts inside a shared feed-forward network (FFN). MoL enables token-conditional weight-space modulation of the shared FFN without untying backbone parameters. ModernALBERT achieves state-of-the-art performance among compact models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Parameter sharing in recursive transformers reduces model size but collapses layer-wise expressivity. We propose Mixture of LoRAs (MoL), a lightweight conditional-computation mechanism that inserts Low-Rank Adaptation (LoRA) experts inside a shared feed-forward network (FFN). MoL enables token-conditional weight-space modulation of the shared FFN without untying backbone parameters, unlike prior approaches that add fixed or externally attached adapters. We pretrain a modernised recursive architecture, ModernALBERT, integrating rotary embeddings, GeGLU, FlashAttention, and a distillation-based initialisation. Across GLUE, SQuAD-v2, and BEIR, ModernALBERT (50M--120M) achieves state-of-the-art performance among compact models and surpasses larger fully parameterised baselines. We also propose an expert-merging procedure that compresses MoL into a single adapter at inference while preserving accuracy, enabling efficient deployment. Our results show that conditional weight-space modulation effectively restores the expressivity lost under aggressive parameter sharing in recursive transformers.
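Since the listing gives only the abstract, the sketch below is a rough PyTorch illustration of the mechanism it describes: LoRA experts inserted inside a shared FFN, routed per token, and later merged into a single adapter for inference. The module names, soft-routing scheme, plain-GELU FFN (the paper uses GeGLU), and averaging-based merge rule are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of Mixture of LoRAs (MoL) inside a shared FFN.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """One low-rank delta (B @ A) applied on top of the frozen shared up-projection."""

    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero init: no effect at start

    def delta(self) -> torch.Tensor:
        return self.B @ self.A  # (d_out, d_in) low-rank weight update


class MoLSharedFFN(nn.Module):
    """A shared FFN reused across recursive layers, modulated per token by LoRA experts."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 4, rank: int = 8):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)    # tied backbone weights
        self.down = nn.Linear(d_ff, d_model)
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([LoRAExpert(d_model, d_ff, rank) for _ in range(n_experts)])
        self.merged = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        h = self.up(x)
        if not self.merged:
            gate = F.softmax(self.router(x), dim=-1)  # (B, S, E) token-conditional routing
            expert_h = torch.stack([F.linear(x, e.delta()) for e in self.experts], dim=-1)
            h = h + (expert_h * gate.unsqueeze(-2)).sum(-1)  # weight-space modulation of the shared FFN
        return self.down(F.gelu(h))

    @torch.no_grad()
    def merge_experts(self, avg_gate: torch.Tensor) -> None:
        """Collapse the expert mixture into a single adapter folded into the backbone.

        `avg_gate` (shape: n_experts) is assumed here to be an average routing
        probability measured on held-out data; the paper's merging criterion may differ.
        """
        merged_delta = sum(w * e.delta() for w, e in zip(avg_gate, self.experts))
        self.up.weight.add_(merged_delta)
        self.merged = True  # experts and router are no longer used at inference


ffn = MoLSharedFFN(d_model=384, d_ff=1536)
y_train = ffn(torch.randn(2, 16, 384))               # training-time mixture
ffn.merge_experts(avg_gate=torch.full((4,), 0.25))    # compress for deployment
y_infer = ffn(torch.randn(2, 16, 384))
```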
Related papers
- High-Rank Structured Modulation for Parameter-Efficient Fine-Tuning [57.85676271833619]
Low-rank Adaptation (LoRA) uses a low-rank update method to simulate full-parameter fine-tuning. We present SMoA, a high-rank Structured MOdulation Adapter that uses fewer trainable parameters while maintaining a higher rank.
arXiv Detail & Related papers (2026-01-12T13:06:17Z)
- MSLoRA: Multi-Scale Low-Rank Adaptation via Attention Reweighting [6.335488846185043]
MSLoRA is a backbone-agnostic, parameter-efficient adapter that reweights feature responses rather than re-tuning the underlying backbone. MSLoRA unifies adaptation for both convolutional neural networks (CNNs) and vision transformers (ViTs).
arXiv Detail & Related papers (2025-11-16T00:35:37Z)
- Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression [14.086434595924716]
Mixture-of-Experts (MoE) Large Language Models (LLMs) face a trilemma of load imbalance, parameter redundancy, and communication overhead. We introduce a unified framework based on dynamic expert clustering and structured compression to address these issues cohesively.
arXiv Detail & Related papers (2025-09-27T10:45:58Z)
- Exploring Sparse Adapters for Scalable Merging of Parameter Efficient Experts [72.22148263683037]
We study the properties of sparse adapters, which train only a subset of weights in the base neural network, as potential building blocks of modular architectures. First, we propose a simple method for training highly effective sparse adapters, which is conceptually simpler than existing methods in the literature. Next, we investigate the merging properties of these sparse adapters by merging adapters for up to 20 natural language processing tasks.
arXiv Detail & Related papers (2025-07-09T03:25:45Z)
- Hyper-Transforming Latent Diffusion Models [13.183678451110938]
We introduce a novel generative framework for functions by integrating Implicit Neural Representations (INRs) and Transformer-based hypernetworks into latent variable models. Our framework extends latent diffusion models (LDMs) to INR generation by replacing standard decoders with a Transformer-based hypernetwork. This enables efficient adaptation of existing generative models to INR-based representations without requiring full retraining.
arXiv Detail & Related papers (2025-04-23T10:01:18Z)
- HAFLQ: Heterogeneous Adaptive Federated LoRA Fine-tuned LLM with Quantization [55.972018549438964]
Federated fine-tuning of pre-trained Large Language Models (LLMs) enables task-specific adaptation across diverse datasets while preserving privacy. We propose HAFLQ (Heterogeneous Adaptive Federated Low-Rank Adaptation Fine-tuned LLM with Quantization), a novel framework for efficient and scalable fine-tuning of LLMs in heterogeneous environments. Experimental results on the text classification task demonstrate that HAFLQ reduces memory usage by 31%, lowers communication cost by 49%, improves accuracy by 50%, and achieves faster convergence compared to the baseline method.
arXiv Detail & Related papers (2024-11-10T19:59:54Z)
- Replay-Free Continual Low-Rank Adaptation with Dynamic Memory [62.85596937435928]
We revisit continual learning (CL), which enables pre-trained vision transformers (ViTs) to sequentially fine-tune on new downstream tasks over time. Recent studies highlight a crossover between CL techniques and parameter-efficient fine-tuning (PEFT). We propose a novel PEFT-CL method called Dual Low-Rank Adaptation (DualLoRA).
arXiv Detail & Related papers (2024-11-01T14:28:39Z)
- Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA [38.30350849992281]
"Recursive" language models share parameters across layers with minimal loss of performance.<n>Recursive Transformers are efficiently from standard pretrained Transformers, but only use a single block of unique layers that is then repeated multiple times in a loop.<n>We show that our models outperform both similar-sized vanilla pretrained models and knowledge distillation baselines.
arXiv Detail & Related papers (2024-10-28T02:15:45Z) - Sketch to Adapt: Fine-Tunable Sketches for Efficient LLM Adaptation [33.05581803204543]
Adapting pre-trained large language models (LLMs) is crucial but challenging due to their enormous size. We introduce SketchTune, a compressive adaptation strategy that compresses weights into compact fine-tunable sketches. SketchTune is supported by mathematical insights into matrix classes that are better approximated using sketching rather than low-rank methods.
arXiv Detail & Related papers (2024-10-08T20:58:24Z)
- LoRTA: Low Rank Tensor Adaptation of Large Language Models [70.32218116940393]
Low-Rank Adaptation (LoRA) is a popular Parameter-Efficient Fine-Tuning (PEFT) method. We propose a higher-order Candecomp/Parafac (CP) decomposition, enabling a more compact and flexible representation. Our method can achieve a reduction in the number of parameters while maintaining comparable performance.
arXiv Detail & Related papers (2024-10-05T06:59:50Z)
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)