T-REX: Mixture-of-Rank-One-Experts with Semantic-aware Intuition for Multi-task Large Language Model Finetuning
- URL: http://arxiv.org/abs/2404.08985v2
- Date: Tue, 27 May 2025 06:29:38 GMT
- Title: T-REX: Mixture-of-Rank-One-Experts with Semantic-aware Intuition for Multi-task Large Language Model Finetuning
- Authors: Rongyu Zhang, Yijiang Liu, Huanrui Yang, Shenli Zheng, Dan Wang, Yuan Du, Li Du, Shanghang Zhang
- Abstract summary: Large language models (LLMs) encounter significant adaptation challenges in diverse multitask finetuning. We design a novel framework, mixTure-of-Rank-onE-eXperts (T-REX). Rank-1 experts enable a mix-and-match mechanism that quadratically expands the vector subspace of experts with linear parameter overheads, achieving approximate error reduction with optimal efficiency.
- Score: 31.276142111455847
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) encounter significant adaptation challenges in diverse multitask finetuning. Mixture-of-experts (MoE) provides a promising solution with a dynamic architecture, enabling effective task decoupling. However, scaling up the number of MoE experts incurs substantial parameter and computational overheads and suffers from limited performance gains due to naive routing mechanisms. In this paper, we design a novel framework, mixTure-of-Rank-onE-eXperts (T-REX), which leverages the combination of ultra-low-rank experts to construct LoRA weights on pretrained LLMs. The rank-1 experts enable a mix-and-match mechanism to quadratically expand the vector subspace of experts with linear parameter overheads, achieving approximate error reduction with optimal efficiency. In addition, T-REX offers implicit guidance to the router, leveraging the inherent semantic clustering of training embeddings as prior knowledge, enabling optimized feature allocation across experts for smoother convergence. Extensive theoretical and empirical results demonstrate that T-REX achieves superior efficiency and generalizability across diverse tasks. Compared with other LoRA-based methods, T-REX achieves up to 1.78% mean accuracy improvement with around 30%-40% fewer trainable parameters across 14 public datasets. Code is available at https://github.com/RoyZry98/T-REX-Pytorch.
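To make the mix-and-match mechanism concrete, here is a minimal PyTorch sketch, written under our own assumptions rather than from the authors' code: the class, the factored two-sided softmax router, and all parameter names are invented. With m A-side vectors and m B-side vectors, routing the two sides independently reaches all m^2 rank-1 outer products b_j a_i^T while training only 2m vectors plus a small router.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RankOneMixLoRA(nn.Module):
    """Hypothetical mixture of rank-1 experts forming a LoRA-style update.

    m "down" vectors (A-side) and m "up" vectors (B-side) mix and match
    into m*m rank-1 outer products b_j a_i^T, so the reachable expert
    subspace grows quadratically while parameters grow linearly in m.
    """

    def __init__(self, d_in, d_out, num_experts=4, scale=1.0):
        super().__init__()
        self.m = num_experts
        self.A = nn.Parameter(torch.randn(num_experts, d_in) * 0.01)  # rank-1 down vectors
        self.B = nn.Parameter(torch.zeros(num_experts, d_out))        # rank-1 up vectors
        self.router = nn.Linear(d_in, 2 * num_experts)                # A-side and B-side gates
        self.scale = scale

    def forward(self, x):                              # x: (batch, d_in)
        logits = self.router(x)
        wa = F.softmax(logits[:, :self.m], dim=-1)     # weights over A-side vectors
        wb = F.softmax(logits[:, self.m:], dim=-1)     # weights over B-side vectors
        a = x @ self.A.t()                             # (batch, m) responses a_i . x
        s = (wa * a).sum(dim=-1, keepdim=True)         # mixed A-side activation
        return self.scale * s * (wb @ self.B)          # = sum_ij wa_i wb_j (a_i . x) b_j

# Usage: add the mixture output to a frozen pretrained projection.
base = nn.Linear(768, 768, bias=False).requires_grad_(False)
delta = RankOneMixLoRA(768, 768, num_experts=4)
x = torch.randn(2, 768)
y = base(x) + delta(x)                                 # (2, 768)
```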
Related papers
- Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models [58.54288496296157]
Chain-of-Experts (CoE) is a new Mixture-of-Experts (MoE) architecture that introduces sequential expert communication within each layer. To support dynamic expert selection across iterations, CoE employs a dedicated router at each step within a layer.
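As a rough illustration of what sequential expert communication inside one layer might look like, here is a hedged PyTorch sketch; the soft routing, residual update per step, and all names are our assumptions, not the CoE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChainOfExpertsLayer(nn.Module):
    """Toy sketch: experts applied sequentially within one layer.

    At each of `steps` iterations, a step-specific router produces a
    soft mixture of expert outputs, and the mixed result feeds the
    next iteration, so experts "communicate" along the chain.
    """

    def __init__(self, d_model, num_experts=4, steps=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts))
        self.routers = nn.ModuleList(nn.Linear(d_model, num_experts)
                                     for _ in range(steps))  # one router per step

    def forward(self, h):                                    # h: (batch, d_model)
        for router in self.routers:
            gates = F.softmax(router(h), dim=-1)                      # (batch, E)
            outs = torch.stack([e(h) for e in self.experts], dim=1)  # (batch, E, d)
            h = h + (gates.unsqueeze(-1) * outs).sum(dim=1)          # residual update
        return h
```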
arXiv Detail & Related papers (2025-06-23T02:15:43Z)
- MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models [61.89384981175277]
We propose a heterogeneous Mixture-of-Adapters (MoA) approach to integrate Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE). Experimental results demonstrate that heterogeneous MoA outperforms homogeneous MoE-LoRA methods in both performance and parameter efficiency.
arXiv Detail & Related papers (2025-06-06T09:54:19Z)
- Multi-objective Large Language Model Alignment with Hierarchical Experts [39.14442626829845]
HoE consists of three hierarchical components: LoRA Experts, Router Experts, and Preference Routing. We evaluate HoE across various tasks on 14 objectives and 200 different preferences among 6 benchmarks, demonstrating superior performance over 15 recent baselines.
arXiv Detail & Related papers (2025-05-27T09:15:03Z)
- TT-LoRA MoE: Unifying Parameter-Efficient Fine-Tuning and Sparse Mixture-of-Experts [4.5558042369389105]
TT-LoRA MoE decomposes training into two distinct optimization stages. First, we independently train lightweight, tensorized low-rank adapters (TT-LoRA experts). Subsequently, these expert adapters remain frozen, eliminating inter-task interference and forgetting in multi-task settings. A sparse MoE router, trained separately, dynamically leverages base-model representations to select exactly one specialized adapter per input at inference time. Comprehensive experiments confirm that our architecture retains the memory efficiency of low-rank adapters, seamlessly scales to large expert pools, and achieves robust task-level optimization.
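For intuition, a minimal sketch of the second stage under simplifying assumptions: plain low-rank pairs stand in for the tensorized TT-LoRA experts, the router does hard top-1 selection as at inference, and the per-sample loop plus all names are purely illustrative.

```python
import torch
import torch.nn as nn

class FrozenAdapterTop1Router(nn.Module):
    """Sketch of routing over frozen, independently trained adapters.

    Each adapter is a plain low-rank pair standing in for a tensorized
    TT-LoRA expert. Adapters are frozen, so tasks cannot interfere;
    in stage 2 only the router would train (e.g., as a task classifier).
    """

    def __init__(self, d_in, d_out, num_tasks, rank=4):
        super().__init__()
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(d_in, rank, bias=False),
                          nn.Linear(rank, d_out, bias=False))
            for _ in range(num_tasks))
        for p in self.adapters.parameters():
            p.requires_grad_(False)              # stage 1 already trained these
        self.router = nn.Linear(d_in, num_tasks)

    def forward(self, x):                        # x: (batch, d_in)
        choice = self.router(x).argmax(dim=-1)   # exactly one adapter per input
        return torch.stack([self.adapters[i](x[b])
                            for b, i in enumerate(choice.tolist())])
```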
arXiv Detail & Related papers (2025-04-29T21:46:43Z)
- MaZO: Masked Zeroth-Order Optimization for Multi-Task Fine-Tuning of Large Language Models [26.980104922985326]
We present MaZO, the first framework specifically designed for multi-task LLM fine-tuning under zeroth-order (ZO) optimization.
MaZO tackles these challenges at the parameter level through two key innovations: a weight importance metric to identify critical parameters and a multi-task weight update mask to selectively update these parameters.
Experiments demonstrate that MaZO achieves state-of-the-art performance, surpassing even multi-task learning methods designed for first-order optimization.
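For intuition, here is a hedged sketch of one masked SPSA-style zeroth-order step; MaZO's actual weight-importance metric is not shown, the masks are assumed to be precomputed 0/1 tensors, and the function name is invented.

```python
import torch

def masked_zo_step(params, masks, loss_fn, lr=1e-4, eps=1e-3):
    """One masked zeroth-order (SPSA-style) update, as a sketch.

    `masks` mark parameters deemed critical across tasks; both the
    random perturbation and the update touch only masked entries,
    mimicking a selective multi-task weight update.
    """
    zs = [torch.randn_like(p) * m for p, m in zip(params, masks)]
    with torch.no_grad():
        for p, z in zip(params, zs):
            p.add_(eps * z)                        # probe at theta + eps*z
        loss_plus = loss_fn()
        for p, z in zip(params, zs):
            p.sub_(2 * eps * z)                    # probe at theta - eps*z
        loss_minus = loss_fn()
        for p, z in zip(params, zs):
            p.add_(eps * z)                        # restore theta
        g = (loss_plus - loss_minus) / (2 * eps)   # directional derivative estimate
        for p, z in zip(params, zs):
            p.sub_(lr * g * z)                     # update only masked entries
    return (loss_plus + loss_minus) / 2
```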
arXiv Detail & Related papers (2025-02-17T07:28:52Z)
- Meta-Sparsity: Learning Optimal Sparse Structures in Multi-task Networks through Meta-learning [4.462334751640166]
Meta-sparsity is a framework for learning model sparsity that allows deep neural networks (DNNs) to generate optimal sparse shared structures in a multi-task learning setting.
Inspired by Model Agnostic Meta-Learning (MAML), the emphasis is on learning shared and optimally sparse parameters in multi-task scenarios.
The effectiveness of meta-sparsity is rigorously evaluated by extensive experiments on two datasets.
arXiv Detail & Related papers (2025-01-21T13:25:32Z)
- Transforming Vision Transformer: Towards Efficient Multi-Task Asynchronous Learning [59.001091197106085]
Multi-Task Learning (MTL) for Vision Transformers aims to enhance model capability by tackling multiple tasks simultaneously.
Most recent works have predominantly focused on designing Mixture-of-Experts (MoE) structures and integrating Low-Rank Adaptation (LoRA) to efficiently perform multi-task learning.
We propose a novel approach dubbed Efficient Multi-Task Learning (EMTAL) by transforming a pre-trained Vision Transformer into an efficient multi-task learner.
arXiv Detail & Related papers (2025-01-12T17:41:23Z)
- ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by the Kronecker product to Aggregate Low Rank Experts. Thanks to this artful design, ALoRE adds a negligible number of extra parameters and can be effortlessly merged into the frozen backbone.
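A small sketch of the aggregation idea, under the assumption that each expert contributes a Kronecker product of two small factors and that the summed update is folded into the frozen weight; the class and shapes are our invention, not the ALoRE code.

```python
import torch
import torch.nn as nn

class KroneckerLowRankExperts(nn.Module):
    """Sketch: a weight update built as a sum of Kronecker products.

    Each expert e contributes kron(S_e, U_e); the aggregate has the
    full shape (d_out, d_in), so it can be merged into a frozen weight
    at inference, leaving no extra inference-time modules.
    """

    def __init__(self, d_out, d_in, num_experts=4, block=4):
        super().__init__()
        assert d_out % block == 0 and d_in % block == 0
        self.S = nn.Parameter(torch.randn(num_experts, block, block) * 0.02)
        self.U = nn.Parameter(torch.zeros(num_experts, d_out // block, d_in // block))

    def delta_weight(self):
        return sum(torch.kron(self.S[e], self.U[e])
                   for e in range(self.S.shape[0]))       # (d_out, d_in)

# Merging the aggregated update into a frozen backbone weight:
frozen = nn.Linear(64, 64, bias=False).requires_grad_(False)
experts = KroneckerLowRankExperts(64, 64)
with torch.no_grad():
    frozen.weight.add_(experts.delta_weight())
```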
arXiv Detail & Related papers (2024-12-11T12:31:30Z)
- Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose Read-ME, a novel framework that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z)
- M$^2$PT: Multimodal Prompt Tuning for Zero-shot Instruction Learning [90.75075886543404]
Multimodal Large Language Models (MLLMs) demonstrate remarkable performance across a wide range of domains.
In this work, we introduce a novel Multimodal Prompt Tuning (M$^2$PT) approach for efficient instruction tuning of MLLMs.
arXiv Detail & Related papers (2024-09-24T01:40:24Z)
- FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves performance comparable to the source model, retaining up to 85% of its performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
- MoDE: Effective Multi-task Parameter Efficient Fine-Tuning with a Mixture of Dyadic Experts [6.245113492272563]
Mixture of Dyadic Experts (MoDE) is a novel design for efficient multi-task adaptation.
Our design allows for more fine-grained mixing, thereby increasing the model's ability to jointly handle multiple tasks.
arXiv Detail & Related papers (2024-08-02T18:05:10Z)
- Expert-Token Resonance MoE: Bidirectional Routing with Efficiency Affinity-Driven Active Selection [16.062265609569003]
Mixture-of-Experts (MoE) architectures have emerged as a paradigm-shifting approach for large language models (LLMs). We propose a novel expert routing framework that incorporates: (1) an efficient routing mechanism with lightweight computation; (2) an adaptive bidirectional selection mechanism leveraging resonance between experts and tokens; and (3) a module that determines the lower bounds of expert capacity based on dynamic token-distribution analysis.
arXiv Detail & Related papers (2024-05-24T02:50:44Z)
- Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training [73.90260246781435]
We present Lory, the first approach that scales such architectures to autoregressive language model pre-training.
We show significant performance gains over parameter-matched dense models on both perplexity and a variety of downstream tasks.
Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing.
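A minimal sketch of the parameter-merging trick that makes such an MoE fully differentiable, assuming one soft routing decision per segment; names are illustrative, and Lory's similarity-based data batching is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftMergedExpertFFN(nn.Module):
    """Sketch of expert *parameter* merging for a differentiable MoE.

    Instead of discretely picking experts, the routing weights build
    one merged FFN weight per segment, so the whole layer stays
    end-to-end differentiable.
    """

    def __init__(self, d_model, d_ff, num_experts=4):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(num_experts, d_model, d_ff) * 0.02)
        self.w2 = nn.Parameter(torch.randn(num_experts, d_ff, d_model) * 0.02)
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, seg):                       # seg: (seg_len, d_model)
        gates = F.softmax(self.router(seg.mean(dim=0)), dim=-1)  # one route per segment
        w1 = torch.einsum('e,eij->ij', gates, self.w1)           # merged weights
        w2 = torch.einsum('e,eij->ij', gates, self.w2)
        return F.gelu(seg @ w1) @ w2
```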
arXiv Detail & Related papers (2024-05-06T03:06:33Z)
- Multi-Head Mixture-of-Experts [100.60556163597946]
We propose Multi-Head Mixture-of-Experts (MH-MoE), which employs a multi-head mechanism to split each token into multiple sub-tokens.
MH-MoE is straightforward to implement and decouples from other SMoE optimization methods, making it easy to integrate with other SMoE models for enhanced performance.
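A compact sketch of the sub-token idea; dense softmax gating is used here for clarity where the paper routes sparsely, and all names plus the output merge layer are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadMoE(nn.Module):
    """Sketch: split each token into `heads` sub-tokens, route every
    sub-token to experts independently, then merge them back."""

    def __init__(self, d_model, num_experts=4, heads=4):
        super().__init__()
        assert d_model % heads == 0
        self.heads, self.d_sub = heads, d_model // heads
        self.experts = nn.ModuleList(
            nn.Linear(self.d_sub, self.d_sub) for _ in range(num_experts))
        self.router = nn.Linear(self.d_sub, num_experts)
        self.merge = nn.Linear(d_model, d_model)

    def forward(self, x):                              # x: (batch, d_model)
        sub = x.view(-1, self.d_sub)                   # (batch*heads, d_sub)
        gates = F.softmax(self.router(sub), dim=-1)    # per-sub-token routing
        outs = torch.stack([e(sub) for e in self.experts], dim=1)
        mixed = (gates.unsqueeze(-1) * outs).sum(dim=1)
        return self.merge(mixed.view(-1, self.heads * self.d_sub))
```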
arXiv Detail & Related papers (2024-04-23T13:47:09Z)
- Exploring the Transferability of Visual Prompting for Multimodal Large Language Models [47.162575147632396]
Transferable Visual Prompting (TVP) is a simple and effective approach to generating visual prompts that transfer to different models and improve their performance on downstream tasks after being trained on only one model.
We introduce two strategies to address the issue of cross-model feature corruption of existing visual prompting methods and enhance the transferability of the learned prompts.
arXiv Detail & Related papers (2024-04-17T09:39:07Z)
- Multi-Task Dense Prediction via Mixture of Low-Rank Experts [35.11968315125389]
We present a novel decoder-focused method for multi-task dense prediction, called Mixture-of-Low-Rank-Experts (MLoRE).
To model the global task relationships, MLoRE adds a generic convolution path to the original MoE structure, where each task feature can go through this path for explicit parameter sharing.
Our experiments show that our MLoRE achieves superior performance compared to previous state-of-the-art methods on all metrics.
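A rough sketch of the generic-path idea for one task's feature map, with low-rank convolutional experts and a globally pooled router as simplifying assumptions; the names are invented, not the MLoRE code.

```python
import torch
import torch.nn as nn

class GenericPathMoLRE(nn.Module):
    """Sketch: low-rank conv experts plus a generic path shared by all
    tasks, echoing explicit cross-task parameter sharing."""

    def __init__(self, channels, num_experts=3, rank=8):
        super().__init__()
        self.generic = nn.Conv2d(channels, channels, 3, padding=1)  # every task uses this
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, rank, 1),
                          nn.Conv2d(rank, channels, 3, padding=1))
            for _ in range(num_experts))
        self.router = nn.Linear(channels, num_experts)

    def forward(self, feat):                    # feat: (batch, C, H, W)
        gates = torch.softmax(self.router(feat.mean(dim=(2, 3))), dim=-1)
        out = self.generic(feat)                # shared generic path
        for i, expert in enumerate(self.experts):
            out = out + gates[:, i, None, None, None] * expert(feat)
        return out
```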
arXiv Detail & Related papers (2024-03-26T14:40:17Z)
- Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models [90.14693869269519]
MoE LLMs can achieve higher performance with fewer parameters, but it is still hard to deploy them due to their immense parameter sizes.
This paper mainly aims to enhance the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques.
arXiv Detail & Related papers (2024-02-22T18:56:07Z)
- Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference [3.217776693788795]
We propose a lightweight optimization technique called ExFlow to substantially accelerate the inference of pre-trained MoE models.
By exploiting the inter-layer expert affinity, our solution can be directly applied to pre-trained MoE models without any fine-tuning or accuracy degradation.
Our solution beats cutting-edge MoE implementations with expert counts ranging from 8 to 64, achieving up to a 2.2x improvement in inference throughput.
arXiv Detail & Related papers (2024-01-16T14:16:47Z)
- When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Task Aware Feature Extraction Framework for Sequential Dependence Multi-Task Learning [1.0765359420035392]
We analyze sequential dependence MTL from a rigorous mathematical perspective.
We propose a Task Aware Feature Extraction (TAFE) framework for sequential dependence MTL.
arXiv Detail & Related papers (2023-01-06T13:12:59Z)