Mixture-of-Experts Meets In-Context Reinforcement Learning
- URL: http://arxiv.org/abs/2506.05426v1
- Date: Thu, 05 Jun 2025 06:29:14 GMT
- Title: Mixture-of-Experts Meets In-Context Reinforcement Learning
- Authors: Wenhao Wu, Fuhong Liu, Haoru Li, Zican Hu, Daoyi Dong, Chunlin Chen, Zhi Wang
- Abstract summary: In this paper, we introduce T2MIR (Token- and Task-wise MoE for In-context RL), an innovative framework that introduces architectural advances of mixture-of-experts (MoE) into transformer-based decision models. Comprehensive experiments show that T2MIR significantly facilitates in-context learning capacity and outperforms various types of baselines.
- Score: 29.866936147753368
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In-context reinforcement learning (ICRL) has emerged as a promising paradigm for adapting RL agents to downstream tasks through prompt conditioning. However, two notable challenges remain in fully harnessing in-context learning within RL domains: the intrinsic multi-modality of the state-action-reward data and the diverse, heterogeneous nature of decision tasks. To tackle these challenges, we propose T2MIR (Token- and Task-wise MoE for In-context RL), an innovative framework that introduces architectural advances of mixture-of-experts (MoE) into transformer-based decision models. T2MIR substitutes the feedforward layer with two parallel layers: a token-wise MoE that captures distinct semantics of input tokens across multiple modalities, and a task-wise MoE that routes diverse tasks to specialized experts for managing a broad task distribution with alleviated gradient conflicts. To enhance task-wise routing, we introduce a contrastive learning method that maximizes the mutual information between the task and its router representation, enabling more precise capture of task-relevant information. The outputs of the two MoE components are concatenated and fed into the next layer. Comprehensive experiments show that T2MIR significantly facilitates in-context learning capacity and outperforms various types of baselines. We bring the potential and promise of MoE to ICRL, offering a simple and scalable architectural enhancement to advance ICRL one step closer toward achievements in the language and vision communities. Our code is available at https://github.com/NJU-RL/T2MIR.
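To make the layer design above concrete, here is a minimal PyTorch sketch of a feedforward block replaced by two parallel MoE branches whose outputs are concatenated, as the abstract describes. This is an illustrative assumption rather than the authors' implementation (see the linked repository for that): the class names, expert counts, and the simple soft per-token routing used for both branches are placeholders, and the paper's task-wise branch additionally routes at the task level with a contrastive objective that this sketch omits.

```python
# Hypothetical sketch of a T2MIR-style layer: two parallel MoE branches
# replace the transformer FFN, and their outputs are concatenated.
# Names and routing details are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftMoE(nn.Module):
    """Soft-routing mixture of experts applied over the last dimension."""

    def __init__(self, dim: int, hidden: int, out_dim: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, out_dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = F.softmax(self.router(x), dim=-1)                        # (..., E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)   # (..., out_dim, E)
        return torch.einsum("...de,...e->...d", expert_out, gates)


class T2MIRLikeLayer(nn.Module):
    """Parallel token-wise and task-wise MoE branches; outputs concatenated."""

    def __init__(self, dim: int, num_token_experts: int = 4, num_task_experts: int = 4):
        super().__init__()
        half = dim // 2
        self.token_moe = SoftMoE(dim, 4 * dim, half, num_token_experts)
        self.task_moe = SoftMoE(dim, 4 * dim, dim - half, num_task_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) embeddings of interleaved state/action/reward tokens.
        # In the paper, the task-wise router uses a task-level representation trained
        # with a contrastive loss; here both branches route per token for brevity.
        return torch.cat([self.token_moe(x), self.task_moe(x)], dim=-1)


if __name__ == "__main__":
    layer = T2MIRLikeLayer(dim=64)
    tokens = torch.randn(2, 10, 64)   # toy batch of context tokens
    print(layer(tokens).shape)        # torch.Size([2, 10, 64])
```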
Related papers
- MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z) - Dynamic Mixture of Curriculum LoRA Experts for Continual Multimodal Instruction Tuning [45.019751165506946]
Continual multimodal instruction tuning is crucial for adapting Multimodal Large Language Models (MLLMs) to evolving tasks. We propose a novel Dynamic Mixture of Curriculum LoRA Experts (D-MoLE) method, which automatically evolves MLLM's architecture with controlled parameter budgets to continually adapt to new tasks. Specifically, we propose a dynamic layer-wise expert allocator, which automatically allocates LoRA experts across layers to resolve architecture conflicts. Then, we propose a gradient-based inter-modal continual curriculum, which adjusts the update ratio of each module in MLLM based on the difficulty of each
arXiv Detail & Related papers (2025-06-13T11:03:46Z) - Resolving Task Objective Conflicts in Unified Multimodal Understanding and Generation via Task-Aware Mixture-of-Experts [11.307588007047407]
Multimodal large language models (MLLMs) integrate both understanding and generation tasks within a single framework. Intrinsic Task Objective Conflicts between high-level semantic abstraction in understanding and fine-grained detail preservation in generation pose significant challenges. We propose a novel approach that decouples internal components of AR to resolve task objective conflicts.
arXiv Detail & Related papers (2025-06-04T05:44:21Z) - M2IV: Towards Efficient and Fine-grained Multimodal In-Context Learning in Large Vision-Language Models [11.542439154523647]
We propose M2IV, a method that substitutes explicit demonstrations with learnable vectors directly integrated into LVLMs. M2IV achieves robust cross-modal fidelity and fine-grained semantic distillation through training. Experiments show that M2IV surpasses Vanilla ICL and prior representation engineering approaches.
arXiv Detail & Related papers (2025-04-06T22:02:21Z) - LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant [63.28378110792787]
We introduce LamRA, a versatile framework designed to empower Large Multimodal Models with sophisticated retrieval and reranking capabilities. For retrieval, we adopt a two-stage training strategy comprising language-only pre-training and multimodal instruction tuning. For reranking, we employ joint training for both pointwise and listwise reranking, offering two distinct ways to further boost the retrieval performance.
arXiv Detail & Related papers (2024-12-02T17:10:16Z) - Few-Shot Joint Multimodal Entity-Relation Extraction via Knowledge-Enhanced Cross-modal Prompt Model [16.03304915788997]
Joint Multimodal Entity-Relation Extraction (JMERE) is a challenging task that aims to extract entities and their relations from text-image pairs in social media posts. Existing methods for JMERE require large amounts of labeled data. We introduce the Knowledge-Enhanced Cross-modal Prompt Model.
arXiv Detail & Related papers (2024-10-18T07:14:54Z) - M3-Jepa: Multimodal Alignment via Multi-directional MoE based on the JEPA framework [6.928469290518152]
M3-Jepa is a scalable multimodal alignment framework with a predictor implemented by a multi-directional mixture of experts. We show that M3-Jepa can obtain state-of-the-art performance on different modalities and tasks, generalize to unseen datasets and domains, and is computationally efficient in training and inference.
arXiv Detail & Related papers (2024-09-09T10:40:50Z) - NoteLLM-2: Multimodal Large Representation Models for Recommendation [71.87790090964734]
Large Language Models (LLMs) have demonstrated exceptional proficiency in text understanding and embedding tasks. Their potential in multimodal representation, particularly for item-to-item (I2I) recommendations, remains underexplored. We propose an end-to-end fine-tuning method that customizes the integration of any existing LLMs and vision encoders for efficient multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z) - T-REX: Mixture-of-Rank-One-Experts with Semantic-aware Intuition for Multi-task Large Language Model Finetuning [31.276142111455847]
Large language models (LLMs) encounter significant adaptation challenges in diverse multitask finetuning. We design a novel framework, mixTure-of-Rank-onE-eXperts (T-REX). Rank-1 experts enable a mix-and-match mechanism to quadratically expand the vector subspace of experts with linear parameter overheads, achieving approximate error reduction with optimal
arXiv Detail & Related papers (2024-04-13T12:14:58Z) - Multi-Task Reinforcement Learning with Mixture of Orthogonal Experts [20.926613438442782]
Multi-Task Reinforcement Learning (MTRL) tackles the problem of endowing agents with skills that generalize across a variety of problems.
To this end, sharing representations plays a fundamental role in capturing both unique and common characteristics of the tasks.
We introduce a novel approach for representation learning in MTRL that encapsulates common structures among the tasks using representations to promote diversity.
arXiv Detail & Related papers (2023-11-19T18:09:25Z) - Dual Semantic Knowledge Composed Multimodal Dialog Systems [114.52730430047589]
We propose a novel multimodal task-oriented dialog system named MDS-S2.
It acquires the context related attribute and relation knowledge from the knowledge base.
We also devise a set of latent query variables to distill the semantic information from the composed response representation.
arXiv Detail & Related papers (2023-05-17T06:33:26Z) - HiNet: Novel Multi-Scenario & Multi-Task Learning with Hierarchical Information Extraction [50.40732146978222]
Multi-scenario & multi-task learning has been widely applied to many recommendation systems in industrial applications.
We propose a Hierarchical information extraction Network (HiNet) for multi-scenario and multi-task recommendation.
HiNet achieves a new state-of-the-art performance and significantly outperforms existing solutions.
arXiv Detail & Related papers (2023-03-10T17:24:41Z) - Reparameterizing Convolutions for Incremental Multi-Task Learning without Task Interference [75.95287293847697]
Two common challenges in developing multi-task models are often overlooked in the literature.
First, enabling the model to be inherently incremental, continuously incorporating information from new tasks without forgetting the previously learned ones (incremental learning).
Second, eliminating adverse interactions amongst tasks, which has been shown to significantly degrade single-task performance in a multi-task setup (task interference).
arXiv Detail & Related papers (2020-07-24T14:44:46Z) - Meta Reinforcement Learning with Autonomous Inference of Subtask Dependencies [57.27944046925876]
We propose and address a novel few-shot RL problem, where a task is characterized by a subtask graph.
Instead of directly learning a meta-policy, we develop a Meta-learner with Subtask Graph Inference.
Our experiment results on two grid-world domains and StarCraft II environments show that the proposed method is able to accurately infer the latent task parameter.
arXiv Detail & Related papers (2020-01-01T17:34:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.