Related papers: R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts

URL: http://arxiv.org/abs/2502.20395v2
Date: Sat, 01 Mar 2025 02:17:00 GMT
Title: R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts
Authors: Zhongyang Li, Ziyue Li, Tianyi Zhou
Abstract summary: In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) is usually not on par with the large language models (LLMs)' powerful reasoning capabilities. We propose a novel and efficient method "Re-Routing in Test-Time (R2-T2)" that locally optimizes the vector of routing weights in test-time. R2-T2 consistently and greatly improves state-of-the-art LMMs' performance on challenging benchmarks of diverse tasks, without training any base-model parameters.
Score: 21.119495676190127
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) is usually not on par with the large language models (LLMs)' powerful reasoning capabilities, deterring LMMs' performance on challenging downstream tasks. This weakness has been recently mitigated by replacing the vision encoder with a mixture-of-experts (MoE), which provides rich, multi-granularity, and diverse representations required by diverse downstream tasks. The performance of multimodal MoE largely depends on its router, which reweights and mixes the representations of different experts for each input. However, we find that the end-to-end trained router does not always produce the optimal routing weights for every test sample. To bridge the gap, we propose a novel and efficient method "Re-Routing in Test-Time (R2-T2)" that locally optimizes the vector of routing weights in test-time by moving it toward those vectors of the correctly predicted samples in a neighborhood of the test sample. We propose three R2-T2 strategies with different optimization objectives and neighbor-search spaces. R2-T2 consistently and greatly improves state-of-the-art LMMs' performance on challenging benchmarks of diverse tasks, without training any base-model parameters.
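
The core idea in the abstract (moving a test sample's routing-weight vector toward those of correctly predicted neighbors) can be illustrated with a minimal sketch. This is not the authors' implementation and does not correspond to any specific one of the three R2-T2 strategies; the function name `reroute`, the Euclidean neighbor search, the plain neighbor average, and the parameters `k` and `step` are all assumptions chosen for illustration.

```python
import numpy as np

def reroute(test_emb, test_weights, ref_embs, ref_weights, ref_correct, k=3, step=0.5):
    """Hypothetical sketch: nudge routing weights toward correctly predicted neighbors.

    test_emb:     (d,) embedding of the test sample
    test_weights: (e,) routing weights produced by the trained router
    ref_embs:     (n, d) embeddings of reference samples
    ref_weights:  (n, e) routing weights of reference samples
    ref_correct:  (n,) boolean mask, True where the model predicted correctly
    """
    # Keep only reference samples that the model answered correctly.
    embs = ref_embs[ref_correct]
    weights = ref_weights[ref_correct]
    if len(embs) == 0:
        return test_weights  # nothing to move toward

    # Assumption: neighbors are found by Euclidean distance in embedding space.
    dists = np.linalg.norm(embs - test_emb, axis=1)
    nearest = np.argsort(dists)[:k]

    # Interpolate between the router's weights and the neighbors' mean weights.
    target = weights[nearest].mean(axis=0)
    new_weights = (1.0 - step) * test_weights + step * target

    # Renormalize so the result is still a valid mixture over experts.
    return new_weights / new_weights.sum()
```

With `step=0` the router's original weights are returned unchanged, and with `step=1` they are replaced by the neighbor average; the base model's parameters are never touched, which matches the training-free setting the abstract describes.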

Related papers

SERM: Self-Evolving Relevance Model with Agent-Driven Learning from Massive Query Streams [53.78257200138774]
We propose a Self-Evolving Relevance Model approach (SERM), which comprises two complementary multi-agent modules. We evaluate SERM in a large-scale industrial setting, which serves billions of user requests daily.
arXiv Detail & Related papers (2026-01-14T14:31:16Z)
Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLMs [24.791817951102487]
We show that aligning the manifold of routing weights with that of task embedding can effectively reduce the gap. In experiments, we finetune routers in OLMoE, DeepSeekMoE, and Qwen3-MoE using RoMA.
arXiv Detail & Related papers (2025-11-10T18:59:53Z)
Tiny-R1V: Lightweight Multimodal Unified Reasoning Model via Model Merging [34.0419616643477]
Tiny-R1V is a novel lightweight 3B model that achieves faster inference and higher accuracy via a two-stage optimization. In the first stage, Tiny-R1V introduces Length-Informed Relative Policy Optimization (LIPO), a novel reinforcement learning method. In the second stage, we propose Adaptive Model Merging (AMM), a training-free model merging method.
arXiv Detail & Related papers (2025-10-10T04:14:57Z)
Recurrence Meets Transformers for Universal Multimodal Retrieval [59.92546492752452]
ReT-2 is a unified retrieval model that supports multimodal queries composed of both images and text. We evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different retrieval configurations. When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on Encyclopedic-VQA and InfoSeek datasets.
arXiv Detail & Related papers (2025-09-10T18:00:29Z)
Dual-Stream Attention with Multi-Modal Queries for Object Detection in Transportation Applications [6.603505460200282]
Transformer-based object detectors often struggle with occlusions, fine-grained localization, and computational inefficiency caused by fixed queries and dense attention. We propose DAMM, Dual-stream Attention with Multi-Modal queries, a novel framework introducing both query adaptation and structured cross-attention for improved accuracy and efficiency.
arXiv Detail & Related papers (2025-08-06T20:37:24Z)
Skip a Layer or Loop it? Test-Time Depth Adaptation of Pretrained LLMs [21.541258368039955]
We find that layers of a pretrained large language model (LLM) can be manipulated as separate modules to build a better and even shallower model customized for each test sample. In particular, each layer from the pretrained model can be skipped/pruned or repeated multiple times as recurrent neural networks (RNN), and stacked with others in arbitrary orders, yielding a chain-of-layers (CoLa) per sample.
arXiv Detail & Related papers (2025-07-10T17:59:53Z)
Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs). It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z)
MoRE: A Mixture of Low-Rank Experts for Adaptive Multi-Task Learning [18.0412262027514]
We propose a novel Mixture of Low-Rank Experts (MoRE) for multi-task learning. Instead of using an individual LoRA for each task, we align different ranks of LoRA module with different tasks. We also design a novel adaptive rank selector to select the appropriate expert for each task.
arXiv Detail & Related papers (2025-05-28T12:32:09Z)
Reinforced Model Merging [53.84354455400038]
We present an innovative framework termed Reinforced Model Merging (RMM), which encompasses an environment and agent tailored for merging tasks. By utilizing data subsets during the evaluation process, we addressed the bottleneck in the reward feedback phase, thereby accelerating RMM by up to 100 times.
arXiv Detail & Related papers (2025-03-27T08:52:41Z)
Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts [56.30364248231053]
This paper introduces Multi-Modal Retrieval-Augmented Generation (M2RAG). M2RAG is a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models (MLLMs). To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT).
arXiv Detail & Related papers (2025-02-24T16:25:25Z)
ORI: O Routing Intelligence [0.7493096930372414]
Single large language models (LLMs) often fall short when faced with the ever-growing range of tasks. We propose ORI (O Routing Intelligence), a dynamic framework that leverages a set of LLMs. By intelligently routing queries, ORI outperforms the strongest individual models by up to 2.7 points on MMLU and 1.8 points on MuSR.
arXiv Detail & Related papers (2025-02-14T10:00:20Z)
Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning [51.54046200512198]
Retrieval-augmented generation (RAG) is extensively utilized to incorporate external, current knowledge into large language models. A standard RAG pipeline may comprise several components, such as query rewriting, document retrieval, document filtering, and answer generation. To overcome these challenges, we propose treating the RAG pipeline as a multi-agent cooperative task, with each component regarded as an RL agent.
arXiv Detail & Related papers (2025-01-25T14:24:50Z)
R-MTLLMF: Resilient Multi-Task Large Language Model Fusion at the Wireless Edge [78.26352952957909]
Multi-task large language models (MTLLMs) are important for many applications at the wireless edge, where users demand specialized models to handle multiple tasks efficiently. The concept of model fusion via task vectors has emerged as an efficient approach for combining fine-tuning parameters to produce an MTLLM. In this paper, the problem of enabling edge users to collaboratively craft such MTLLMs via task vectors is studied, under the assumption of worst-case adversarial attacks.
arXiv Detail & Related papers (2024-11-27T10:57:06Z)
Visual Reasoning and Multi-Agent Approach in Multimodal Large Language Models (MLLMs): Solving TSP and mTSP Combinatorial Challenges [5.934258790280767]
Multimodal Large Language Models (MLLMs) harness comprehensive knowledge spanning text, images, and audio to adeptly tackle complex problems. This study explores the ability of MLLMs in visually solving the Traveling Salesman Problem (TSP) and Multiple Traveling Salesman Problem (mTSP). We introduce a novel approach employing multiple specialized agents within the MLLM framework, each dedicated to optimizing solutions for these challenges.
arXiv Detail & Related papers (2024-06-26T07:12:06Z)
Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control [66.78146440275093]
Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors. We explore the application of LSR to the multi-modal domain, with a focus on text-image retrieval. Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets. Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors.
arXiv Detail & Related papers (2024-02-27T14:21:56Z)
Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks. We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs. Our SPViT can trim 52.0% FLOPs for DeiT-B and get an impressive 0.6% top-1 accuracy gain simultaneously.
arXiv Detail & Related papers (2021-11-23T11:35:54Z)
Multi-Model Least Squares-Based Recomputation Framework for Large Data Analysis [0.0]
In complex tasks such as handling the ImageNet dataset, there are often many more clues that can be directly encoded. This serves as the motivation to retrain the latent space representations to learn some clues that unsupervised learning has not yet learned. In this paper, a recomputation-based multilayer network using MP inverse (RML-MP) is developed.
arXiv Detail & Related papers (2021-01-04T23:01:30Z)
MRDet: A Multi-Head Network for Accurate Oriented Object Detection in Aerial Images [51.227489316673484]
We propose an arbitrary-oriented region proposal network (AO-RPN) to generate oriented proposals transformed from horizontal anchors. To obtain accurate bounding boxes, we decouple the detection task into multiple subtasks and propose a multi-head network. Each head is specially designed to learn the features optimal for the corresponding task, which allows our network to detect objects accurately.
arXiv Detail & Related papers (2020-12-24T06:36:48Z)
Improving Multispectral Pedestrian Detection by Addressing Modality Imbalance Problems [12.806496583571858]
Multispectral pedestrian detection can adapt to insufficient illumination conditions by leveraging color-thermal modalities. Compared with traditional pedestrian detection, we find multispectral pedestrian detection suffers from modality imbalance problems. We propose Modality Balance Network (MBNet) which facilitates the optimization process in a much more flexible and balanced manner.
arXiv Detail & Related papers (2020-08-07T08:58:46Z)

This list is automatically generated from the titles and abstracts of the papers on this site.