Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything
- URL: http://arxiv.org/abs/2511.02834v2
- Date: Wed, 05 Nov 2025 05:50:54 GMT
- Title: Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything
- Authors: Huawei Lin, Yunzhi Shi, Tong Geng, Weijie Zhao, Wei Wang, Ravender Pal Singh
- Abstract summary: Multimodal large language models (MLLMs) have shown strong capabilities but remain limited to fixed modality pairs. We propose an Agent-Omni framework that coordinates existing foundation models through a master-agent system.
- Score: 12.274140974616747
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal large language models (MLLMs) have shown strong capabilities but remain limited to fixed modality pairs and require costly fine-tuning with large aligned datasets. Building fully omni-capable models that can integrate text, images, audio, and video remains impractical, and such models lack robust reasoning support. In this paper, we propose an Agent-Omni framework that coordinates existing foundation models through a master-agent system, enabling flexible multimodal reasoning without retraining. The master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into coherent responses. Extensive experiments across text, image, audio, video, and omni benchmarks show that Agent-Omni consistently achieves state-of-the-art performance, particularly on tasks requiring complex cross-modal reasoning. Its agent-based design enables seamless integration of specialized foundation models, ensuring adaptability to diverse inputs while maintaining transparency and interpretability. In addition, the framework is modular and easily extensible, allowing future improvements as stronger models become available.
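The coordination pattern in the abstract (interpret intent, delegate to modality-specific agents, integrate their outputs) is easy to picture in code. Below is a minimal sketch, assuming hypothetical agent stubs and a toy planner; it is not the paper's implementation, and a real system would back both the master and the sub-agents with foundation models.

```python
# Minimal sketch of a master-agent coordination loop in the spirit of
# Agent-Omni. All names here are hypothetical illustrations, not the
# paper's actual API; a real system would back each agent with a
# foundation model rather than a stub.
from typing import Callable, Dict, List, Tuple

class MasterAgent:
    def __init__(self) -> None:
        # Registry of modality-specific agents: modality -> callable.
        self.agents: Dict[str, Callable[[str], str]] = {}

    def register(self, modality: str, agent: Callable[[str], str]) -> None:
        self.agents[modality] = agent

    def plan(self, query: str, modalities: List[str]) -> List[Tuple[str, str]]:
        # Stand-in for LLM-based intent interpretation: one subtask
        # per modality present in the input.
        return [(m, f"Answer '{query}' using the {m} input") for m in modalities]

    def answer(self, query: str, modalities: List[str]) -> str:
        # Delegate subtasks to the registered agents, then integrate
        # the partial answers (stand-in for LLM-based synthesis).
        partials = [self.agents[m](task) for m, task in self.plan(query, modalities)]
        return " ".join(partials)

master = MasterAgent()
master.register("image", lambda task: "[image agent] a dog on a beach.")
master.register("audio", lambda task: "[audio agent] waves and barking.")
print(master.answer("What is happening in this clip?", ["image", "audio"]))
```

The explicit registry is what makes the design modular: a stronger image or audio model drops in by re-registering a single callable, with no retraining of the rest of the system.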
Related papers
- AR-MOT: Autoregressive Multi-object Tracking [56.09738000988466]
We propose a novel autoregressive paradigm that formulates MOT as a sequence generation task within a large language model (LLM) framework.
This design enables the model to output structured results through flexible sequence construction, without requiring any task-specific heads.
To enhance region-level visual perception, we introduce an Object Tokenizer based on a pretrained detector.
arXiv Detail & Related papers (2026-01-05T09:17:28Z)
- Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models [78.73992315826035]
We introduce Youtu-LLM, a lightweight language model that harmonizes high computational efficiency with native agentic intelligence.
Youtu-LLM is pre-trained from scratch to systematically cultivate reasoning and planning capabilities.
arXiv Detail & Related papers (2025-12-31T04:25:11Z)
- NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching [64.10695425442164]
We introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms.
Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks.
To advance further research, we release training details and data protocols, and open-source both the code and model checkpoints.
arXiv Detail & Related papers (2025-10-15T16:25:18Z)
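Discrete flow matching is named but not defined in the summary above. For orientation, the formula below is the common mixture probability path from the discrete flow matching literature; NExT-OMNI's exact parameterization may differ.

```latex
% Mixture probability path for discrete flow matching (illustrative,
% literature form; not necessarily NExT-OMNI's parameterization).
% Token i flows from a source distribution p_0 (e.g., all mass on a
% mask symbol) at t = 0 to the data token x_1^i at t = 1:
p_t\left(x^i \mid x_1\right)
  = (1 - \kappa_t)\, p_0\left(x^i\right)
  + \kappa_t\, \delta_{x_1^i}\left(x^i\right),
\qquad \kappa_0 = 0,\quad \kappa_1 = 1.
```

A network trained to predict the clean token x_1 from the partially noised sequence x_t then induces a rate over discrete states, and generation simulates that rate from t = 0 to t = 1.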
- Agent-ScanKit: Unraveling Memory and Reasoning of Multimodal Agents via Sensitivity Perturbations [34.15274225730601]
We propose Agent-ScanKit to unravel the memory and reasoning capabilities of multimodal agents under controlled perturbations.
Results show that mechanical memorization often outweighs systematic reasoning.
Our findings underscore the necessity of robust reasoning modeling for multimodal agents in real-world scenarios.
arXiv Detail & Related papers (2025-10-01T04:29:39Z)
- Sample-efficient Integration of New Modalities into Large Language Models [48.81776019848246]
Multimodal foundation models can process several modalities.
We introduce SEMI, a method for sample-efficient modality integration into Large Language Models.
We find that SEMI achieves a significant boost in sample efficiency during few-shot integration of new modalities.
arXiv Detail & Related papers (2025-09-04T18:41:59Z)
- A Unified Multi-Agent Framework for Universal Multimodal Understanding and Generation [8.021435739965982]
We propose a modular framework that unifies multimodal understanding and generation via two decoupled phases: Cognition and Deliberation.
In Cognition, three role-conditioned multimodal LLM agents - Perceiver, Planner, and Reflector - engage in collaborative dialogue to perform structured understanding and planning.
The Deliberation phase incorporates a Growth-Aware Search mechanism that orchestrates LLM-based reasoning and diffusion-based generation in a mutually reinforcing manner.
arXiv Detail & Related papers (2025-08-14T09:52:51Z)
- MultiSHAP: A Shapley-Based Framework for Explaining Cross-Modal Interactions in Multimodal AI Models [5.011371514152517]
Multimodal AI models have achieved impressive performance in tasks that require integrating information from multiple modalities, such as vision and language.
How to explain cross-modal interactions in multimodal AI models remains a major challenge.
arXiv Detail & Related papers (2025-08-01T12:19:18Z)
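The Shapley machinery behind a framework like MultiSHAP can be made concrete with a tiny exact computation. The sketch below scores the interaction between two modalities by enumerating coalitions; the value function `v` (the model's score with only the given modalities unmasked) and the toy numbers are assumptions for illustration, and the paper's estimator and masking scheme may differ.

```python
# Exact Shapley interaction index between two "players" (modalities).
# Illustrative of the general machinery behind Shapley-based explainers
# such as MultiSHAP; not the paper's actual estimator.
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet, Sequence

def shapley_interaction(
    v: Callable[[FrozenSet[str]], float],  # score with given modalities unmasked
    players: Sequence[str],
    i: str,
    j: str,
) -> float:
    n = len(players)
    rest = [p for p in players if p not in (i, j)]
    total = 0.0
    for k in range(len(rest) + 1):
        for subset in combinations(rest, k):
            s = frozenset(subset)
            # Shapley interaction weight for a coalition of size k.
            w = factorial(k) * factorial(n - k - 2) / factorial(n - 1)
            # Discrete second difference: synergy of i and j on top of s.
            total += w * (v(s | {i, j}) - v(s | {i}) - v(s | {j}) + v(s))
    return total

# Toy value function: image and text each help alone, help more together.
scores = {
    frozenset(): 0.0,
    frozenset({"image"}): 0.3,
    frozenset({"text"}): 0.4,
    frozenset({"image", "text"}): 0.9,
}
print(shapley_interaction(lambda s: scores[s], ["image", "text"], "image", "text"))
```

On these toy values the joint score exceeds the sum of the individual gains, so the interaction index comes out positive (0.2), which is the kind of cross-modal synergy such a framework is meant to surface.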
- MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation [80.08991479306681]
MEXA is a training-free framework that performs modality- and task-aware aggregation of expert models.
We evaluate our approach on diverse multimodal benchmarks, including Video Reasoning, Audio Reasoning, 3D Understanding, and Medical QA.
arXiv Detail & Related papers (2025-06-20T16:14:13Z)
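"Modality- and task-aware aggregation" suggests a simple two-stage shape: select the experts whose declared capabilities match the query, then fuse their outputs. The sketch below is a guess at that shape with hypothetical expert metadata and a string-joining stand-in for the LLM aggregator; it is not MEXA's actual selection rule or API.

```python
# Sketch of training-free, modality- and task-aware expert aggregation
# in the spirit of MEXA. Expert metadata, the selection rule, and the
# aggregator are illustrative assumptions, not the paper's design.
from dataclasses import dataclass
from typing import Callable, List, Set

@dataclass
class Expert:
    name: str
    modalities: Set[str]       # inputs the expert can read
    skills: Set[str]           # tasks it handles, e.g. {"video_qa"}
    run: Callable[[str], str]  # expert inference on the query

def select(experts: List[Expert], modalities: Set[str], task: str) -> List[Expert]:
    # Keep experts whose required inputs are available and whose
    # declared skills cover the task.
    return [e for e in experts if e.modalities <= modalities and task in e.skills]

def aggregate(outputs: List[str]) -> str:
    # Stand-in for an LLM reasoner that reconciles expert evidence.
    return " | ".join(outputs)

experts = [
    Expert("captioner", {"image"}, {"video_qa"}, lambda q: "two people talking"),
    Expert("asr", {"audio"}, {"video_qa"}, lambda q: "transcript: 'see you at noon'"),
]
picked = select(experts, {"image", "audio"}, "video_qa")
print(aggregate([e.run("What was agreed?") for e in picked]))
```

Because selection and aggregation are both inference-time operations over fixed experts, nothing in this loop requires gradient updates, which is what "training-free" amounts to.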
- Rethinking Information Synthesis in Multimodal Question Answering: A Multi-Agent Perspective [42.832839189236694]
We propose MAMMQA, a multi-agent QA framework for multimodal inputs spanning text, tables, and images.
Our system includes two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent.
Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi-agent framework consistently outperforms existing baselines in both accuracy and robustness.
arXiv Detail & Related papers (2025-05-27T07:23:38Z)
- OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously.
Our evaluation reveals that open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts.
We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance.
arXiv Detail & Related papers (2024-09-23T17:59:05Z)
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation [61.392147185793476]
We present a unified and versatile foundation model, SEED-X.
SEED-X is able to model multi-granularity visual semantics for comprehension and generation tasks.
We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications.
arXiv Detail & Related papers (2024-04-22T17:56:09Z) - Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
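UmURL's single-stream early fusion is concrete enough to sketch: project each modality's features to a shared width, tag them with a learned modality embedding, concatenate, and run one encoder over the joint sequence. The dimensions and module choices below are illustrative PyTorch assumptions, not the paper's configuration.

```python
# Sketch of single-stream early fusion as described for UmURL:
# per-modality features are projected to a shared width, tagged with a
# learned modality embedding, concatenated, and encoded by ONE
# transformer. Sizes and modules are illustrative, not the paper's.
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    def __init__(self, dims, d_model=256, n_layers=2):
        super().__init__()
        # One projection per modality into the shared embedding space.
        self.proj = nn.ModuleList(nn.Linear(d, d_model) for d in dims)
        # Learned per-modality tag added to every token of that stream.
        self.mod_emb = nn.Parameter(torch.zeros(len(dims), d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, streams):  # streams[i]: (batch, seq_i, dims[i])
        tokens = [p(x) + self.mod_emb[i]
                  for i, (p, x) in enumerate(zip(self.proj, streams))]
        # Early fusion: concatenate along the sequence axis, then run a
        # single encoder over the joint stream.
        return self.encoder(torch.cat(tokens, dim=1))

enc = EarlyFusionEncoder(dims=[75, 75])  # e.g., two hypothetical skeleton streams
out = enc([torch.randn(2, 20, 75), torch.randn(2, 20, 75)])
print(out.shape)  # torch.Size([2, 40, 256])
```

One reading of the efficiency claim is visible here: a single encoder processes the concatenated sequence, in place of one encoder per modality plus a separate fusion module.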