A Unified Multi-Agent Framework for Universal Multimodal Understanding and Generation
- URL: http://arxiv.org/abs/2508.10494v1
- Date: Thu, 14 Aug 2025 09:52:51 GMT
- Title: A Unified Multi-Agent Framework for Universal Multimodal Understanding and Generation
- Authors: Jiulin Li, Ping Huang, Yexin Li, Shuo Chen, Juewen Hu, Ye Tian,
- Abstract summary: We propose a modular framework that unifies multimodal understanding and generation via two decoupled phases: Cognition and Deliberation. In Cognition, three role-conditioned multimodal LLM agents - Perceiver, Planner, and Reflector - engage in collaborative dialogue to perform structured understanding and planning. The Deliberation phase incorporates a Growth-Aware Search mechanism that orchestrates LLM-based reasoning and diffusion-based generation in a mutually reinforcing manner.
- Score: 8.021435739965982
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-world multimodal applications often require any-to-any capabilities, enabling both understanding and generation across modalities including text, image, audio, and video. However, integrating the strengths of autoregressive language models (LLMs) for reasoning and diffusion models for high-fidelity generation remains challenging. Existing approaches rely on rigid pipelines or tightly coupled architectures, limiting flexibility and scalability. We propose MAGUS (Multi-Agent Guided Unified Multimodal System), a modular framework that unifies multimodal understanding and generation via two decoupled phases: Cognition and Deliberation. MAGUS enables symbolic multi-agent collaboration within a shared textual workspace. In the Cognition phase, three role-conditioned multimodal LLM agents - Perceiver, Planner, and Reflector - engage in collaborative dialogue to perform structured understanding and planning. The Deliberation phase incorporates a Growth-Aware Search mechanism that orchestrates LLM-based reasoning and diffusion-based generation in a mutually reinforcing manner. MAGUS supports plug-and-play extensibility, scalable any-to-any modality conversion, and semantic alignment - all without the need for joint training. Experiments across multiple benchmarks, including image, video, and audio generation, as well as cross-modal instruction following, demonstrate that MAGUS outperforms strong baselines and state-of-the-art systems. Notably, on the MME benchmark, MAGUS surpasses the powerful closed-source model GPT-4o.
Related papers
- UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark [72.37370242707432]
This paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities. We also introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence.
arXiv Detail & Related papers (2026-03-05T11:45:16Z)
- Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything [12.274140974616747]
Multimodal large language models (MLLMs) have shown strong capabilities but remain limited to fixed modality pairs. We propose an Agent-Omni framework that coordinates existing foundation models through a master-agent system.
arXiv Detail & Related papers (2025-11-04T18:59:09Z)
- NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching [64.10695425442164]
We introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks. To advance further research, we release training details, data protocols, and open-source both the code and model checkpoints.
arXiv Detail & Related papers (2025-09-30T17:59:46Z)
- Query-Kontext: An Unified Multimodal Model for Image Generation and Editing [53.765351127477224]
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I). We introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal "kontext" composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.
arXiv Detail & Related papers (2025-09-28T09:11:30Z)
- UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception [54.53657134205492]
UniAlignment is a unified multimodal generation framework within a single diffusion transformer. It incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model's cross-modal consistency and instruction-following robustness. We present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions.
arXiv Detail & Related papers (2025-07-11T08:45:27Z)
- Multi-modal Mutual-Guidance Conditional Prompt Learning for Vision-Language Models [21.20658517302458]
MuGCP (Multi-modal Mutual-Guidance Conditional Prompt Learning) is a novel paradigm designed for conditional prompt generation. AMG module generates Visual Conditional Prompts (VCP), enhancing the model's performance in multi-modal tasks. MPF mechanism integrates SCP and VCP with contextual prompts, ensuring seamless coordination.
arXiv Detail & Related papers (2025-04-08T17:58:47Z)
- Transfer between Modalities with MetaQueries [44.57406292414526]
We introduce MetaQueries, a set of learnable queries that act as an efficient interface between autoregressive multimodal LLMs and diffusion models. Our method simplifies training, requiring only paired image-caption data and standard diffusion objectives. Our method is flexible and can be easily instruction-tuned for advanced applications such as image editing and subject-driven generation.
arXiv Detail & Related papers (2025-03-09T10:15:39Z)
- ARMOR: Empowering Multimodal Understanding Model with Interleaved Multimodal Generation Capability [14.703591553247948]
ARMOR is a resource-efficient and pure autoregressive framework for multimodal large language models. It achieves both understanding and generation by fine-tuning existing MLLMs. We show that ARMOR upgrades existing MLLMs to UniMs with promising image generation capabilities, using limited training resources.
arXiv Detail & Related papers (2024-10-26T13:19:57Z)
- LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z) - Multi-modal Generative AI: Multi-modal LLMs, Diffusions and the Unification [41.88402339122694]
Multi-modal generative AI (Artificial Intelligence) has attracted increasing attention from both academia and industry. This paper provides a comprehensive overview of multi-modal generative AI, including multi-modal LLMs, diffusions, and the unification for understanding and generation.
arXiv Detail & Related papers (2024-02-08T18:27:22Z)
- CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning. We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy. We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.