BiTAgent: A Task-Aware Modular Framework for Bidirectional Coupling between Multimodal Large Language Models and World Models
- URL: http://arxiv.org/abs/2512.04513v1
- Date: Thu, 04 Dec 2025 06:49:50 GMT
- Title: BiTAgent: A Task-Aware Modular Framework for Bidirectional Coupling between Multimodal Large Language Models and World Models
- Authors: Yu-Wei Zhan, Xin Wang, Pengzhe Mao, Tongtong Feng, Ren Wang, Wenwu Zhu
- Abstract summary: BiTAgent is a task-aware dynamic joint framework that enables bidirectional coupling between MLLMs and WMs. Extensive experiments across multi-task and cross-environment settings demonstrate superior stability and generalization over state-of-the-art baselines.
- Score: 29.69542501690896
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building generalist embodied agents requires a unified system that can interpret multimodal goals, model environment dynamics, and execute reliable actions across diverse real-world tasks. Multimodal large language models (MLLMs) offer strong semantic priors and cross-modal generalization, while world models (WMs) provide actionable latent dynamics for prediction and control. Their combination holds promise for open-ended embodied intelligence, yet introduces two key challenges: (1) establishing a tight coupling between the semantic intent from MLLMs and the dynamic state representations within the WM's latent space, and (2) achieving task-aware adaptability that supports multi-task learning and cross-environment generalization. To address these limitations, we propose BiTAgent, a task-aware dynamic joint framework that enables bidirectional coupling between MLLMs and WMs. BiTAgent establishes two complementary pathways: a forward path that injects MLLM representations into the WM's latent space for semantically guided imagination, and a backward path where WM-generated feedback refines the MLLM's semantic space via dense text-conditioned rewards. This bidirectional interaction is realized through three synergistic components: Task-Aware Dynamic Joint Learning, Task-Aware Behavior Learning, and MLLM-WM Joint Optimization, which together harmonize semantic reasoning and dynamic prediction. Extensive experiments across multi-task and cross-environment settings demonstrate superior stability and generalization over state-of-the-art baselines, marking a step toward open-ended embodied learning.
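The abstract's two pathways lend themselves to a compact sketch. The following is a minimal, hypothetical PyTorch rendering, not BiTAgent's published design: `ForwardInjection` maps an MLLM goal embedding into the world model's latent space to guide imagined rollouts, and `dense_text_reward` scores each imagined step against the goal as a dense text-conditioned reward. All names, shapes, and the cosine-similarity reward form are assumptions.

```python
# Hypothetical sketch of the bidirectional MLLM-WM coupling described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForwardInjection(nn.Module):
    """Forward path: inject MLLM semantics into world-model imagination."""
    def __init__(self, d_sem: int = 768, d_latent: int = 256):
        super().__init__()
        self.to_latent = nn.Linear(d_sem, d_latent)     # semantic -> WM latent space
        self.dynamics = nn.GRUCell(d_latent, d_latent)  # stand-in for the WM's latent dynamics

    def imagine(self, goal_emb: torch.Tensor, z0: torch.Tensor, horizon: int) -> torch.Tensor:
        """goal_emb: (B, d_sem) MLLM goal embedding; z0: (B, d_latent) initial latent."""
        g = self.to_latent(goal_emb)
        z, steps = z0, []
        for _ in range(horizon):
            z = self.dynamics(g, z)        # semantically guided latent step
            steps.append(z)
        return torch.stack(steps, dim=1)   # (B, horizon, d_latent)

def dense_text_reward(traj: torch.Tensor, goal_latent: torch.Tensor) -> torch.Tensor:
    """Backward path: per-step agreement between imagined latents and the goal,
    usable as a dense text-conditioned reward for refining the MLLM."""
    return F.cosine_similarity(traj, goal_latent.unsqueeze(1), dim=-1)  # (B, horizon)

# Toy usage: imagine 5 steps, score them against the goal.
wm = ForwardInjection()
goal = torch.randn(2, 768)
traj = wm.imagine(goal, torch.zeros(2, 256), horizon=5)
rewards = dense_text_reward(traj, wm.to_latent(goal))
```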
Related papers
- Revisiting Multi-Task Visual Representation Learning [52.93947931352643]
We introduce MTV, a principled multi-task visual pretraining framework. We leverage high-capacity "expert" models to synthesize dense, structured pseudo-labels at scale. Our results demonstrate that MTV achieves "best-of-both-worlds" performance.
arXiv Detail & Related papers (2026-01-20T11:59:19Z)
- From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion [91.35078719566472]
Vision-Language Models (VLMs) typically create a severe visual feature bottleneck by using a crude, asymmetric one-to-one connection. We introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities (a minimal sketch follows this entry).
arXiv Detail & Related papers (2026-01-15T18:59:10Z)
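Reading between the lines of the summary above, a dynamic many-to-many bridge could be sketched as the following hypothetical PyTorch module, which gates every vision layer's features into one language layer; the class name, gating scheme, and mean-pooling are assumptions, not CLI's actual architecture.

```python
# Hypothetical many-to-many cross-layer bridge (illustrative, not CLI's design).
import torch
import torch.nn as nn

class ManyToManyInjection(nn.Module):
    """Every vision layer can feed a given language layer,
    weighted by an input-dependent gate."""
    def __init__(self, n_vis_layers: int, d_vis: int, d_lang: int):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d_vis, d_lang) for _ in range(n_vis_layers))
        self.gate = nn.Linear(d_lang, n_vis_layers)  # gate conditioned on the text state

    def forward(self, vis_feats: list, lang_hidden: torch.Tensor) -> torch.Tensor:
        # vis_feats: per-layer visual features, each (B, N_vis, d_vis)
        # lang_hidden: (B, N_txt, d_lang) hidden state at one LLM layer
        w = torch.softmax(self.gate(lang_hidden.mean(dim=1)), dim=-1)  # (B, n_vis_layers)
        injected = torch.zeros_like(lang_hidden[:, :1, :])             # (B, 1, d_lang)
        for i, f in enumerate(vis_feats):
            pooled = self.proj[i](f).mean(dim=1, keepdim=True)         # (B, 1, d_lang)
            injected = injected + w[:, i, None, None] * pooled
        return lang_hidden + injected  # residual add, broadcast over text tokens
```

Instantiating one such module per language layer yields the many-to-many pattern; the usual one-to-one connector is the degenerate case of a single vision layer feeding a single language layer.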
- UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment [22.51114099598294]
Image-based virtual try-on (VTON) aims to synthesize photorealistic images of a person wearing specified garments. UniFit is a universal VTON framework driven by a Multimodal Large Language Model (MLLM). UniFit not only supports a wide range of VTON tasks, including multi-garment and model-to-model try-on, but also achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-11-19T19:38:44Z)
- OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment [79.98946571424607]
We present OmniBridge, a framework that supports vision-language understanding, generation, and retrieval within a unified architecture. To address the challenge of task interference, we propose a two-stage decoupled training strategy. Experiments demonstrate that OmniBridge achieves competitive or state-of-the-art performance on all three tasks.
arXiv Detail & Related papers (2025-09-23T13:57:55Z)
- Training-Free Multimodal Large Language Model Orchestration [16.211979950149928]
We report on an effective approach for creating interactive multimodal AI systems without additional training. Our framework is built upon three key innovations: (1) a central controller that analyzes user inputs, (2) a parallel Text-to-Speech architecture, and (3) crossmodal memory integration.
arXiv Detail & Related papers (2025-08-06T16:17:29Z)
- Multi-modal Mutual-Guidance Conditional Prompt Learning for Vision-Language Models [21.20658517302458]
MuGCP (Multi-modal Mutual-Guidance Conditional Prompt Learning) is a novel paradigm designed for conditional prompt generation. The AMG module generates Visual Conditional Prompts (VCP), enhancing the model's performance in multi-modal tasks. The MPF mechanism integrates SCP and VCP with contextual prompts, ensuring seamless coordination.
arXiv Detail & Related papers (2025-07-11T08:45:27Z)
- Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System [8.88014241557266]
Heterogeneous multi-robot systems show great potential in complex tasks requiring coordinated hybrid cooperation. Existing methods that rely on static or task-specific models often lack generalizability across diverse tasks and dynamic environments. We propose a hierarchical multimodal framework that integrates a prompted large language model (LLM) with a fine-tuned vision-language model (VLM).
arXiv Detail & Related papers (2025-06-05T13:27:41Z)
- Mixture-of-Experts Meets In-Context Reinforcement Learning [49.19791753312034]
In-context reinforcement learning (ICRL) has emerged as a promising paradigm for adapting RL agents to downstream tasks. We propose T2MIR, an innovative framework that introduces architectural advances of mixture-of-experts (MoE) into transformer-based decision models (a minimal MoE sketch follows this entry). We show that T2MIR significantly enhances in-context learning capacity and outperforms various baselines.
arXiv Detail & Related papers (2025-06-05T06:29:14Z)
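To make the architectural ingredient named above concrete, here is a hedged sketch of a token-routed mixture-of-experts feed-forward layer of the kind that could be slotted into a transformer decision model; all sizes, names, and the top-k routing are assumptions rather than T2MIR's actual design.

```python
# Minimal token-level MoE feed-forward layer (illustrative assumptions throughout).
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    def __init__(self, d_model: int = 256, d_hidden: int = 512,
                 n_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # learned token router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); route each token to its top-k experts
        logits = self.router(x)                            # (B, S, E)
        weights, idx = logits.topk(self.top_k, dim=-1)     # (B, S, k)
        weights = torch.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)                  # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```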
- Unified Generative and Discriminative Training for Multi-modal Large Language Models [88.84491005030316]
Generative training has enabled Vision-Language Models (VLMs) to tackle various complex tasks.
Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval.
This paper proposes a unified approach that integrates the strengths of both paradigms (a minimal combined-loss sketch follows this entry).
arXiv Detail & Related papers (2024-11-01T01:51:31Z)
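The unification the paper describes can be illustrated with a hedged sketch that sums a generative captioning loss and a CLIP-style contrastive loss; the weighting scheme and function names here are assumptions, not the paper's formulation.

```python
# Sketch: one objective combining generative and discriminative training.
import torch
import torch.nn.functional as F

def unified_loss(token_logits, target_tokens, img_emb, txt_emb,
                 temperature=0.07, alpha=0.5):
    # Generative term: next-token cross-entropy over caption tokens.
    gen = F.cross_entropy(token_logits.flatten(0, 1), target_tokens.flatten())
    # Discriminative term: symmetric InfoNCE over matched image/text embeddings.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t() / temperature                # (B, B) similarity matrix
    labels = torch.arange(sim.size(0), device=sim.device)
    disc = 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))
    return alpha * gen + (1.0 - alpha) * disc        # assumed weighting
```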
- Learning to Learn with Contrastive Meta-Objective [48.27877062976768]
We propose to exploit task identity as additional supervision in meta-training. The proposed ConML evaluates and optimizes this contrastive meta-objective (sketched below). We demonstrate that ConML integrates seamlessly with existing meta-learners, as well as with in-context learning models.
arXiv Detail & Related papers (2024-10-08T12:22:10Z)
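The summary does not give the objective's exact form, but one plausible reading, contrasting learner representations by task identity, can be sketched as follows; treating adapted-learner embeddings as the contrasted representations is an assumption.

```python
# Sketch of a task-identity-supervised contrastive meta-objective: pull together
# representations from the SAME task, push apart those from DIFFERENT tasks.
import torch
import torch.nn.functional as F

def contrastive_meta_loss(task_reprs: torch.Tensor, task_ids: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    # task_reprs: (N, d) embeddings of adapted learner states
    # task_ids:   (N,) integer task identity for each embedding
    z = F.normalize(task_reprs, dim=-1)
    sim = z @ z.t() / temperature                       # (N, N)
    n = sim.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=sim.device)
    pos = (task_ids[:, None] == task_ids[None, :]) & ~eye
    # Supervised-contrastive form: all same-task pairs are positives.
    exp_sim = sim.exp().masked_fill(eye, 0.0)
    log_prob = sim - exp_sim.sum(dim=1, keepdim=True).log()
    loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss.mean()
```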
- LangSuitE: Planning, Controlling and Interacting with Large Language Models in Embodied Text Environments [70.91258869156353]
We introduce LangSuitE, a versatile and simulation-free testbed featuring 6 representative embodied tasks in textual embodied worlds.
Compared with previous LLM-based testbeds, LangSuitE offers adaptability to diverse environments without multiple simulation engines.
We devise a novel chain-of-thought (CoT) schema, EmMem, which summarizes embodied states w.r.t. history information (a small prompt-construction sketch follows this entry).
arXiv Detail & Related papers (2024-06-24T03:36:29Z)
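As a loose illustration of the EmMem idea (the actual schema is not given in the summary above), a history-summarizing prompt builder might look like this; every field name is hypothetical.

```python
# Hypothetical EmMem-style prompt: compress interaction history into a state memo
# before asking the LLM for the next action. Field names are illustrative only.
def build_emmem_prompt(task: str, history: list, state_memo: str) -> str:
    recent = "\n".join(f"Act: {a} -> Obs: {o}" for a, o in history[-5:])
    return (
        f"Task: {task}\n"
        f"State memo (summary of everything observed so far):\n{state_memo}\n"
        f"Recent steps:\n{recent}\n"
        "Think step by step, update the state memo, then output the next action."
    )
```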