BiTAgent: A Task-Aware Modular Framework for Bidirectional Coupling between Multimodal Large Language Models and World Models
- URL: http://arxiv.org/abs/2512.04513v1
- Date: Thu, 04 Dec 2025 06:49:50 GMT
- Title: BiTAgent: A Task-Aware Modular Framework for Bidirectional Coupling between Multimodal Large Language Models and World Models
- Authors: Yu-Wei Zhan, Xin Wang, Pengzhe Mao, Tongtong Feng, Ren Wang, Wenwu Zhu
- Abstract summary: BiTAgent is a task-aware dynamic joint framework that enables bidirectional coupling between MLLMs and WMs. Extensive experiments across multi-task and cross-environment settings demonstrate superior stability and generalization over state-of-the-art baselines.
- Score: 29.69542501690896
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building generalist embodied agents requires a unified system that can interpret multimodal goals, model environment dynamics, and execute reliable actions across diverse real-world tasks. Multimodal large language models (MLLMs) offer strong semantic priors and cross-modal generalization, while world models (WMs) provide actionable latent dynamics for prediction and control. Their combination holds promise for open-ended embodied intelligence, yet introduces two key challenges: (1) establishing a tight coupling between the semantic intent from MLLMs and the dynamic state representations within the WM's latent space, and (2) achieving task-aware adaptability that supports multi-task learning and cross-environment generalization. To address these limitations, we propose BiTAgent, a task-aware dynamic joint framework that enables bidirectional coupling between MLLMs and WMs. BiTAgent establishes two complementary pathways: a forward path that injects MLLM representations into the WM's latent space for semantically guided imagination, and a backward path where WM-generated feedback refines the MLLM's semantic space via dense text-conditioned rewards. This bidirectional interaction is realized through three synergistic components: Task-Aware Dynamic Joint Learning, Task-Aware Behavior Learning, and MLLM-WM Joint Optimization, which together harmonize semantic reasoning and dynamic prediction. Extensive experiments across multi-task and cross-environment settings demonstrate superior stability and generalization over state-of-the-art baselines, marking a step toward open-ended embodied learning.
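The abstract's two pathways lend themselves to a compact sketch. The following is a minimal, hypothetical PyTorch rendering, not BiTAgent's published design: `ForwardInjection` maps an MLLM goal embedding into the world model's latent space to guide imagined rollouts, and `dense_text_reward` scores each imagined step against the goal as a dense text-conditioned reward. All names, shapes, and the cosine-similarity reward form are assumptions.

```python
# Hypothetical sketch of the bidirectional MLLM-WM coupling described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForwardInjection(nn.Module):
    """Forward path: inject MLLM semantics into world-model imagination."""
    def __init__(self, d_sem: int = 768, d_latent: int = 256):
        super().__init__()
        self.to_latent = nn.Linear(d_sem, d_latent)     # semantic -> WM latent space
        self.dynamics = nn.GRUCell(d_latent, d_latent)  # stand-in for the WM's latent dynamics

    def imagine(self, goal_emb: torch.Tensor, z0: torch.Tensor, horizon: int) -> torch.Tensor:
        """goal_emb: (B, d_sem) MLLM goal embedding; z0: (B, d_latent) initial latent."""
        g = self.to_latent(goal_emb)
        z, steps = z0, []
        for _ in range(horizon):
            z = self.dynamics(g, z)        # semantically guided latent step
            steps.append(z)
        return torch.stack(steps, dim=1)   # (B, horizon, d_latent)

def dense_text_reward(traj: torch.Tensor, goal_latent: torch.Tensor) -> torch.Tensor:
    """Backward path: per-step agreement between imagined latents and the goal,
    usable as a dense text-conditioned reward for refining the MLLM."""
    return F.cosine_similarity(traj, goal_latent.unsqueeze(1), dim=-1)  # (B, horizon)

# Toy usage: imagine 5 steps, score them against the goal.
wm = ForwardInjection()
goal = torch.randn(2, 768)
traj = wm.imagine(goal, torch.zeros(2, 256), horizon=5)
rewards = dense_text_reward(traj, wm.to_latent(goal))
```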
Related papers
- Revisiting Multi-Task Visual Representation Learning [52.93947931352643]
We introduce MTV, a principled multi-task visual pretraining framework. We leverage high-capacity "expert" models to synthesize dense, structured pseudo-labels at scale. Our results demonstrate that MTV achieves "best-of-both-worlds" performance.
arXiv Detail & Related papers (2026-01-20T11:59:19Z)
- From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion [91.35078719566472]
Vision-Language Models (VLMs) typically create a severe visual feature bottleneck by using a crude, asymmetric one-to-one connection. We introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities (a minimal sketch follows this entry).
arXiv Detail & Related papers (2026-01-15T18:59:10Z)
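Reading between the lines of the summary above, a dynamic many-to-many bridge could be sketched as the following hypothetical PyTorch module, which gates every vision layer's features into one language layer; the class name, gating scheme, and mean-pooling are assumptions, not CLI's actual architecture.

```python
# Hypothetical many-to-many cross-layer bridge (illustrative, not CLI's design).
import torch
import torch.nn as nn

class ManyToManyInjection(nn.Module):
    """Every vision layer can feed a given language layer,
    weighted by an input-dependent gate."""
    def __init__(self, n_vis_layers: int, d_vis: int, d_lang: int):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d_vis, d_lang) for _ in range(n_vis_layers))
        self.gate = nn.Linear(d_lang, n_vis_layers)  # gate conditioned on the text state

    def forward(self, vis_feats: list, lang_hidden: torch.Tensor) -> torch.Tensor:
        # vis_feats: per-layer visual features, each (B, N_vis, d_vis)
        # lang_hidden: (B, N_txt, d_lang) hidden state at one LLM layer
        w = torch.softmax(self.gate(lang_hidden.mean(dim=1)), dim=-1)  # (B, n_vis_layers)
        injected = torch.zeros_like(lang_hidden[:, :1, :])             # (B, 1, d_lang)
        for i, f in enumerate(vis_feats):
            pooled = self.proj[i](f).mean(dim=1, keepdim=True)         # (B, 1, d_lang)
            injected = injected + w[:, i, None, None] * pooled
        return lang_hidden + injected  # residual add, broadcast over text tokens
```

Instantiating one such module per language layer yields the many-to-many pattern; the usual one-to-one connector is the degenerate case of a single vision layer feeding a single language layer.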
- UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment [22.51114099598294]
Image-based virtual try-on (VTON) aims to synthesize photorealistic images of a person wearing specified garments. UniFit is a universal VTON framework driven by a Multimodal Large Language Model (MLLM). UniFit not only supports a wide range of VTON tasks, including multi-garment and model-to-model try-on, but also achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-11-19T19:38:44Z)
- OmniBridge: Unified Multimodal Understanding, Generation, and Retrieval via Latent Space Alignment [79.98946571424607]
We present OmniBridge, a framework that supports vision-language understanding, generation, and retrieval within a unified architecture. To address the challenge of task interference, we propose a two-stage decoupled training strategy. Experiments demonstrate that OmniBridge achieves competitive or state-of-the-art performance on all three tasks.
arXiv Detail & Related papers (2025-09-23T13:57:55Z)
- Training-Free Multimodal Large Language Model Orchestration [16.211979950149928]
We report on an effective approach for creating interactive multimodal AI systems without additional training. Our framework is built upon three key innovations: (1) a central controller that analyzes user inputs, (2) a parallel Text-to-Speech architecture, and (3) crossmodal memory integration.
arXiv Detail & Related papers (2025-08-06T16:17:29Z)
- Multi-modal Mutual-Guidance Conditional Prompt Learning for Vision-Language Models [21.20658517302458]
MuGCP (Multi-modal Mutual-Guidance Conditional Prompt Learning) is a novel paradigm designed for conditional prompt generation. The AMG module generates Visual Conditional Prompts (VCP), enhancing the model's performance in multi-modal tasks. The MPF mechanism integrates SCP and VCP with contextual prompts, ensuring seamless coordination.
arXiv Detail & Related papers (2025-07-11T08:45:27Z)
- Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System [8.88014241557266]
Heterogeneous multi-robot systems show great potential in complex tasks requiring coordinated hybrid cooperation. Existing methods that rely on static or task-specific models often lack generalizability across diverse tasks and dynamic environments. We propose a hierarchical multimodal framework that integrates a prompted large language model (LLM) with a fine-tuned vision-language model (VLM).
arXiv Detail & Related papers (2025-06-05T13:27:41Z)
- Mixture-of-Experts Meets In-Context Reinforcement Learning [49.19791753312034]
In-context reinforcement learning (ICRL) has emerged as a promising paradigm for adapting RL agents to downstream tasks. We propose T2MIR, an innovative framework that introduces architectural advances of mixture-of-experts (MoE) into transformer-based decision models (a minimal MoE sketch follows this entry). We show that T2MIR significantly enhances in-context learning capacity and outperforms various baselines.
arXiv Detail & Related papers (2025-06-05T06:29:14Z)
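To make the architectural ingredient named above concrete, here is a hedged sketch of a token-routed mixture-of-experts feed-forward layer of the kind that could be slotted into a transformer decision model; all sizes, names, and the top-k routing are assumptions rather than T2MIR's actual design.

```python
# Minimal token-level MoE feed-forward layer (illustrative assumptions throughout).
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    def __init__(self, d_model: int = 256, d_hidden: int = 512,
                 n_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # learned token router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); route each token to its top-k experts
        logits = self.router(x)                            # (B, S, E)
        weights, idx = logits.topk(self.top_k, dim=-1)     # (B, S, k)
        weights = torch.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)                  # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```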
- Unified Generative and Discriminative Training for Multi-modal Large Language Models [88.84491005030316]
Generative training has enabled Vision-Language Models (VLMs) to tackle various complex tasks.
Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval.
This paper proposes a unified approach that integrates the strengths of both paradigms (a minimal combined-loss sketch follows this entry).
arXiv Detail & Related papers (2024-11-01T01:51:31Z)
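The unification the paper describes can be illustrated with a hedged sketch that sums a generative captioning loss and a CLIP-style contrastive loss; the weighting scheme and function names here are assumptions, not the paper's formulation.

```python
# Sketch: one objective combining generative and discriminative training.
import torch
import torch.nn.functional as F

def unified_loss(token_logits, target_tokens, img_emb, txt_emb,
                 temperature=0.07, alpha=0.5):
    # Generative term: next-token cross-entropy over caption tokens.
    gen = F.cross_entropy(token_logits.flatten(0, 1), target_tokens.flatten())
    # Discriminative term: symmetric InfoNCE over matched image/text embeddings.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t() / temperature                # (B, B) similarity matrix
    labels = torch.arange(sim.size(0), device=sim.device)
    disc = 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))
    return alpha * gen + (1.0 - alpha) * disc        # assumed weighting
```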
- Learning to Learn with Contrastive Meta-Objective [48.27877062976768]
We propose to exploit task identity as additional supervision in meta-training. The proposed ConML evaluates and optimizes this contrastive meta-objective (sketched below). We demonstrate that ConML integrates seamlessly with existing meta-learners, as well as with in-context learning models.
arXiv Detail & Related papers (2024-10-08T12:22:10Z)
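The summary does not give the objective's exact form, but one plausible reading, contrasting learner representations by task identity, can be sketched as follows; treating adapted-learner embeddings as the contrasted representations is an assumption.

```python
# Sketch of a task-identity-supervised contrastive meta-objective: pull together
# representations from the SAME task, push apart those from DIFFERENT tasks.
import torch
import torch.nn.functional as F

def contrastive_meta_loss(task_reprs: torch.Tensor, task_ids: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    # task_reprs: (N, d) embeddings of adapted learner states
    # task_ids:   (N,) integer task identity for each embedding
    z = F.normalize(task_reprs, dim=-1)
    sim = z @ z.t() / temperature                       # (N, N)
    n = sim.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=sim.device)
    pos = (task_ids[:, None] == task_ids[None, :]) & ~eye
    # Supervised-contrastive form: all same-task pairs are positives.
    exp_sim = sim.exp().masked_fill(eye, 0.0)
    log_prob = sim - exp_sim.sum(dim=1, keepdim=True).log()
    loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss.mean()
```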
- LangSuitE: Planning, Controlling and Interacting with Large Language Models in Embodied Text Environments [70.91258869156353]
We introduce LangSuitE, a versatile and simulation-free testbed featuring 6 representative embodied tasks in textual embodied worlds.
Compared with previous LLM-based testbeds, LangSuitE offers adaptability to diverse environments without multiple simulation engines.
We devise a novel chain-of-thought (CoT) schema, EmMem, which summarizes embodied states w.r.t. history information (a small prompt-construction sketch follows this entry).
arXiv Detail & Related papers (2024-06-24T03:36:29Z)
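As a loose illustration of the EmMem idea (the actual schema is not given in the summary above), a history-summarizing prompt builder might look like this; every field name is hypothetical.

```python
# Hypothetical EmMem-style prompt: compress interaction history into a state memo
# before asking the LLM for the next action. Field names are illustrative only.
def build_emmem_prompt(task: str, history: list, state_memo: str) -> str:
    recent = "\n".join(f"Act: {a} -> Obs: {o}" for a, o in history[-5:])
    return (
        f"Task: {task}\n"
        f"State memo (summary of everything observed so far):\n{state_memo}\n"
        f"Recent steps:\n{recent}\n"
        "Think step by step, update the state memo, then output the next action."
    )
```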