WorldGPT: Empowering LLM as Multimodal World Model
- URL: http://arxiv.org/abs/2404.18202v2
- Date: Sat, 28 Sep 2024 17:00:44 GMT
- Title: WorldGPT: Empowering LLM as Multimodal World Model
- Authors: Zhiqi Ge, Hongzhe Huang, Mingze Zhou, Juncheng Li, Guoming Wang, Siliang Tang, Yueting Zhuang
- Abstract summary: We introduce WorldGPT, a generalist world model built upon a Multimodal Large Language Model (MLLM).
WorldGPT acquires an understanding of world dynamics through analyzing millions of videos across various domains.
We conduct evaluations on WorldNet, a multimodal state transition prediction benchmark.
- Score: 51.243464216500975
- Abstract: World models are progressively being employed across diverse fields, extending from basic environment simulation to complex scenario construction. However, existing models are mainly trained on domain-specific states and actions, and confined to single-modality state representations. In this paper, we introduce WorldGPT, a generalist world model built upon a Multimodal Large Language Model (MLLM). WorldGPT acquires an understanding of world dynamics by analyzing millions of videos across various domains. To further enhance WorldGPT's capability in specialized scenarios and long-term tasks, we have integrated it with a novel cognitive architecture that combines memory offloading, knowledge retrieval, and context reflection. For evaluation, we build WorldNet, a multimodal state transition prediction benchmark encompassing varied real-life scenarios. Evaluations on WorldNet directly demonstrate WorldGPT's capability to accurately model state transition patterns, affirming its effectiveness in understanding and predicting the dynamics of complex scenarios. We further explore WorldGPT's emerging potential as a world simulator, helping multimodal agents generalize to unfamiliar domains by efficiently synthesizing multimodal instruction instances, which prove to be as reliable as authentic data for fine-tuning purposes. The project is available at \url{https://github.com/DCDmllm/WorldGPT}.
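To make the state transition prediction task concrete, below is a minimal sketch of how such a world model could be queried: the current multimodal state plus an action is mapped to a predicted next state. The `State` class, `predict_next_state` function, and `world_model.generate` call are illustrative assumptions, not the released WorldGPT API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class State:
    """A multimodal snapshot of the environment."""
    text: str                                 # language description of the scene
    image_paths: Optional[List[str]] = None   # optional visual observations


def predict_next_state(world_model, state: State, action: str) -> State:
    """Ask a multimodal world model which state follows `action`.

    `world_model.generate` is assumed to take a text prompt plus optional
    images and return a textual prediction of the resulting state.
    """
    prompt = (
        f"Current state: {state.text}\n"
        f"Action: {action}\n"
        "Describe the next state."
    )
    prediction = world_model.generate(prompt, images=state.image_paths or [])
    return State(text=prediction)
```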
Related papers
- One-shot World Models Using a Transformer Trained on a Synthetic Prior [37.027893127637036]
The One-Shot World Model (OSWM) is a transformer world model learned in-context from purely synthetic data.
OSWM is able to quickly adapt to the dynamics of a simple grid world, as well as the CartPole gym and a custom control environment.
arXiv Detail & Related papers (2024-09-21T09:39:32Z)
- Making Large Language Models into World Models with Precondition and Effect Knowledge [1.8561812622368763]
We show that Large Language Models (LLMs) can be induced to perform two critical world model functions.
We validate that the precondition and effect knowledge generated by our models aligns with human understanding of world dynamics.
arXiv Detail & Related papers (2024-09-18T19:28:04Z)
- A Practitioner's Guide to Continual Multimodal Pretraining [83.63894495064855]
Multimodal foundation models serve numerous applications at the intersection of vision and language.
To keep models updated, research into continual pretraining mainly explores scenarios with either infrequent, indiscriminate updates on large-scale new data, or frequent, sample-level updates.
We introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with realistic compute constraints and practical deployment requirements.
arXiv Detail & Related papers (2024-08-26T17:59:01Z) - LangSuitE: Planning, Controlling and Interacting with Large Language Models in Embodied Text Environments [70.91258869156353]
We introduce LangSuitE, a versatile and simulation-free testbed featuring 6 representative embodied tasks in textual embodied worlds.
Compared with previous LLM-based testbeds, LangSuitE offers adaptability to diverse environments without multiple simulation engines.
We devise a novel chain-of-thought (CoT) schema, EmMem, which summarizes embodied states with respect to history information.
arXiv Detail & Related papers (2024-06-24T03:36:29Z) - GroundingGPT:Language Enhanced Multi-modal Grounding Model [15.44099961048236]
We propose GroundingGPT, a language enhanced multi-modal grounding model.
Our proposed model excels at tasks demanding a detailed understanding of local information within the input.
It demonstrates precise identification and localization of specific regions in images or moments in videos.
arXiv Detail & Related papers (2024-01-11T17:41:57Z)
- Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z)
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
- Deep Multimodal Fusion for Generalizable Person Re-identification [15.250738959921872]
DMF is a Deep Multimodal Fusion network for general scenarios of the person re-identification task.
Rich semantic knowledge is introduced to assist in feature representation learning during the pre-training stage.
A realistic dataset is adopted to fine-tune the pre-trained model for distribution alignment with real-world scenarios.
arXiv Detail & Related papers (2022-11-02T07:42:48Z)
- DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations [119.1953397679783]
We focus on advancing the state-of-the-art in interpreting multimodal models.
Our proposed approach, DIME, enables accurate and fine-grained analysis of multimodal models.
arXiv Detail & Related papers (2022-03-03T20:52:47Z)