DMWM: Dual-Mind World Model with Long-Term Imagination
- URL: http://arxiv.org/abs/2502.07591v2
- Date: Thu, 23 Oct 2025 05:56:53 GMT
- Title: DMWM: Dual-Mind World Model with Long-Term Imagination
- Authors: Lingyi Wang, Rashed Shelim, Walid Saad, Naren Ramakrishnan
- Abstract summary: We propose a novel dual-mind world model (DMWM) framework that integrates logical reasoning to enable imagination with logical consistency. The proposed framework is evaluated on benchmark tasks that require long-term planning from the DMControl suite.
- Score: 43.39205414684229
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imagination in world models is crucial for enabling agents to learn long-horizon policy in a sample-efficient manner. Existing recurrent state-space model (RSSM)-based world models depend on single-step statistical inference to capture the environment dynamics, and, hence, they are unable to perform long-term imagination tasks due to the accumulation of prediction errors. Inspired by the dual-process theory of human cognition, we propose a novel dual-mind world model (DMWM) framework that integrates logical reasoning to enable imagination with logical consistency. DMWM is composed of two components: an RSSM-based System 1 (RSSM-S1) component that handles state transitions in an intuitive manner and a logic-integrated neural network-based System 2 (LINN-S2) component that guides the imagination process through hierarchical deep logical reasoning. The inter-system feedback mechanism is designed to ensure that the imagination process follows the logical rules of the real environment. The proposed framework is evaluated on benchmark tasks that require long-term planning from the DMControl suite. Extensive experimental results demonstrate that the proposed framework yields significant improvements in terms of logical coherence, trial efficiency, data efficiency and long-term imagination over the state-of-the-art world models.
Related papers
- DriveMamba: Task-Centric Scalable State Space Model for Efficient End-to-End Autonomous Driving [47.573692944838115]
DriveMamba is a Task-Centric Scalable paradigm for efficient E2E-AD. It integrates sequential task relation modeling, implicit correspondence learning and long-term temporal fusion into a single-stage Unified Mamba decoder. Extensive experiments conducted on nuScenes and Bench2Drive datasets demonstrate the superiority, generalizability and great efficiency of DriveMamba.
arXiv Detail & Related papers (2026-02-09T11:48:29Z) - Dual Mind World Model Inspired Network Digital Twin for Access Scheduling [0.904861150954008]
We present a novel Digital Twin-enabled scheduling framework inspired by Dual Mind World Model (DMWM) architecture. Unlike conventional rule-based or purely data-driven policies, the proposed DMWM combines short-horizon predictive planning with symbolic model-based rollout. Our results show that DMWM achieves superior performance in bursty, interference-limited, and deadline-sensitive environments.
arXiv Detail & Related papers (2026-02-04T13:53:55Z) - Causal World Modeling for Robot Control [56.31803788587547]
Video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics. We introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously. We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations.
arXiv Detail & Related papers (2026-01-29T17:07:43Z) - MambaTAD: When State-Space Models Meet Long-Range Temporal Action Detection [94.12444452690329]
This paper presents MambaTAD, a new state-space TAD model that introduces long-range modeling and global feature detection capabilities. MambaTAD achieves superior TAD performance consistently across multiple public benchmarks.
arXiv Detail & Related papers (2025-11-22T06:04:29Z) - Dual-Mind World Models: A General Framework for Learning in Dynamic Wireless Networks [43.39205414684229]
This paper proposes a novel dual-mind world model-based learning framework for mmWave V2X networks. Inspired by cognitive psychology, the proposed dual-mind world model encompasses a pattern-driven System 1 component and a logic-driven System 2 component. Simulation results show that the proposed world model achieves a significant improvement in data efficiency.
arXiv Detail & Related papers (2025-10-28T15:45:15Z) - Unified World Models: Memory-Augmented Planning and Foresight for Visual Navigation [69.94565127141483]
Current approaches separate navigation planning from visual world modeling, leading to state-action misalignment and limited adaptability. We propose UniWM, a unified, memory-augmented world model integrating egocentric visual foresight and planning within a single multimodal autoregressive backbone. We show that UniWM substantially improves navigation success rates by up to 30%, significantly reduces trajectory errors compared to strong baselines, and exhibits impressive zero-shot generalization on the unseen TartanDrive dataset.
arXiv Detail & Related papers (2025-10-09T18:18:11Z) - BrainMT: A Hybrid Mamba-Transformer Architecture for Modeling Long-Range Dependencies in Functional MRI Data [0.09363323206192666]
Recent advances in deep learning have made it possible to predict phenotypic measures directly from functional magnetic resonance imaging (fMRI) brain volumes. We introduce BrainMT, a novel hybrid framework designed to efficiently learn and integrate long-range temporal attributes in fMRI data. Our framework operates in two stages: (1) a bidirectional Mamba block with a temporal-first scanning mechanism to capture global temporal interactions in a computationally efficient manner; and (2) a transformer block leveraging self-attention to model global spatial relationships.
arXiv Detail & Related papers (2025-06-27T19:20:41Z) - Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation [54.3628937181904]
Internal world models (WMs) enable agents to understand the world's state and predict transitions. Recent large Vision-Language Models (VLMs), such as OpenAI o3, GPT-4o and Gemini, exhibit potential as general-purpose WMs.
arXiv Detail & Related papers (2025-06-27T03:24:29Z) - Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs). It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z) - Incentivizing Multimodal Reasoning in Large Models for Direct Robot Manipulation [89.5123417007126]
We show how to make Large Multimodal Models (LMMs) understand the spatial action space. We also show how to fully exploit the reasoning capacity of LMMs in solving these tasks. Our resulting reasoning model built upon a 7B backbone, named ReasonManip, demonstrates three notable advantages.
arXiv Detail & Related papers (2025-05-19T06:00:14Z) - World Model-Based Learning for Long-Term Age of Information Minimization in Vehicular Networks [53.98633183204453]
In this paper, a novel world model-based learning framework is proposed to minimize packet-completeness-aware age of information (CAoI) in a vehicular network. A world model framework is proposed to jointly learn a dynamic model of the mmWave V2X environment and use it to imagine trajectories for learning how to perform link scheduling. In particular, the long-term policy is learned in differentiable imagined trajectories instead of environment interactions.
arXiv Detail & Related papers (2025-05-03T06:23:18Z) - Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding.
It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions.
Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT).
Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z) - Visualizing Thought: Conceptual Diagrams Enable Robust Planning in LMMs [57.66267515456075]
Large Language Models (LLMs) and Large Multimodal Models (LMMs) predominantly reason through textual representations.
We propose a zero-shot fully automatic framework that enables LMMs to reason through multiple chains of self-generated conceptual diagrams.
arXiv Detail & Related papers (2025-03-14T18:27:02Z) - Zero-Shot Interactive Text-to-Image Retrieval via Diffusion-Augmented Representations [7.439049772394586]
Diffusion Augmented Retrieval (DAR) is a paradigm-shifting framework that bypasses MLLM finetuning entirely. DAR synergizes Large Language Model (LLM)-guided query refinement with Diffusion Model (DM)-based visual synthesis to create contextually enriched intermediate representations.
arXiv Detail & Related papers (2025-01-26T03:29:18Z) - InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions [104.90258030688256]
This project introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input. This project simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.
arXiv Detail & Related papers (2024-12-12T18:58:30Z) - Binarized Diffusion Model for Image Super-Resolution [61.963833405167875]
Binarization, an ultra-compression algorithm, offers the potential for effectively accelerating advanced diffusion models (DMs).
Existing binarization methods result in significant performance degradation.
We introduce a novel binarized diffusion model, BI-DiffSR, for image SR.
arXiv Detail & Related papers (2024-06-09T10:30:25Z) - Large Multi-Modal Models (LMMs) as Universal Foundation Models for AI-Native Wireless Systems [57.41621687431203]
Large language models (LLMs) and foundation models have been recently touted as a game-changer for 6G systems.
This paper presents a comprehensive vision on how to design universal foundation models tailored towards the deployment of artificial intelligence (AI)-native networks.
arXiv Detail & Related papers (2024-01-30T00:21:41Z) - A Biologically-Inspired Dual Stream World Model [0.456877715768796]
The medial temporal lobe (MTL) is hypothesized to be an experience-construction system in mammals.
We propose a novel variant, the Dual Stream World Model (DSWM), which learns from high-dimensional observations and dissociates them into context and content streams.
We show that this representation is useful as a reinforcement learning basis function, and that the generative model can be used to aid the policy learning process using Dyna-like updates.
arXiv Detail & Related papers (2022-09-16T16:27:48Z) - One-shot Visual Reasoning on RPMs with an Application to Video Frame Prediction [1.0932251830449902]
Raven's Progressive Matrices (RPMs) are frequently used in evaluating human visual reasoning ability.
We propose a One-shot Human-Understandable ReaSoner (Os-HURS) to tackle the challenges of real-world visual recognition and subsequent logical reasoning tasks.
arXiv Detail & Related papers (2021-11-24T06:51:38Z) - Improving Coherence and Consistency in Neural Sequence Models with Dual-System, Neuro-Symbolic Reasoning [49.6928533575956]
We use neural inference to mediate between the neural System 1 and the logical System 2.
Results in robust story generation and grounded instruction-following show that this approach can increase the coherence and accuracy of neurally-based generations.
arXiv Detail & Related papers (2021-07-06T17:59:49Z) - Relational State-Space Model for Stochastic Multi-Object Systems [24.234120525358456]
This paper introduces the relational state-space model (R-SSM), a sequential hierarchical latent variable model.
R-SSM makes use of graph neural networks (GNNs) to simulate the joint state transitions of multiple correlated objects.
The utility of R-SSM is empirically evaluated on synthetic and real time-series datasets.
arXiv Detail & Related papers (2020-01-13T03:45:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.