Related papers: DMWM: Dual-Mind World Model with Long-Term Imagination

Related papers

DriveMamba: Task-Centric Scalable State Space Model for Efficient End-to-End Autonomous Driving [47.573692944838115]
DriveMamba is a Task-Centric Scalable paradigm for efficient E2E-AD.<n>It integrates sequential task relation modeling, implicit correspondence learning and long-term temporal fusion into a single-stage Unified Mamba decoder.<n>Extensive experiments conducted on nuScenes and Bench2Drive datasets demonstrate the superiority, generalizability and great efficiency of DriveMamba.
arXiv Detail & Related papers (2026-02-09T11:48:29Z)
Dual Mind World Model Inspired Network Digital Twin for Access Scheduling [0.904861150954008]
We present a novel Digital Twin-enabled scheduling framework inspired by Dual Mind World Model (DMWM) architecture.<n>Unlike conventional rule-based or purely data-driven policies, the proposed DMWM combines short-horizon predictive planning with symbolic model-based rollout.<n>Our results show that DMWM achieves superior performance in bursty, interference-limited, and deadline-sensitive environments.
arXiv Detail & Related papers (2026-02-04T13:53:55Z)
Causal World Modeling for Robot Control [56.31803788587547]
Video world models provide the ability to imagine the near future by understanding the causality between actions and visual dynamics.<n>We introduce LingBot-VA, an autoregressive diffusion framework that learns frame prediction and policy execution simultaneously.<n>We evaluate our model on both simulation benchmarks and real-world scenarios, where it shows significant promise in long-horizon manipulation, data efficiency in post-training, and strong generalizability to novel configurations.
arXiv Detail & Related papers (2026-01-29T17:07:43Z)
MambaTAD: When State-Space Models Meet Long-Range Temporal Action Detection [94.12444452690329]
This paper presents MambaTAD, a new state-space TAD model that introduces long-range modeling and global feature detection capabilities.<n>MambaTAD achieves superior TAD performance consistently across multiple public benchmarks.
arXiv Detail & Related papers (2025-11-22T06:04:29Z)
Dual-Mind World Models: A General Framework for Learning in Dynamic Wireless Networks [43.39205414684229]
This paper proposes a novel dual-mind world model-based learning framework for mmWave V2X networks.<n>Inspired by cognitive psychology, the proposed dual-mind world model encompasses a pattern-driven System 1 component and a logic-driven System 2 component.<n> Simulation results show that the proposed world model achieves a significant improvement in data efficiency.
arXiv Detail & Related papers (2025-10-28T15:45:15Z)
Unified World Models: Memory-Augmented Planning and Foresight for Visual Navigation [69.94565127141483]
Current approaches separate navigation planning from visual world modeling, leading to state-action misalignment and limited adaptability.<n>We propose UniWM, a unified, memory-augmented world model integrating egocentric visual foresight and planning within a single multimodal autoregressive backbone.<n>We show that UniWM substantially improves navigation success rates by up to 30%, significantly reduces trajectory errors compared to strong baselines, and exhibits impressive zero-shot generalization on the unseen TartanDrive dataset.
arXiv Detail & Related papers (2025-10-09T18:18:11Z)
BrainMT: A Hybrid Mamba-Transformer Architecture for Modeling Long-Range Dependencies in Functional MRI Data [0.09363323206192666]
Recent advances in deep learning have made it possible to predict phenotypic measures directly from functional magnetic resonance imaging (fMRI) brain volumes.<n>We introduce BrainMT, a novel hybrid framework designed to efficiently learn and integrate long-rangetemporal attributes in fMRI data.<n>Our framework operates in two stages: (1) a bidirectional Mamba block with a temporal-first scanning mechanism to capture global temporal interactions in a computationally efficient manner; and (2) a transformer block leveraging self-attention to model global spatial relationships.
arXiv Detail & Related papers (2025-06-27T19:20:41Z)
Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation [54.3628937181904]
Internal world models (WMs) enable agents to understand the world's state and predict transitions.<n>Recent large Vision-Language Models (VLMs), such as OpenAI o3, GPT-4o and Gemini, exhibit potential as general-purpose WMs.
arXiv Detail & Related papers (2025-06-27T03:24:29Z)
Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs)<n>It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs.<n>It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z)
Incentivizing Multimodal Reasoning in Large Models for Direct Robot Manipulation [89.5123417007126]
We show how to make Large Multimodal Models (LMMs) understand the spatial action space.<n>We also show how to fully exploit the reasoning capacity of LMMs in solving these tasks.<n>Our resulting reasoning model built upon a 7B backbone, named ReasonManip, demonstrates three notable advantages.
arXiv Detail & Related papers (2025-05-19T06:00:14Z)
World Model-Based Learning for Long-Term Age of Information Minimization in Vehicular Networks [53.98633183204453]
In this paper, a novel world model-based learning framework is proposed to minimize packet-completeness-aware age of information (CAoI) in a vehicular network.<n>A world model framework is proposed to jointly learn a dynamic model of the mmWave V2X environment and use it to imagine trajectories for learning how to perform link scheduling.<n>In particular, the long-term policy is learned in differentiable imagined trajectories instead of environment interactions.
arXiv Detail & Related papers (2025-05-03T06:23:18Z)
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT) Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
Visualizing Thought: Conceptual Diagrams Enable Robust Planning in LMMs [57.66267515456075]
Large Language Models (LLMs) and Large Multimodal Models (LMMs) predominantly reason through textual representations. We propose a zero-shot fully automatic framework that enables LMMs to reason through multiple chains of self-generated conceptual diagrams.
arXiv Detail & Related papers (2025-03-14T18:27:02Z)
Zero-Shot Interactive Text-to-Image Retrieval via Diffusion-Augmented Representations [7.439049772394586]
Diffusion Augmented Retrieval (DAR) is a paradigm-shifting framework that bypasses MLLM finetuning entirely.<n>DAR synergizes Large Language Model (LLM)-guided query refinement with Diffusion Model (DM)-based visual synthesis to create contextually enriched intermediate representations.
arXiv Detail & Related papers (2025-01-26T03:29:18Z)
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions [104.90258030688256]
This project introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input.<n>This project simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.
arXiv Detail & Related papers (2024-12-12T18:58:30Z)
Binarized Diffusion Model for Image Super-Resolution [61.963833405167875]
Binarization, an ultra-compression algorithm, offers the potential for effectively accelerating advanced diffusion models (DMs) Existing binarization methods result in significant performance degradation. We introduce a novel binarized diffusion model, BI-DiffSR, for image SR.
arXiv Detail & Related papers (2024-06-09T10:30:25Z)
Large Multi-Modal Models (LMMs) as Universal Foundation Models for AI-Native Wireless Systems [57.41621687431203]
Large language models (LLMs) and foundation models have been recently touted as a game-changer for 6G systems. This paper presents a comprehensive vision on how to design universal foundation models tailored towards the deployment of artificial intelligence (AI)-native networks.
arXiv Detail & Related papers (2024-01-30T00:21:41Z)
A Biologically-Inspired Dual Stream World Model [0.456877715768796]
The medial temporal lobe (MTL) is hypothesized to be an experience-construction system in mammals. We propose a novel variant, the Dual Stream World Model (DSWM), which learns from high-dimensional observations and dissociates them into context and content streams. We show that this representation is useful as a reinforcement learning basis function, and that the generative model can be used to aid the policy learning process using Dyna-like updates.
arXiv Detail & Related papers (2022-09-16T16:27:48Z)
One-shot Visual Reasoning on RPMs with an Application to Video Frame Prediction [1.0932251830449902]
Raven's Progressive Matrices (RPMs) are frequently used in evaluating human's visual reasoning ability. We propose a One-shot Human-Understandable ReaSoner (Os-HURS) to tackle the challenges of real-world visual recognition and subsequent logical reasoning tasks.
arXiv Detail & Related papers (2021-11-24T06:51:38Z)
Improving Coherence and Consistency in Neural Sequence Models with Dual-System, Neuro-Symbolic Reasoning [49.6928533575956]
We use neural inference to mediate between the neural System 1 and the logical System 2. Results in robust story generation and grounded instruction-following show that this approach can increase the coherence and accuracy of neurally-based generations.
arXiv Detail & Related papers (2021-07-06T17:59:49Z)
Relational State-Space Model for Stochastic Multi-Object Systems [24.234120525358456]
This paper introduces the relational state-space model (R-SSM), a sequential hierarchical latent variable model. R-SSM makes use of graph neural networks (GNNs) to simulate the joint state transitions of multiple correlated objects. The utility of R-SSM is empirically evaluated on synthetic and real time-series datasets.
arXiv Detail & Related papers (2020-01-13T03:45:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.