Latent Particle World Models: Self-supervised Object-centric Stochastic Dynamics Modeling
- URL: http://arxiv.org/abs/2603.04553v1
- Date: Wed, 04 Mar 2026 19:36:08 GMT
- Title: Latent Particle World Models: Self-supervised Object-centric Stochastic Dynamics Modeling
- Authors: Tal Daniel, Carl Qi, Dan Haramati, Amir Zadeh, Chuan Li, Aviv Tamar, Deepak Pathak, David Held
- Abstract summary: We introduce Latent Particle World Model (LPWM), a self-supervised object-centric world model scaled to real-world multi-object datasets. LPWM autonomously discovers keypoints, bounding boxes, and object masks directly from video data. Our architecture is trained end-to-end purely from videos and supports flexible conditioning on actions, language, and image goals.
- Score: 51.40150411616207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Latent Particle World Model (LPWM), a self-supervised object-centric world model scaled to real-world multi-object datasets and applicable in decision-making. LPWM autonomously discovers keypoints, bounding boxes, and object masks directly from video data, enabling it to learn rich scene decompositions without supervision. Our architecture is trained end-to-end purely from videos and supports flexible conditioning on actions, language, and image goals. LPWM models stochastic particle dynamics via a novel latent action module and achieves state-of-the-art results on diverse real-world and synthetic datasets. Beyond stochastic video modeling, LPWM is readily applicable to decision-making, including goal-conditioned imitation learning, as we demonstrate in the paper. Code, data, pre-trained models and video rollouts are available: https://taldatech.github.io/lpwm-web
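The abstract describes particle-based latents advanced by a stochastic latent action module, but gives no implementation details. The following is a minimal, hypothetical PyTorch sketch of what one rollout step of such a model might look like: a set of latent particles (2-D keypoint plus appearance features) is advanced by a transformer conditioned on a sampled Gaussian latent action. All module names, shapes, and design choices are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class LatentParticleStep(nn.Module):
    """Illustrative sketch of one rollout step of a particle-based world
    model: K particles, each an (x, y) keypoint plus a feature vector, are
    advanced by a transformer conditioned on a stochastic latent action.
    Shapes and module choices are assumptions, not the LPWM code."""

    def __init__(self, num_particles=16, feat_dim=64, action_dim=8):
        super().__init__()
        self.particle_dim = 2 + feat_dim               # (x, y) + appearance
        # Latent action module: encodes a transition (t, t+1) into a
        # Gaussian latent action so rollouts can be stochastic.
        self.action_head = nn.Linear(2 * num_particles * self.particle_dim,
                                     2 * action_dim)
        self.action_proj = nn.Linear(action_dim, self.particle_dim)
        layer = nn.TransformerEncoderLayer(d_model=self.particle_dim,
                                           nhead=2, batch_first=True)
        self.dynamics = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, particles_t, particles_tp1=None):
        B = particles_t.shape[0]
        if particles_tp1 is None:                      # prior: dream rollout
            mu = torch.zeros(B, self.action_proj.in_features)
            logvar = torch.zeros_like(mu)
        else:                                          # posterior: from data
            pair = torch.cat([particles_t, particles_tp1], dim=1)
            mu, logvar = self.action_head(pair.flatten(1)).chunk(2, dim=-1)
        # Reparameterized sample of the latent action.
        action = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        # Broadcast the action as an extra token seen by every particle.
        tokens = torch.cat([self.action_proj(action)[:, None], particles_t],
                           dim=1)
        return self.dynamics(tokens)[:, 1:]            # predicted particles

model = LatentParticleStep()
p_t = torch.randn(4, 16, 66)                           # batch of particle sets
print(model(p_t).shape)                                # torch.Size([4, 16, 66])
```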
Related papers
- Factored Latent Action World Models [39.60866765151469]
Learning latent actions from action-free video has emerged as a powerful paradigm for scaling up controllable world model learning. This paper introduces Factored Latent Action Model (FLAM), a factored dynamics framework that decomposes the scene into independent factors.
arXiv Detail & Related papers (2026-02-18T07:08:14Z)
- A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures [58.26804959656713]
We present EB-JEPA, an open-source library for learning representations and world models using Joint-Embedding Predictive Architectures (JEPAs). JEPAs learn to predict in representation space rather than pixel space, avoiding the pitfalls of generative modeling; a generic sketch of this idea follows this entry. We show how these representations can drive action-conditioned world models, achieving a 97% planning success rate on the Two Rooms navigation task.
arXiv Detail & Related papers (2026-02-03T14:56:24Z)
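For orientation, here is a minimal, generic JEPA training step illustrating "predict in representation space": an online encoder and predictor regress onto targets produced by an EMA target encoder, so the loss never touches pixels. This is not taken from the EB-JEPA library; all names and hyperparameters are illustrative.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Generic JEPA sketch (illustrative, not the EB-JEPA API).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
predictor = nn.Linear(128, 128)
target_encoder = copy.deepcopy(encoder)
for p in target_encoder.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam([*encoder.parameters(), *predictor.parameters()],
                       lr=1e-3)

def jepa_step(context_view, target_view, tau=0.996):
    pred = predictor(encoder(context_view))
    with torch.no_grad():
        target = target_encoder(target_view)
    loss = F.mse_loss(pred, target)        # loss in representation space
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():                  # EMA update of the target encoder
        for p, tp in zip(encoder.parameters(), target_encoder.parameters()):
            tp.lerp_(p, 1 - tau)
    return loss.item()

x = torch.randn(8, 3, 32, 32)
print(jepa_step(x, x + 0.1 * torch.randn_like(x)))
```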
- Can World Models Benefit VLMs for World Dynamics? [59.73433292793044]
We investigate the capabilities that emerge when world model priors are transferred into Vision-Language Models. We name our best-performing variant Dynamic Vision Aligner (DyVA). We find that DyVA surpasses both open-source and proprietary baselines, achieving state-of-the-art or comparable performance.
arXiv Detail & Related papers (2025-10-01T13:07:05Z)
- MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation [18.468025471225527]
MoWM is a mixture-of-world-models framework that fuses representations from hybrid world models for embodied action planning. Our approach uses motion-aware representations from a latent model as a high-level prior, which guides the extraction of fine-grained visual features from the pixel-space model; a sketch of this modulation pattern follows this entry.
arXiv Detail & Related papers (2025-09-26T02:54:36Z)
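"Latent-to-pixel feature modulation" suggests a FiLM-style pattern, where a high-level latent feature produces per-channel scale and shift that gate pixel-space feature maps. The sketch below shows that general pattern under my own assumptions; the actual MoWM mechanism may differ.

```python
import torch
import torch.nn as nn

class LatentToPixelModulation(nn.Module):
    """FiLM-style modulation: a high-level latent feature produces a
    per-channel scale and shift applied to pixel-space feature maps.
    One plausible reading of 'latent-to-pixel feature modulation'; not
    the MoWM implementation."""

    def __init__(self, latent_dim=64, pixel_channels=32):
        super().__init__()
        self.to_scale_shift = nn.Linear(latent_dim, 2 * pixel_channels)

    def forward(self, pixel_feats, latent_feat):
        # pixel_feats: (B, C, H, W); latent_feat: (B, latent_dim)
        scale, shift = self.to_scale_shift(latent_feat).chunk(2, dim=-1)
        scale = scale[:, :, None, None]        # broadcast over H, W
        shift = shift[:, :, None, None]
        return pixel_feats * (1 + scale) + shift

mod = LatentToPixelModulation()
out = mod(torch.randn(2, 32, 8, 8), torch.randn(2, 64))
print(out.shape)                               # torch.Size([2, 32, 8, 8])
```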
- Latent Action Pretraining Through World Modeling [1.988007188564225]
We propose LAWM, a model-agnostic framework to pretrain imitation learning models in a self-supervised way. Our framework is designed to be effective for transferring across tasks, environments, and embodiments.
arXiv Detail & Related papers (2025-09-22T21:19:10Z)
- Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators. To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module. Experiments demonstrate that DWS can be flexibly applied to both diffusion and autoregressive transformer models.
arXiv Detail & Related papers (2025-02-10T14:49:09Z)
- SOLD: Slot Object-Centric Latent Dynamics Models for Relational Manipulation Learning from Pixels [16.020835290802548]
Slot Attention for Object-centric Latent Dynamics (SOLD) is a novel model-based reinforcement learning algorithm. It learns object-centric dynamics models in an unsupervised manner from pixel inputs; the underlying slot-attention mechanism is sketched after this entry. We demonstrate that the structured latent space not only improves model interpretability but also provides a valuable input space for behavior models to reason over.
arXiv Detail & Related papers (2024-10-11T14:03:31Z)
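SOLD builds on Slot Attention (Locatello et al., 2020), whose core iteration binds input features to a fixed number of object slots. A simplified version of that iteration is sketched below; the layer norms and per-iteration MLP of the full method are omitted for brevity, and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Simplified Slot Attention (Locatello et al., 2020): attention is
    normalized over slots so inputs compete for slot membership, then
    slots are updated with a GRU cell. Omits the per-iteration MLP and
    layer norms of the full method."""

    def __init__(self, dim=64, num_slots=5, iters=3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_init = nn.Parameter(torch.randn(num_slots, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, feats):                      # feats: (B, N, dim)
        B = feats.shape[0]
        slots = self.slots_init.expand(B, -1, -1)
        k, v = self.to_k(feats), self.to_v(feats)
        for _ in range(self.iters):
            q = self.to_q(slots)
            attn = torch.einsum('bnd,bkd->bnk', k, q) * self.scale
            attn = attn.softmax(dim=-1)            # softmax over slots
            # Weighted mean over inputs, per slot.
            attn = attn / attn.sum(dim=1, keepdim=True).clamp(min=1e-8)
            updates = torch.einsum('bnk,bnd->bkd', attn, v)
            slots = self.gru(updates.reshape(-1, updates.shape[-1]),
                             slots.reshape(-1, slots.shape[-1])
                             ).reshape(B, self.num_slots, -1)
        return slots                               # (B, num_slots, dim)

slots = SlotAttention()(torch.randn(2, 49, 64))    # e.g. 7x7 feature map
print(slots.shape)                                 # torch.Size([2, 5, 64])
```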
- Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z)
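The ContextWM summary above describes explicitly separating context from dynamics. One plausible reading, sketched below, encodes a per-clip context vector once and holds it fixed while a recurrent state carries per-frame dynamics; the architecture details here are my assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class ContextDynamicsModel(nn.Module):
    """Illustrative context/dynamics separation: a context vector is
    encoded once per clip (static appearance), while a GRU models the
    per-frame dynamics. Not the ContextWM architecture."""

    def __init__(self, obs_dim=128, ctx_dim=32, state_dim=64):
        super().__init__()
        self.context_enc = nn.Linear(obs_dim, ctx_dim)
        self.dynamics = nn.GRU(obs_dim + ctx_dim, state_dim, batch_first=True)
        self.decoder = nn.Linear(state_dim + ctx_dim, obs_dim)

    def forward(self, video):                     # video: (B, T, obs_dim)
        ctx = self.context_enc(video[:, 0])       # context from first frame
        ctx_seq = ctx[:, None].expand(-1, video.shape[1], -1)
        states, _ = self.dynamics(torch.cat([video, ctx_seq], dim=-1))
        # Reconstruct/predict observations from dynamics state + context.
        return self.decoder(torch.cat([states, ctx_seq], dim=-1))

model = ContextDynamicsModel()
print(model(torch.randn(4, 10, 128)).shape)       # torch.Size([4, 10, 128])
```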
- Masked World Models for Visual Control [90.13638482124567]
We introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning.
We demonstrate that our approach achieves state-of-the-art performance on a variety of visual robotic tasks.
arXiv Detail & Related papers (2022-06-28T18:42:27Z)