Latent Particle World Models: Self-supervised Object-centric Stochastic Dynamics Modeling
- URL: http://arxiv.org/abs/2603.04553v1
- Date: Wed, 04 Mar 2026 19:36:08 GMT
- Title: Latent Particle World Models: Self-supervised Object-centric Stochastic Dynamics Modeling
- Authors: Tal Daniel, Carl Qi, Dan Haramati, Amir Zadeh, Chuan Li, Aviv Tamar, Deepak Pathak, David Held
- Abstract summary: We introduce Latent Particle World Model (LPWM), a self-supervised object-centric world model scaled to real-world multi-object datasets. LPWM autonomously discovers keypoints, bounding boxes, and object masks directly from video data. Our architecture is trained end-to-end purely from videos and supports flexible conditioning on actions, language, and image goals.
- Score: 51.40150411616207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Latent Particle World Model (LPWM), a self-supervised object-centric world model scaled to real-world multi-object datasets and applicable in decision-making. LPWM autonomously discovers keypoints, bounding boxes, and object masks directly from video data, enabling it to learn rich scene decompositions without supervision. Our architecture is trained end-to-end purely from videos and supports flexible conditioning on actions, language, and image goals. LPWM models stochastic particle dynamics via a novel latent action module and achieves state-of-the-art results on diverse real-world and synthetic datasets. Beyond stochastic video modeling, LPWM is readily applicable to decision-making, including goal-conditioned imitation learning, as we demonstrate in the paper. Code, data, pre-trained models and video rollouts are available: https://taldatech.github.io/lpwm-web
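The abstract describes particle-based latents advanced by a stochastic latent action module, but gives no implementation details. The following is a minimal, hypothetical PyTorch sketch of what one rollout step of such a model might look like: a set of latent particles (2-D keypoint plus appearance features) is advanced by a transformer conditioned on a sampled Gaussian latent action. All module names, shapes, and design choices are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class LatentParticleStep(nn.Module):
    """Illustrative sketch of one rollout step of a particle-based world
    model: K particles, each an (x, y) keypoint plus a feature vector, are
    advanced by a transformer conditioned on a stochastic latent action.
    Shapes and module choices are assumptions, not the LPWM code."""

    def __init__(self, num_particles=16, feat_dim=64, action_dim=8):
        super().__init__()
        self.particle_dim = 2 + feat_dim               # (x, y) + appearance
        # Latent action module: encodes a transition (t, t+1) into a
        # Gaussian latent action so rollouts can be stochastic.
        self.action_head = nn.Linear(2 * num_particles * self.particle_dim,
                                     2 * action_dim)
        self.action_proj = nn.Linear(action_dim, self.particle_dim)
        layer = nn.TransformerEncoderLayer(d_model=self.particle_dim,
                                           nhead=2, batch_first=True)
        self.dynamics = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, particles_t, particles_tp1=None):
        B = particles_t.shape[0]
        if particles_tp1 is None:                      # prior: dream rollout
            mu = torch.zeros(B, self.action_proj.in_features)
            logvar = torch.zeros_like(mu)
        else:                                          # posterior: from data
            pair = torch.cat([particles_t, particles_tp1], dim=1)
            mu, logvar = self.action_head(pair.flatten(1)).chunk(2, dim=-1)
        # Reparameterized sample of the latent action.
        action = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        # Broadcast the action as an extra token seen by every particle.
        tokens = torch.cat([self.action_proj(action)[:, None], particles_t],
                           dim=1)
        return self.dynamics(tokens)[:, 1:]            # predicted particles

model = LatentParticleStep()
p_t = torch.randn(4, 16, 66)                           # batch of particle sets
print(model(p_t).shape)                                # torch.Size([4, 16, 66])
```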
Related papers
- Factored Latent Action World Models [39.60866765151469]
Learning latent actions from action-free video has emerged as a powerful paradigm for scaling up controllable world model learning. This paper introduces Factored Latent Action Model (FLAM), a factored dynamics framework that decomposes the scene into independent factors.
arXiv Detail & Related papers (2026-02-18T07:08:14Z)
- A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures [58.26804959656713]
We present EB-JEPA, an open-source library for learning representations and world models using Joint-Embedding Predictive Architectures (JEPAs). JEPAs learn to predict in representation space rather than pixel space, avoiding the pitfalls of generative modeling; a generic sketch of this idea follows this entry. We show how these representations can drive action-conditioned world models, achieving a 97% planning success rate on the Two Rooms navigation task.
arXiv Detail & Related papers (2026-02-03T14:56:24Z)
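For orientation, here is a minimal, generic JEPA training step illustrating "predict in representation space": an online encoder and predictor regress onto targets produced by an EMA target encoder, so the loss never touches pixels. This is not taken from the EB-JEPA library; all names and hyperparameters are illustrative.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Generic JEPA sketch (illustrative, not the EB-JEPA API).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
predictor = nn.Linear(128, 128)
target_encoder = copy.deepcopy(encoder)
for p in target_encoder.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam([*encoder.parameters(), *predictor.parameters()],
                       lr=1e-3)

def jepa_step(context_view, target_view, tau=0.996):
    pred = predictor(encoder(context_view))
    with torch.no_grad():
        target = target_encoder(target_view)
    loss = F.mse_loss(pred, target)        # loss in representation space
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():                  # EMA update of the target encoder
        for p, tp in zip(encoder.parameters(), target_encoder.parameters()):
            tp.lerp_(p, 1 - tau)
    return loss.item()

x = torch.randn(8, 3, 32, 32)
print(jepa_step(x, x + 0.1 * torch.randn_like(x)))
```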
- Can World Models Benefit VLMs for World Dynamics? [59.73433292793044]
We investigate the capabilities that emerge when world model priors are transferred into Vision-Language Models. We name our best-performing variant Dynamic Vision Aligner (DyVA). We find that DyVA surpasses both open-source and proprietary baselines, achieving state-of-the-art or comparable performance.
arXiv Detail & Related papers (2025-10-01T13:07:05Z)
- MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation [18.468025471225527]
MoWM is a mixture-of-world-models framework that fuses representations from hybrid world models for embodied action planning. Our approach uses motion-aware representations from a latent model as a high-level prior, which guides the extraction of fine-grained visual features from the pixel-space model; a sketch of this modulation pattern follows this entry.
arXiv Detail & Related papers (2025-09-26T02:54:36Z)
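"Latent-to-pixel feature modulation" suggests a FiLM-style pattern, where a high-level latent feature produces per-channel scale and shift that gate pixel-space feature maps. The sketch below shows that general pattern under my own assumptions; the actual MoWM mechanism may differ.

```python
import torch
import torch.nn as nn

class LatentToPixelModulation(nn.Module):
    """FiLM-style modulation: a high-level latent feature produces a
    per-channel scale and shift applied to pixel-space feature maps.
    One plausible reading of 'latent-to-pixel feature modulation'; not
    the MoWM implementation."""

    def __init__(self, latent_dim=64, pixel_channels=32):
        super().__init__()
        self.to_scale_shift = nn.Linear(latent_dim, 2 * pixel_channels)

    def forward(self, pixel_feats, latent_feat):
        # pixel_feats: (B, C, H, W); latent_feat: (B, latent_dim)
        scale, shift = self.to_scale_shift(latent_feat).chunk(2, dim=-1)
        scale = scale[:, :, None, None]        # broadcast over H, W
        shift = shift[:, :, None, None]
        return pixel_feats * (1 + scale) + shift

mod = LatentToPixelModulation()
out = mod(torch.randn(2, 32, 8, 8), torch.randn(2, 64))
print(out.shape)                               # torch.Size([2, 32, 8, 8])
```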
- Latent Action Pretraining Through World Modeling [1.988007188564225]
We propose LAWM, a model-agnostic framework to pretrain imitation learning models in a self-supervised way. Our framework is designed to be effective for transferring across tasks, environments, and embodiments.
arXiv Detail & Related papers (2025-09-22T21:19:10Z)
- Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators. To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module. Experiments demonstrate that DWS can be flexibly applied to both diffusion and autoregressive transformer models.
arXiv Detail & Related papers (2025-02-10T14:49:09Z)
- SOLD: Slot Object-Centric Latent Dynamics Models for Relational Manipulation Learning from Pixels [16.020835290802548]
Slot Attention for Object-centric Latent Dynamics (SOLD) is a novel model-based reinforcement learning algorithm. It learns object-centric dynamics models in an unsupervised manner from pixel inputs; the underlying slot-attention mechanism is sketched after this entry. We demonstrate that the structured latent space not only improves model interpretability but also provides a valuable input space for behavior models to reason over.
arXiv Detail & Related papers (2024-10-11T14:03:31Z)
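SOLD builds on Slot Attention (Locatello et al., 2020), whose core iteration binds input features to a fixed number of object slots. A simplified version of that iteration is sketched below; the layer norms and per-iteration MLP of the full method are omitted for brevity, and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Simplified Slot Attention (Locatello et al., 2020): attention is
    normalized over slots so inputs compete for slot membership, then
    slots are updated with a GRU cell. Omits the per-iteration MLP and
    layer norms of the full method."""

    def __init__(self, dim=64, num_slots=5, iters=3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_init = nn.Parameter(torch.randn(num_slots, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, feats):                      # feats: (B, N, dim)
        B = feats.shape[0]
        slots = self.slots_init.expand(B, -1, -1)
        k, v = self.to_k(feats), self.to_v(feats)
        for _ in range(self.iters):
            q = self.to_q(slots)
            attn = torch.einsum('bnd,bkd->bnk', k, q) * self.scale
            attn = attn.softmax(dim=-1)            # softmax over slots
            # Weighted mean over inputs, per slot.
            attn = attn / attn.sum(dim=1, keepdim=True).clamp(min=1e-8)
            updates = torch.einsum('bnk,bnd->bkd', attn, v)
            slots = self.gru(updates.reshape(-1, updates.shape[-1]),
                             slots.reshape(-1, slots.shape[-1])
                             ).reshape(B, self.num_slots, -1)
        return slots                               # (B, num_slots, dim)

slots = SlotAttention()(torch.randn(2, 49, 64))    # e.g. 7x7 feature map
print(slots.shape)                                 # torch.Size([2, 5, 64])
```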
- Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z)
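The ContextWM summary above describes explicitly separating context from dynamics. One plausible reading, sketched below, encodes a per-clip context vector once and holds it fixed while a recurrent state carries per-frame dynamics; the architecture details here are my assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class ContextDynamicsModel(nn.Module):
    """Illustrative context/dynamics separation: a context vector is
    encoded once per clip (static appearance), while a GRU models the
    per-frame dynamics. Not the ContextWM architecture."""

    def __init__(self, obs_dim=128, ctx_dim=32, state_dim=64):
        super().__init__()
        self.context_enc = nn.Linear(obs_dim, ctx_dim)
        self.dynamics = nn.GRU(obs_dim + ctx_dim, state_dim, batch_first=True)
        self.decoder = nn.Linear(state_dim + ctx_dim, obs_dim)

    def forward(self, video):                     # video: (B, T, obs_dim)
        ctx = self.context_enc(video[:, 0])       # context from first frame
        ctx_seq = ctx[:, None].expand(-1, video.shape[1], -1)
        states, _ = self.dynamics(torch.cat([video, ctx_seq], dim=-1))
        # Reconstruct/predict observations from dynamics state + context.
        return self.decoder(torch.cat([states, ctx_seq], dim=-1))

model = ContextDynamicsModel()
print(model(torch.randn(4, 10, 128)).shape)       # torch.Size([4, 10, 128])
```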
- Masked World Models for Visual Control [90.13638482124567]
We introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning.
We demonstrate that our approach achieves state-of-the-art performance on a variety of visual robotic tasks.
arXiv Detail & Related papers (2022-06-28T18:42:27Z)