When Object-Centric World Models Meet Policy Learning: From Pixels to Policies, and Where It Breaks
- URL: http://arxiv.org/abs/2511.06136v2
- Date: Wed, 12 Nov 2025 01:39:44 GMT
- Title: When Object-Centric World Models Meet Policy Learning: From Pixels to Policies, and Where It Breaks
- Authors: Stefano Ferraro, Akihiro Nakano, Masahiro Suzuki, Yutaka Matsuo
- Abstract summary: We introduce DLPWM, a fully unsupervised, disentangled object-centric world model that learns object-level latents directly from pixels.
DLPWM achieves strong reconstruction and prediction performance, including robustness to several out-of-distribution (OOD) visual variations.
Our results suggest that, although object-centric perception supports robust visual modeling, achieving stable control requires mitigating latent drift.
- Score: 24.669692812050645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Object-centric world models (OCWM) aim to decompose visual scenes into object-level representations, providing structured abstractions that could improve compositional generalization and data efficiency in reinforcement learning. We hypothesize that explicitly disentangled object-level representations, by localizing task-relevant information, can enhance policy performance across novel feature combinations. To test this hypothesis, we introduce DLPWM, a fully unsupervised, disentangled object-centric world model that learns object-level latents directly from pixels. DLPWM achieves strong reconstruction and prediction performance, including robustness to several out-of-distribution (OOD) visual variations. However, when used for downstream model-based control, policies trained on DLPWM latents underperform compared to DreamerV3. Through latent-trajectory analyses, we identify representation shift during multi-object interactions as a key driver of unstable policy learning. Our results suggest that, although object-centric perception supports robust visual modeling, achieving stable control requires mitigating latent drift.
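The latent-trajectory analysis mentioned in the abstract can be sketched as follows. This is an illustrative reconstruction, not DLPWM's actual implementation: the slot-based array layout, function names, and z-score threshold are all assumptions.

```python
import numpy as np

def latent_drift(latents):
    """Per-step drift of object-centric latents.

    latents: array of shape (T, K, D) -- T timesteps, K object slots,
    D latent dims (a hypothetical layout; DLPWM's real latents may differ).
    Returns an array of shape (T-1, K): the L2 distance each slot's
    latent moves between consecutive timesteps.
    """
    deltas = np.diff(latents, axis=0)          # (T-1, K, D)
    return np.linalg.norm(deltas, axis=-1)     # (T-1, K)

def drift_spikes(drift, z_thresh=3.0):
    """Flag timesteps where any slot's drift exceeds z_thresh standard
    deviations of its history -- a crude proxy for the representation
    shift the paper reports during multi-object interactions."""
    mu = drift.mean(axis=0, keepdims=True)
    sd = drift.std(axis=0, keepdims=True) + 1e-8
    return np.where((np.abs(drift - mu) / sd > z_thresh).any(axis=1))[0]
```

Flagged timesteps can then be compared against logged contact events to test whether drift concentrates around interactions.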
Related papers
- Latent Particle World Models: Self-supervised Object-centric Stochastic Dynamics Modeling [51.40150411616207]
We introduce Latent Particle World Model (LPWM), a self-supervised object-centric world model scaled to real-world multi-object datasets.
LPWM autonomously discovers keypoints, bounding boxes, and object masks directly from video data.
Our architecture is trained end-to-end purely from videos and supports flexible conditioning on actions, language, and image goals.
arXiv Detail & Related papers (2026-03-04T19:36:08Z) - Causal-JEPA: Learning World Models through Object-Level Latent Interventions [46.562961546550895]
C-JEPA is a simple and flexible object-centric world model that extends masked joint embedding prediction from image patches to object-centric representations.
By applying object-level masking that requires an object's state to be inferred from other objects, C-JEPA induces latent interventions with counterfactual-like effects.
arXiv Detail & Related papers (2026-02-11T21:47:26Z) - LVLM-Aided Alignment of Task-Specific Vision Models [49.96265491629163]
Small task-specific vision models are crucial in high-stakes domains.
We introduce a novel and efficient method for aligning small task-specific vision models with human domain knowledge.
Our method demonstrates substantial improvement in aligning model behavior with human specifications.
arXiv Detail & Related papers (2025-12-26T11:11:25Z) - Object-Centric World Models for Causality-Aware Reinforcement Learning [13.063093054280946]
We propose Transformer Imagination with CAusality-aware reinforcement learning (STICA), a unified framework in which object-centric Transformers serve as both the world model and the causality-aware policy and value networks.
Experiments on object-rich benchmarks demonstrate that STICA consistently outperforms state-of-the-art agents in both sample efficiency and final performance.
arXiv Detail & Related papers (2025-11-18T08:53:09Z) - Salience-Invariant Consistent Policy Learning for Generalization in Visual Reinforcement Learning [12.9372563969007]
Generalizing policies to unseen scenarios remains a critical challenge in visual reinforcement learning.
In unseen environments, distracting pixels may lead agents to extract representations containing task-irrelevant information.
We propose the Salience-Invariant Consistent Policy Learning algorithm, an efficient framework for zero-shot generalization.
arXiv Detail & Related papers (2025-02-12T12:00:16Z) - MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes [35.16430027877207]
MOVIS aims to enhance the structural awareness of the view-conditioned diffusion model for multi-object NVS.
We introduce an auxiliary task requiring the model to simultaneously predict novel view object masks.
Our method exhibits strong generalization capabilities and produces consistent novel view synthesis.
arXiv Detail & Related papers (2024-12-16T05:23:45Z) - Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z) - Relational Object-Centric Actor-Critic [44.99833362998488]
Recent works highlight that disentangled object representations can aid policy learning in image-based, object-centric reinforcement learning tasks.
This paper proposes a novel object-centric reinforcement learning algorithm that integrates actor-critic and model-based approaches.
We evaluate our method in a simulated 3D robotic environment and a 2D environment with compositional structure.
arXiv Detail & Related papers (2023-10-26T06:05:12Z) - Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
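The PCA-based localization step above can be sketched generically: project dense backbone features onto their first principal component and threshold the projection to get a coarse foreground mask. This is a minimal reconstruction of the general technique, not the paper's exact pipeline; the sign-disambiguation heuristic is an assumption.

```python
import numpy as np

def pca_foreground_mask(features):
    """Localize a coarse object region from dense features via PCA.

    features: (H, W, C) feature map from a self-supervised backbone.
    Projects each spatial feature onto the first principal component,
    thresholds at zero, and (heuristically) keeps the smaller region
    as the foreground hypothesis.
    """
    h, w, c = features.shape
    x = features.reshape(-1, c)
    x = x - x.mean(axis=0)
    # First principal component via SVD of the centered feature matrix.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    proj = x @ vt[0]
    mask = (proj > 0).reshape(h, w)
    # PCA sign is arbitrary: assume the smaller region is the object.
    if mask.sum() > mask.size / 2:
        mask = ~mask
    return mask
```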
arXiv Detail & Related papers (2023-07-07T04:03:48Z) - ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP)
ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective.
We show that ALP outperforms existing baselines in several downstream perception tasks.
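The inverse dynamics prediction objective in ALP can be illustrated with a deliberately minimal stand-in: in the paper this is a neural network trained jointly with the policy so that its gradients shape the representation, whereas the sketch below just solves the least-squares problem for fixed features to expose the objective itself. All names are hypothetical.

```python
import numpy as np

def inverse_dynamics_fit(phi_t, phi_t1, actions):
    """Fit a linear inverse-dynamics model a_t ~ W @ [phi(s_t); phi(s_t+1)].

    phi_t, phi_t1: (N, D) feature encodings of consecutive observations.
    actions: (N, A) actions taken between them.
    Returns the fitted weights and the mean-squared prediction error,
    i.e. the inverse-dynamics loss the representation is trained to lower.
    """
    x = np.concatenate([phi_t, phi_t1], axis=1)      # (N, 2D)
    w, *_ = np.linalg.lstsq(x, actions, rcond=None)  # (2D, A)
    mse = float(np.mean((x @ w - actions) ** 2))
    return w, mse
```

The intuition: if actions are recoverable from consecutive encodings, the encoder must retain the controllable, action-relevant parts of the observation.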
arXiv Detail & Related papers (2023-06-16T21:51:04Z) - Latent Variable Representation for Reinforcement Learning [131.03944557979725]
It remains unclear theoretically and empirically how latent variable models may facilitate learning, planning, and exploration to improve the sample efficiency of model-based reinforcement learning.
We provide a representation view of the latent variable models for state-action value functions, which allows both tractable variational learning algorithm and effective implementation of the optimism/pessimism principle.
In particular, we propose a computationally efficient planning algorithm with UCB exploration by incorporating kernel embeddings of latent variable models.
arXiv Detail & Related papers (2022-12-17T00:26:31Z) - CPPF++: Uncertainty-Aware Sim2Real Object Pose Estimation by Vote Aggregation [67.12857074801731]
We introduce a novel method, CPPF++, designed for sim-to-real pose estimation.
To address the challenge posed by vote collision, we propose a novel approach that involves modeling the voting uncertainty.
We incorporate several innovative modules, including noisy pair filtering, online alignment optimization, and a feature ensemble.
arXiv Detail & Related papers (2022-11-24T03:27:00Z) - Progressive Self-Guided Loss for Salient Object Detection [102.35488902433896]
We present a progressive self-guided loss function to facilitate deep learning-based salient object detection in images.
Our framework takes advantage of adaptively aggregated multi-scale features to locate and detect salient objects effectively.
arXiv Detail & Related papers (2021-01-07T07:33:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.