Related papers: Causal-JEPA: Learning World Models through Object-Level Latent Interventions

URL: http://arxiv.org/abs/2602.11389v1
Date: Wed, 11 Feb 2026 21:47:26 GMT
Title: Causal-JEPA: Learning World Models through Object-Level Latent Interventions
Authors: Heejeong Nam, Quentin Le Lidec, Lucas Maes, Yann LeCun, Randall Balestriero
Abstract summary: C-JEPA is a simple and flexible object-centric world model that extends masked joint embedding prediction from image patches to object-centric representations. By applying object-level masking that requires an object's state to be inferred from other objects, C-JEPA induces latent interventions with counterfactual-like effects.
Score: 46.562961546550895
License: http://creativecommons.org/licenses/by/4.0/
Abstract: World models require robust relational understanding to support prediction, reasoning, and control. While object-centric representations provide a useful abstraction, they are not sufficient to capture interaction-dependent dynamics. We therefore propose C-JEPA, a simple and flexible object-centric world model that extends masked joint embedding prediction from image patches to object-centric representations. By applying object-level masking that requires an object's state to be inferred from other objects, C-JEPA induces latent interventions with counterfactual-like effects and prevents shortcut solutions, making interaction reasoning essential. Empirically, C-JEPA leads to consistent gains in visual question answering, with an absolute improvement of about 20% in counterfactual reasoning compared to the same architecture without object-level masking. On agent control tasks, C-JEPA enables substantially more efficient planning by using only 1% of the total latent input features required by patch-based world models, while achieving comparable performance. Finally, we provide a formal analysis demonstrating that object-level masking induces a causal inductive bias via latent interventions. Our code is available at https://github.com/galilai-group/cjepa.
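
Illustrative sketch (not the authors' implementation): the abstract describes object-level masking, in which an entire object's latent state is hidden and must be inferred from the remaining objects. The NumPy code below shows that idea in miniature. The function names, the zero-vector mask token, the mask ratio, and the mean-of-visible-objects predictor are all assumptions made for illustration; the actual C-JEPA encoder, predictor, and loss may differ and are available in the linked repository.

```python
import numpy as np


def mask_objects(slots, mask_ratio=0.25, rng=None):
    """Hide whole object slots (hypothetical helper, not from the paper).

    slots: array of shape (num_objects, dim), one latent vector per object.
    Returns the masked copy and the sorted indices of the hidden objects.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_objects = slots.shape[0]
    if num_objects < 2:
        raise ValueError("need at least two objects so one stays visible")
    # Mask at least one object, but always leave at least one visible.
    num_masked = int(round(mask_ratio * num_objects))
    num_masked = min(num_objects - 1, max(1, num_masked))
    masked_idx = np.sort(rng.choice(num_objects, size=num_masked, replace=False))
    masked = slots.copy()
    masked[masked_idx] = 0.0  # assumed mask token: a zero vector
    return masked, masked_idx


def mean_of_visible_predictor(masked_slots, masked_idx):
    """Toy stand-in for a learned predictor: fill each hidden slot
    with the mean of the visible slots."""
    visible = np.delete(masked_slots, masked_idx, axis=0)
    predicted = masked_slots.copy()
    predicted[masked_idx] = visible.mean(axis=0)
    return predicted


def masked_prediction_loss(predicted, target, masked_idx):
    """Mean squared error computed only on the hidden object slots."""
    diff = predicted[masked_idx] - target[masked_idx]
    return float(np.mean(diff ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    slots = rng.normal(size=(4, 8))  # 4 objects, 8-dim latents (made-up sizes)
    masked, idx = mask_objects(slots, mask_ratio=0.25, rng=rng)
    predicted = mean_of_visible_predictor(masked, idx)
    print("masked objects:", idx)
    print("loss on masked slots:", masked_prediction_loss(predicted, slots, idx))
```

Because the loss is evaluated only on the hidden slots, the predictor cannot lower it by copying an object's own state; it has to use information from the other objects, which is the shortcut-prevention property the abstract refers to.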

Related papers

VJEPA: Variational Joint Embedding Predictive Architectures as Probabilistic World Models [0.0]
We introduce Variational JEPA (VJEPA), a probabilistic generalization that learns a predictive distribution over future latent states via a variational objective. VJEPA representations can serve as sufficient information states for optimal control without pixel reconstruction, while providing formal guarantees for collapse avoidance. We propose Bayesian JEPA (BJEPA), an extension that factorizes the predictive belief into a learned dynamics expert and a modular prior expert.
arXiv Detail & Related papers (2026-01-20T18:04:16Z)
Learning Fine-Grained Correspondence with Cross-Perspective Perception for Open-Vocabulary 6D Object Pose Estimation [14.262846967061947]
Fine-grained Correspondence Pose Estimation (FiCoP) is a framework that transitions from noise-prone global matching to spatially-constrained patch-level correspondence. FiCoP improves Average Recall by 8.0% and 6.1%, respectively, compared to the state-of-the-art method.
arXiv Detail & Related papers (2026-01-20T03:48:54Z)
Object-Centric World Models for Causality-Aware Reinforcement Learning [13.063093054280946]
We propose Transformer Imagination with CAusality-aware reinforcement learning (STICA), a unified framework in which object-centric Transformers serve as the world model and causality-aware policy and value networks. Experiments on object-rich benchmarks demonstrate that STICA consistently outperforms state-of-the-art agents in both sample efficiency and final performance.
arXiv Detail & Related papers (2025-11-18T08:53:09Z)
When Object-Centric World Models Meet Policy Learning: From Pixels to Policies, and Where It Breaks [24.669692812050645]
We introduce a fully unsupervised, disentangled object-centric world model that learns object-level latents directly from pixels. DLPWM achieves strong reconstruction and prediction performance, including robustness to several out-of-distribution (OOD) visual variations. Our results suggest that, although object-centric perception supports robust visual modeling, achieving stable control requires mitigating latent drift.
arXiv Detail & Related papers (2025-11-08T21:09:44Z)
ACT-JEPA: Novel Joint-Embedding Predictive Architecture for Efficient Policy Representation Learning [90.41852663775086]
ACT-JEPA is a novel architecture that integrates imitation learning and self-supervised learning. We train a policy to predict action sequences and abstract observation sequences. Our experiments show that ACT-JEPA improves the quality of representations by learning temporal environment dynamics.
arXiv Detail & Related papers (2025-01-24T16:41:41Z)
Seamless Detection: Unifying Salient Object Detection and Camouflaged Object Detection [73.85890512959861]
We propose a task-agnostic framework to unify Salient Object Detection (SOD) and Camouflaged Object Detection (COD). We design a simple yet effective contextual decoder involving the interval-layer and global context, which achieves an inference speed of 67 fps. Experiments on public SOD and COD datasets demonstrate the superiority of our proposed framework in both supervised and unsupervised settings.
arXiv Detail & Related papers (2024-12-22T03:25:43Z)
Object-centric proto-symbolic behavioural reasoning from pixels [0.0]
We present a brain-inspired, deep-learning architecture that learns from pixels to interpret, control, and reason about its environment. Results show that the agent can learn emergent conditional behavioural reasoning. The proposed architecture shows how to manipulate grounded object representations, as a key inductive bias for unsupervised learning.
arXiv Detail & Related papers (2024-11-26T13:54:24Z)
Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks. We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level. We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner. We design a semantic-guided self-supervised learning model to extract high-level semantic features from images. We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
Suspected Object Matters: Rethinking Model's Prediction for One-stage Visual Grounding [93.82542533426766]
We propose a Suspected Object Transformation mechanism (SOT) to encourage the target object selection among the suspected ones. SOT can be seamlessly integrated into existing CNN and Transformer-based one-stage visual grounders. Extensive experiments demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2022-03-10T06:41:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.