Learning Invariant Visual Representations for Planning with Joint-Embedding Predictive World Models
- URL: http://arxiv.org/abs/2602.18639v1
- Date: Fri, 20 Feb 2026 22:19:46 GMT
- Title: Learning Invariant Visual Representations for Planning with Joint-Embedding Predictive World Models
- Authors: Leonardo F. Toso, Davit Shadunts, Yunyang Lu, Nihal Sharma, Donglin Zhan, Nam H. Nguyen, James Anderson
- Abstract summary: We improve robustness to slow features while operating in a reduced latent space, up to 10x smaller than that of DINO-WM. Our model is agnostic to the choice of pretrained visual encoder and maintains robustness when paired with DINOv2, SimDINOv2, and iBOT features.
- Score: 9.714188952666918
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: World models learned from high-dimensional visual observations allow agents to make decisions and plan directly in latent space, avoiding pixel-level reconstruction. However, recent latent predictive architectures (JEPAs), including the DINO world model (DINO-WM), display a degradation in test-time robustness due to their sensitivity to "slow features". These include visual variations such as background changes and distractors that are irrelevant to the task being solved. We address this limitation by augmenting the predictive objective with a bisimulation encoder that enforces control-relevant state equivalence, mapping states with similar transition dynamics to nearby latent states while limiting contributions from slow features. We evaluate our model on a simple navigation task under different test-time background changes and visual distractors. Across all benchmarks, our model consistently improves robustness to slow features while operating in a reduced latent space, up to 10x smaller than that of DINO-WM. Moreover, our model is agnostic to the choice of pretrained visual encoder and maintains robustness when paired with DINOv2, SimDINOv2, and iBOT features.
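The bisimulation idea in the abstract — pulling together latent states that have similar transition dynamics, so that task-irrelevant "slow features" carry less weight — can be illustrated with a minimal sketch. This is not the paper's implementation: it follows the reward-difference formulation popularized by Deep Bisimulation for Control, the function name and the use of successor embeddings as a stand-in for a Wasserstein distance between transition models are assumptions, and the toy version below uses NumPy rather than a deep-learning framework.

```python
import numpy as np

def bisimulation_loss(z, rewards, z_next, gamma=0.99):
    """Toy bisimulation objective (a sketch, not the paper's code).

    z:       (B, D) latent embeddings of current states
    rewards: (B,)   scalar rewards observed from those states
    z_next:  (B, D) latent embeddings of successor states, used here as a
             cheap stand-in for the distance between transition distributions
    """
    # Pair each sample with a randomly permuted partner from the same batch.
    perm = np.random.permutation(len(z))

    # L1 distance between paired states in latent space.
    z_dist = np.abs(z - z[perm]).sum(axis=1)

    # Bisimulation target: reward difference plus discounted successor distance.
    r_dist = np.abs(rewards - rewards[perm])
    next_dist = np.abs(z_next - z_next[perm]).sum(axis=1)
    target = r_dist + gamma * next_dist

    # Penalize mismatch between latent distance and the bisimulation target,
    # so states with similar dynamics end up close regardless of appearance.
    return float(np.mean((z_dist - target) ** 2))
```

In the paper's setting this term would be added to the JEPA predictive loss and backpropagated through the bisimulation encoder; two observations that differ only in background clutter share rewards and dynamics, so the objective pushes their embeddings together.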
Related papers
- NeXt2Former-CD: Efficient Remote Sensing Change Detection with Modern Vision Architectures [11.733678383805897]
NeXt2Former-CD is an end-to-end framework that integrates a Siamese ConvNeXt encoder with DINOv3 weights, a deformable attention-based temporal fusion module, and a Mask2Former decoder. Our model maintains inference latency comparable to SSM-based approaches, suggesting it is practical for high-resolution change detection tasks.
arXiv Detail & Related papers (2026-02-21T04:51:53Z) - MambaTAD: When State-Space Models Meet Long-Range Temporal Action Detection [94.12444452690329]
This paper presents MambaTAD, a new state-space TAD model that introduces long-range modeling and global feature detection capabilities. MambaTAD achieves superior TAD performance consistently across multiple public benchmarks.
arXiv Detail & Related papers (2025-11-22T06:04:29Z) - Artificial Intelligence-Based Multiscale Temporal Modeling for Anomaly Detection in Cloud Services [10.421371572062595]
This study proposes an anomaly detection method based on the Transformer architecture with integrated multiscale feature perception. The proposed method outperforms mainstream baseline models in key metrics, including precision, recall, AUC, and F1-score.
arXiv Detail & Related papers (2025-08-20T07:52:36Z) - DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model [2.163881720692685]
Learning-based monocular visual odometry (VO) poses robustness, generalization, and efficiency challenges in robotics. Recent advances in visual foundation models, such as DINOv2, have improved robustness and generalization in various vision tasks. We present DINO-VO, a feature-based VO system leveraging the DINOv2 visual foundation model for its sparse feature matching.
arXiv Detail & Related papers (2025-07-17T14:09:34Z) - RoHOI: Robustness Benchmark for Human-Object Interaction Detection [84.78366452133514]
Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. We introduce the first robustness benchmark for HOI detection, evaluating model resilience under diverse challenges. Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric.
arXiv Detail & Related papers (2025-07-12T01:58:04Z) - Efficient Remote Sensing Change Detection with Change State Space Models [4.698129958118586]
The Change State Space Model has been specifically designed for change detection by focusing on the relevant changes between bi-temporal images. The proposed model has been evaluated on three benchmark datasets, where it outperformed ConvNets, ViTs, and Mamba-based counterparts at a fraction of their computational complexity.
arXiv Detail & Related papers (2025-04-15T11:25:10Z) - SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining [62.433137130087445]
SuperFlow++ is a novel framework that integrates pretraining and downstream tasks using consecutive camera pairs. We show that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving.
arXiv Detail & Related papers (2025-03-25T17:59:57Z) - Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z) - Fast-Slow Test-Time Adaptation for Online Vision-and-Language Navigation [67.18144414660681]
We propose a Fast-Slow Test-Time Adaptation (FSTTA) approach for online Vision-and-Language Navigation (VLN).
Our method obtains impressive performance gains on four popular benchmarks.
arXiv Detail & Related papers (2023-11-22T07:47:39Z) - Learning Robust Dynamics through Variational Sparse Gating [18.476155786474358]
In environments with many objects, often only a small number of them are moving or interacting at the same time.
In this paper, we investigate integrating this inductive bias of sparse interactions into the latent dynamics of world models trained from pixels.
arXiv Detail & Related papers (2022-10-21T02:56:51Z) - Progressive Self-Guided Loss for Salient Object Detection [102.35488902433896]
We present a progressive self-guided loss function to facilitate deep learning-based salient object detection in images.
Our framework takes advantage of adaptively aggregated multi-scale features to locate and detect salient objects effectively.
arXiv Detail & Related papers (2021-01-07T07:33:38Z) - Spatial-Temporal Alignment Network for Action Recognition and Detection [80.19235282200697]
This paper studies how to introduce viewpoint-invariant feature representations that can help action recognition and detection.
We propose a novel Spatial-Temporal Alignment Network (STAN) that aims to learn geometric invariant representations for action recognition and action detection.
We test our STAN model extensively on AVA, Kinetics-400, AVA-Kinetics, Charades, and Charades-Ego datasets.
arXiv Detail & Related papers (2020-12-04T06:23:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.