Learning to Act Robustly with View-Invariant Latent Actions
- URL: http://arxiv.org/abs/2601.02994v1
- Date: Tue, 06 Jan 2026 13:14:01 GMT
- Title: Learning to Act Robustly with View-Invariant Latent Actions
- Authors: Youngjoon Jeong, Junha Chun, Taesup Kim
- Abstract summary: Vision-based robotic policies often struggle with even minor viewpoint changes, underscoring the need for view-invariant visual representations. We propose View-Invariant Latent Action (VILA), which models a latent action capturing transition patterns across trajectories to learn view-invariant representations grounded in physical dynamics. Experiments in both simulation and the real world show that VILA-based policies generalize effectively to unseen viewpoints and transfer well to new tasks.
- Score: 8.446887947386559
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-based robotic policies often struggle with even minor viewpoint changes, underscoring the need for view-invariant visual representations. This challenge becomes more pronounced in real-world settings, where viewpoint variability is unavoidable and can significantly disrupt policy performance. Existing methods typically learn invariance from multi-view observations at the scene level, but such approaches rely on visual appearance and fail to incorporate the physical dynamics essential for robust generalization. We propose View-Invariant Latent Action (VILA), which models a latent action capturing transition patterns across trajectories to learn view-invariant representations grounded in physical dynamics. VILA aligns these latent actions across viewpoints using an action-guided objective based on ground-truth action sequences. Experiments in both simulation and the real world show that VILA-based policies generalize effectively to unseen viewpoints and transfer well to new tasks, establishing VILA as a strong pretraining framework that improves robustness and downstream learning performance.
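The abstract describes two ingredients: a latent action that encodes the transition between consecutive observations, and an action-guided objective that aligns these latent actions across viewpoints using ground-truth action sequences. Below is a minimal PyTorch sketch of how such an objective could look; the module names, shapes, and loss weighting are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a view-invariant latent-action alignment objective
# in the spirit of VILA. All names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionEncoder(nn.Module):
    """Encodes an (o_t, o_{t+1}) transition into a latent action z."""
    def __init__(self, feat_dim: int = 512, latent_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, obs_t: torch.Tensor, obs_next: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs_t, obs_next], dim=-1))

def vila_style_loss(enc, feats_view_a, feats_view_b, actions, action_head):
    """Align latent actions across two views of the same transition and
    ground them with the true robot action.

    feats_view_*: (B, 2, feat_dim) features for o_t and o_{t+1} per view.
    actions:      (B, action_dim) ground-truth action for the transition.
    """
    z_a = enc(feats_view_a[:, 0], feats_view_a[:, 1])
    z_b = enc(feats_view_b[:, 0], feats_view_b[:, 1])
    # Cross-view alignment: same physical transition -> same latent action.
    align = F.mse_loss(z_a, z_b)
    # Action grounding: the latent action should predict the true action.
    grounding = F.mse_loss(action_head(z_a), actions)
    return align + grounding
```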
Related papers
- NaVIDA: Vision-Language Navigation with Inverse Dynamics Augmentation [50.027425808733994]
NaVIDA is a unified VLN framework that couples policy learning with action-grounded visual dynamics and adaptive execution. NaVIDA augments training with chunk-based inverse-dynamics supervision to learn the causal relationship between visual changes and corresponding actions. Experiments show that NaVIDA achieves superior navigation performance compared to state-of-the-art methods with fewer parameters.
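As a rough illustration of chunk-based inverse-dynamics supervision, the sketch below predicts a chunk of actions from the visual features before and after the chunk was executed. The module name, shapes, and chunk length are assumptions for illustration, not NaVIDA's actual architecture.

```python
# Hedged sketch of chunk-based inverse dynamics: infer the action chunk
# that transformed one visual state into another. Shapes are assumed.
import torch
import torch.nn as nn

class ChunkInverseDynamics(nn.Module):
    def __init__(self, feat_dim: int = 512, action_dim: int = 7, chunk_len: int = 8):
        super().__init__()
        self.chunk_len = chunk_len
        self.action_dim = action_dim
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, chunk_len * action_dim),
        )

    def forward(self, feat_start: torch.Tensor, feat_end: torch.Tensor) -> torch.Tensor:
        # Predict the whole action chunk from the visual change it produced.
        out = self.head(torch.cat([feat_start, feat_end], dim=-1))
        return out.view(-1, self.chunk_len, self.action_dim)
```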
arXiv Detail & Related papers (2026-01-26T06:16:17Z)
- Do-Undo: Generating and Reversing Physical Actions in Vision-Language Models [57.71440995598757]
We introduce the Do-Undo task and benchmark to address a critical gap in vision-language models. Do-Undo requires models to simulate the outcome of a physical action and then accurately reverse it, reflecting true cause-and-effect in the visual world.
arXiv Detail & Related papers (2025-12-15T18:03:42Z)
- PosA-VLA: Enhancing Action Generation via Pose-Conditioned Anchor Attention [92.85371254435074]
The PosA-VLA framework anchors visual attention via pose-conditioned supervision, consistently guiding the model's perception toward task-relevant regions. We show that our method executes embodied tasks with precise and time-efficient behavior across diverse robotic manipulation benchmarks.
arXiv Detail & Related papers (2025-12-03T12:14:29Z)
- Attentive Feature Aggregation or: How Policies Learn to Stop Worrying about Robustness and Attend to Task-Relevant Visual Cues [69.24378760740171]
We address visuomotor policy feature pooling as a solution to the observed lack of robustness in perturbed scenes. We achieve this via Attentive Feature Aggregation (AFA), a lightweight, trainable pooling mechanism that learns to naturally attend to task-relevant visual cues. Our findings show that ignoring extraneous visual information is a crucial step towards deploying robust and generalisable visuomotor policies.
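One common way to realize a lightweight trainable pooling mechanism of this kind is cross-attention from a learned query over patch tokens; the sketch below shows that pattern. It is a plausible instantiation under stated assumptions, not AFA's published design.

```python
# Hedged sketch of attentive pooling: a single learned query attends over
# patch tokens so the pooled feature emphasizes task-relevant cues.
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) patch features from a vision backbone.
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)  # weighted sum over patches
        return pooled.squeeze(1)  # (B, dim) pooled visual feature
```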
arXiv Detail & Related papers (2025-11-13T19:31:05Z)
- VisionLaw: Inferring Interpretable Intrinsic Dynamics from Visual Observations via Bilevel Optimization [3.131272328696594]
VisionLaw is a bilevel optimization framework that infers interpretable expressions of intrinsic dynamics from visual observations. It significantly outperforms existing state-of-the-art methods and exhibits strong generalization for interactive simulation in novel scenarios.
arXiv Detail & Related papers (2025-08-19T12:52:16Z)
- Zero-Shot Visual Generalization in Robot Manipulation [0.13280779791485384]
Current approaches often sidestep the problem by relying on invariant representations such as point clouds and depth. Disentangled representation learning has recently shown promise in enabling vision-based reinforcement learning policies to be robust to visual distribution shifts. We demonstrate zero-shot adaptability to visual perturbations in both simulation and on real hardware.
arXiv Detail & Related papers (2025-05-16T22:01:46Z)
- DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation [16.863534382288705]
We propose a novel framework that enhances action learning by jointly predicting future states and adapting to dynamics variations based on historical trajectories. DyWA achieves an average success rate of 68% in real-world experiments.
arXiv Detail & Related papers (2025-03-21T02:29:52Z)
- Generalization in Visual Reinforcement Learning with the Reward Sequence Distribution [98.67737684075587]
Generalization in partially observed Markov decision processes (POMDPs) is critical for successful applications of visual reinforcement learning (VRL).
We propose the reward sequence distribution conditioned on the starting observation and the predefined subsequent action sequence (RSD-OA).
Experiments demonstrate that our representation learning approach based on RSD-OA significantly improves the generalization performance on unseen environments.
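To make the RSD-OA idea concrete, the sketch below predicts the parameters of a per-step Gaussian over the reward sequence from the starting observation and a fixed action sequence. The module, shapes, and Gaussian parameterization are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of an RSD-OA-style predictor: from the starting observation
# and a predefined action sequence, predict a distribution over the reward
# sequence. All names and shapes are assumptions.
import torch
import torch.nn as nn

class RewardSequencePredictor(nn.Module):
    def __init__(self, obs_dim: int = 512, action_dim: int = 7, horizon: int = 10):
        super().__init__()
        self.horizon = horizon
        self.net = nn.Sequential(
            nn.Linear(obs_dim + horizon * action_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * horizon),  # mean and log-variance per step
        )

    def forward(self, obs0: torch.Tensor, action_seq: torch.Tensor):
        # obs0: (B, obs_dim); action_seq: (B, horizon, action_dim)
        x = torch.cat([obs0, action_seq.flatten(1)], dim=-1)
        mu, log_var = self.net(x).chunk(2, dim=-1)
        return mu, log_var  # Gaussian over the length-`horizon` reward sequence
```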
arXiv Detail & Related papers (2023-02-19T15:47:24Z)
- Visual Adversarial Imitation Learning using Variational Models [60.69745540036375]
Reward function specification remains a major impediment for learning behaviors through deep reinforcement learning.
Visual demonstrations of desired behaviors often present an easier and more natural way to teach agents.
We develop a variational model-based adversarial imitation learning algorithm.
arXiv Detail & Related papers (2021-07-16T00:15:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.