View-Invariant Policy Learning via Zero-Shot Novel View Synthesis
- URL: http://arxiv.org/abs/2409.03685v2
- Date: Wed, 19 Feb 2025 21:10:44 GMT
- Title: View-Invariant Policy Learning via Zero-Shot Novel View Synthesis
- Authors: Stephen Tian, Blake Wulfe, Kyle Sargent, Katherine Liu, Sergey Zakharov, Vitor Guizilini, Jiajun Wu,
- Abstract summary: We investigate how knowledge from large-scale visual data of the world may be used to address one axis of variation for generalizable manipulation: observational viewpoint.
We study single-image novel view synthesis models, which learn 3D-aware scene-level priors by rendering images of the same scene from alternate camera viewpoints.
For practical application to diverse robotic data, these models must operate zero-shot, performing view synthesis on unseen tasks and environments.
- Score: 26.231630397802785
- License:
- Abstract: Large-scale visuomotor policy learning is a promising approach toward developing generalizable manipulation systems. Yet, policies that can be deployed on diverse embodiments, environments, and observational modalities remain elusive. In this work, we investigate how knowledge from large-scale visual data of the world may be used to address one axis of variation for generalizable manipulation: observational viewpoint. Specifically, we study single-image novel view synthesis models, which learn 3D-aware scene-level priors by rendering images of the same scene from alternate camera viewpoints given a single input image. For practical application to diverse robotic data, these models must operate zero-shot, performing view synthesis on unseen tasks and environments. We empirically analyze view synthesis models within a simple data-augmentation scheme that we call View Synthesis Augmentation (VISTA) to understand their capabilities for learning viewpoint-invariant policies from single-viewpoint demonstration data. Upon evaluating the robustness of policies trained with our method to out-of-distribution camera viewpoints, we find that they outperform baselines in both simulated and real-world manipulation tasks. Videos and additional visualizations are available at https://s-tian.github.io/projects/vista.
Related papers
- SimVS: Simulating World Inconsistencies for Robust View Synthesis [102.83898965828621]
We present an approach for leveraging generative video models to simulate the inconsistencies in the world that can occur during capture.
We demonstrate that our world-simulation strategy significantly outperforms traditional augmentation methods in handling real-world scene variations.
arXiv Detail & Related papers (2024-12-10T17:35:12Z) - Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.
Our findings are synthesized in Flex (Fly-lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.
We demonstrate the effectiveness of this approach on quadrotor fly-to-target tasks, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z) - Dreamitate: Real-World Visuomotor Policy Learning via Video Generation [49.03287909942888]
We propose a visuomotor policy learning framework that fine-tunes a video diffusion model on human demonstrations of a given task.
We generate an example of an execution of the task conditioned on images of a novel scene, and use this synthesized execution directly to control the robot.
arXiv Detail & Related papers (2024-06-24T17:59:45Z) - Learning Generalizable Manipulation Policies with Object-Centric 3D
Representations [65.55352131167213]
GROOT is an imitation learning method for learning robust policies with object-centric and 3D priors.
It builds policies that generalize beyond their initial training conditions for vision-based manipulation.
GROOT's performance excels in generalization over background changes, camera viewpoint shifts, and the presence of new object instances.
arXiv Detail & Related papers (2023-10-22T18:51:45Z) - Visual-Policy Learning through Multi-Camera View to Single-Camera View
Knowledge Distillation for Robot Manipulation Tasks [4.820787231200527]
We present a novel approach to enhance the generalization performance of vision-based Reinforcement Learning (RL) algorithms for robotic manipulation tasks.
Our proposed method involves utilizing a technique known as knowledge distillation, in which a pre-trained teacher'' policy trained with multiple camera viewpoints guides a student'' policy in learning from a single camera viewpoint.
The results demonstrate that the single-view visual student policy can successfully learn to grasp and lift a challenging object, which was not possible with a single-view policy alone.
arXiv Detail & Related papers (2023-03-13T11:42:38Z) - 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z) - Seeing All the Angles: Learning Multiview Manipulation Policies for
Contact-Rich Tasks from Demonstrations [7.51557557629519]
A successful multiview policy could be deployed on a mobile manipulation platform.
We demonstrate that a multiview policy can be found through imitation learning by collecting data from a variety of viewpoints.
We show that learning from multiview data has little, if any, penalty to performance for a fixed-view task compared to learning with an equivalent amount of fixed-view data.
arXiv Detail & Related papers (2021-04-28T17:43:29Z) - Model-Based Visual Planning with Self-Supervised Functional Distances [104.83979811803466]
We present a self-supervised method for model-based visual goal reaching.
Our approach learns entirely using offline, unlabeled data.
We find that this approach substantially outperforms both model-free and model-based prior methods.
arXiv Detail & Related papers (2020-12-30T23:59:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.