View-Invariant Policy Learning via Zero-Shot Novel View Synthesis
- URL: http://arxiv.org/abs/2409.03685v1
- Date: Thu, 5 Sep 2024 16:39:21 GMT
- Title: View-Invariant Policy Learning via Zero-Shot Novel View Synthesis
- Authors: Stephen Tian, Blake Wulfe, Kyle Sargent, Katherine Liu, Sergey Zakharov, Vitor Guizilini, Jiajun Wu
- Abstract summary: We investigate how knowledge from large-scale visual data of the world may be used to address one axis of variation for generalizable manipulation: observational viewpoint.
We study single-image novel view synthesis models, which learn 3D-aware scene-level priors by rendering images of the same scene from alternate camera viewpoints.
For practical application to diverse robotic data, these models must operate zero-shot, performing view synthesis on unseen tasks and environments.
- Score: 26.231630397802785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale visuomotor policy learning is a promising approach toward developing generalizable manipulation systems. Yet, policies that can be deployed on diverse embodiments, environments, and observational modalities remain elusive. In this work, we investigate how knowledge from large-scale visual data of the world may be used to address one axis of variation for generalizable manipulation: observational viewpoint. Specifically, we study single-image novel view synthesis models, which learn 3D-aware scene-level priors by rendering images of the same scene from alternate camera viewpoints given a single input image. For practical application to diverse robotic data, these models must operate zero-shot, performing view synthesis on unseen tasks and environments. We empirically analyze view synthesis models within a simple data-augmentation scheme that we call View Synthesis Augmentation (VISTA) to understand their capabilities for learning viewpoint-invariant policies from single-viewpoint demonstration data. Upon evaluating the robustness of policies trained with our method to out-of-distribution camera viewpoints, we find that they outperform baselines in both simulated and real-world manipulation tasks. Videos and additional visualizations are available at https://s-tian.github.io/projects/vista.
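The augmentation recipe described in the abstract is simple enough to sketch in code. The snippet below is a minimal, hypothetical illustration of the idea, assuming access to a zero-shot single-image novel view synthesis model exposed as a `synthesize(image, relative_pose)` callable; the function names, pose-sampling range, and data structures are assumptions made for illustration, not the authors' released implementation.

```python
# Minimal sketch of view-synthesis data augmentation in the spirit of VISTA.
# This is NOT the authors' code: the `synthesize` interface, the pose-sampling
# range, and the data structures below are assumptions made for illustration.

import random
from dataclasses import dataclass
from typing import Callable, List, Sequence

import numpy as np

Image = np.ndarray   # H x W x 3 camera observation
Pose = np.ndarray    # 4 x 4 relative camera transform
Action = np.ndarray  # robot action vector


@dataclass
class Transition:
    observation: Image
    action: Action


def sample_relative_pose(max_yaw_deg: float = 30.0) -> Pose:
    """Sample a small random rotation about the vertical axis as a relative camera pose."""
    yaw = np.deg2rad(random.uniform(-max_yaw_deg, max_yaw_deg))
    pose = np.eye(4)
    pose[:3, :3] = np.array(
        [[np.cos(yaw), -np.sin(yaw), 0.0],
         [np.sin(yaw),  np.cos(yaw), 0.0],
         [0.0,          0.0,         1.0]]
    )
    return pose


def augment_demonstrations(
    demos: Sequence[Transition],
    synthesize: Callable[[Image, Pose], Image],  # zero-shot single-image NVS model
    views_per_frame: int = 2,
) -> List[Transition]:
    """Render each demonstration frame from sampled alternate viewpoints.

    Actions are copied unchanged, so the augmented data pairs novel views
    with the original expert actions.
    """
    augmented: List[Transition] = list(demos)
    for transition in demos:
        for _ in range(views_per_frame):
            novel_view = synthesize(transition.observation, sample_relative_pose())
            augmented.append(Transition(observation=novel_view, action=transition.action))
    return augmented
```

A policy can then be trained with standard imitation learning on the augmented set, so viewpoint robustness is encouraged through the data alone rather than through changes to the policy architecture.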
Related papers
- Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.
Our findings are synthesized in Flex (Fly-lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.
We demonstrate the effectiveness of this approach on quadrotor fly-to-target tasks, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
- Dreamitate: Real-World Visuomotor Policy Learning via Video Generation [49.03287909942888]
We propose a visuomotor policy learning framework that fine-tunes a video diffusion model on human demonstrations of a given task.
We generate an example execution of the task conditioned on images of a novel scene and use this synthesized execution directly to control the robot.
arXiv Detail & Related papers (2024-06-24T17:59:45Z)
- Learning Generalizable Manipulation Policies with Object-Centric 3D Representations [65.55352131167213]
GROOT is an imitation learning method for learning robust policies with object-centric and 3D priors.
It builds policies that generalize beyond their initial training conditions for vision-based manipulation.
GROOT excels at generalizing to background changes, camera viewpoint shifts, and new object instances.
arXiv Detail & Related papers (2023-10-22T18:51:45Z)
- Visual-Policy Learning through Multi-Camera View to Single-Camera View Knowledge Distillation for Robot Manipulation Tasks [4.820787231200527]
We present a novel approach to enhance the generalization performance of vision-based Reinforcement Learning (RL) algorithms for robotic manipulation tasks.
Our method uses knowledge distillation, in which a pre-trained "teacher" policy trained with multiple camera viewpoints guides a "student" policy that learns from a single camera viewpoint (a minimal sketch of this setup follows the related papers list).
The results demonstrate that the single-view visual student policy can successfully learn to grasp and lift a challenging object, which was not possible with a single-view policy alone.
arXiv Detail & Related papers (2023-03-13T11:42:38Z)
- Novel View Synthesis from a Single Image via Unsupervised learning [27.639536023956122]
We propose an unsupervised network that learns a pixel transformation from a single source viewpoint.
The learned transformation allows us to synthesize a novel view from any single source viewpoint image of unknown pose.
arXiv Detail & Related papers (2021-10-29T06:32:49Z)
- 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z)
- Seeing All the Angles: Learning Multiview Manipulation Policies for Contact-Rich Tasks from Demonstrations [7.51557557629519]
A successful multiview policy could be deployed on a mobile manipulation platform.
We demonstrate that a multiview policy can be found through imitation learning by collecting data from a variety of viewpoints.
We show that learning from multiview data incurs little, if any, performance penalty on a fixed-view task compared to learning with an equivalent amount of fixed-view data.
arXiv Detail & Related papers (2021-04-28T17:43:29Z)
- Neural Body: Implicit Neural Representations with Structured Latent Codes for Novel View Synthesis of Dynamic Humans [56.63912568777483]
This paper addresses the challenge of novel view synthesis for a human performer from a very sparse set of camera views.
We propose Neural Body, a new human body representation which assumes that the learned neural representations at different frames share the same set of latent codes anchored to a deformable mesh.
Experiments on ZJU-MoCap show that our approach outperforms prior works by a large margin in terms of novel view synthesis quality.
arXiv Detail & Related papers (2020-12-31T18:55:38Z)
- Model-Based Visual Planning with Self-Supervised Functional Distances [104.83979811803466]
We present a self-supervised method for model-based visual goal reaching.
Our approach learns entirely using offline, unlabeled data.
We find that this approach substantially outperforms both model-free and model-based prior methods.
arXiv Detail & Related papers (2020-12-30T23:59:09Z)
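For the multi-view-to-single-view knowledge distillation entry above, the snippet below gives a minimal, hypothetical sketch of the teacher-student step it describes. The cited paper trains the teacher with reinforcement learning from multiple viewpoints; only the distillation update is shown here, and the network architecture, action dimensionality, and mean-squared-error imitation loss are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of distilling a multi-view "teacher" policy into a single-view
# "student" policy. Architectures, action dimension, and the MSE imitation loss
# are assumptions for illustration, not the cited paper's implementation.

import torch
import torch.nn as nn


class ConvPolicy(nn.Module):
    """Tiny image-to-action policy used as a stand-in for both teacher and student."""

    def __init__(self, in_channels: int, action_dim: int = 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, action_dim),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.net(images)


def distillation_step(
    teacher: ConvPolicy,            # pre-trained on multi-view observations, kept frozen
    student: ConvPolicy,            # sees only a single camera view
    multi_view_obs: torch.Tensor,   # (B, num_views * 3, H, W)
    single_view_obs: torch.Tensor,  # (B, 3, H, W)
    optimizer: torch.optim.Optimizer,
) -> float:
    """One update in which the student imitates the frozen teacher's actions."""
    with torch.no_grad():
        target_actions = teacher(multi_view_obs)
    predicted_actions = student(single_view_obs)
    loss = nn.functional.mse_loss(predicted_actions, target_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```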
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.