Self-Supervised Object Goal Navigation with In-Situ Finetuning
- URL: http://arxiv.org/abs/2212.05923v2
- Date: Sun, 2 Apr 2023 01:39:47 GMT
- Title: Self-Supervised Object Goal Navigation with In-Situ Finetuning
- Authors: So Yeon Min, Yao-Hung Hubert Tsai, Wei Ding, Ali Farhadi, Ruslan
Salakhutdinov, Yonatan Bisk, Jian Zhang
- Abstract summary: This work builds an agent that learns self-supervised models of the world via exploration.
We identify a strong source of self-supervision that can train all components of an ObjectNav agent.
We show that our agent can perform competitively in the real world and simulation.
- Score: 110.6053241629366
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A household robot should be able to navigate to target objects without
requiring users to first annotate everything in their home. Most current
approaches to object navigation do not test on real robots and rely solely on
reconstructed scans of houses and their expensively labeled semantic 3D meshes.
In this work, our goal is to build an agent that builds self-supervised models
of the world via exploration, the same as a child might - thus we (1) eschew
the expense of labeled 3D mesh and (2) enable self-supervised in-situ
finetuning in the real world. We identify a strong source of self-supervision
(Location Consistency - LocCon) that can train all components of an ObjectNav
agent, using unannotated simulated houses. Our key insight is that embodied
agents can leverage location consistency as a self-supervision signal -
collecting images from different views/angles and applying contrastive
learning. We show that our agent can perform competitively in the real world
and simulation. Our results also indicate that supervised training with 3D mesh
annotations causes models to learn simulation artifacts, which are not
transferable to the real world. In contrast, our LocCon shows the most robust
real-world transfer among the models we compare, and the real-world performance
of all models can be further improved with self-supervised LocCon in-situ
training.
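The abstract's key mechanism treats images captured from different views/angles of the same location as positive pairs for contrastive learning. The sketch below illustrates one plausible instantiation of such a location-consistency objective; the encoder architecture, batch construction, and InfoNCE-style loss are assumptions made for illustration and are not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

# Minimal sketch of a LocCon-style contrastive objective (assumed design).
# Each batch holds N locations, with two views per location captured by the
# agent from different angles during exploration.

class ViewEncoder(nn.Module):
    """ResNet backbone with a small projection head (an assumed architecture)."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()  # keep the 512-d pooled features
        self.backbone = backbone
        self.proj = nn.Sequential(
            nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(self.backbone(x)), dim=-1)


def loccon_loss(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1):
    """Symmetric InfoNCE: row i of z_a and z_b show the same location;
    all other rows in the batch act as negatives (different locations)."""
    logits = z_a @ z_b.t() / temperature  # (N, N) cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    encoder = ViewEncoder()
    view_a = torch.randn(8, 3, 224, 224)  # stand-ins for RGB frames of 8 locations
    view_b = torch.randn(8, 3, 224, 224)  # same locations, different camera angles
    loss = loccon_loss(encoder(view_a), encoder(view_b))
    loss.backward()
    print(f"LocCon loss: {loss.item():.4f}")
```

Under this reading, in-situ finetuning would reuse the same objective on frames the robot collects in its own deployment environment, which is how the abstract describes improving real-world performance without labels.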
Related papers
- Stimulating Imagination: Towards General-purpose Object Rearrangement [2.0885207827639785]
General-purpose object placement is a fundamental capability of intelligent robots.
We propose a framework named SPORT to accomplish this task.
SPORT learns a diffusion-based 3D pose estimator to ensure physically realistic results.
A set of simulation and real-world experiments demonstrate the potential of our approach to accomplish general-purpose object rearrangement.
arXiv Detail & Related papers (2024-08-03T03:53:05Z) - Dream2Real: Zero-Shot 3D Object Rearrangement with Vision-Language Models [12.965144877139393]
We introduce Dream2Real, a robotics framework which integrates vision-language models (VLMs) trained on 2D data into a 3D object rearrangement pipeline.
This is achieved by the robot autonomously constructing a 3D representation of the scene, where objects can be rearranged virtually and an image of the resulting arrangement rendered.
These renders are evaluated by a VLM, so that the arrangement which best satisfies the user instruction is selected and recreated in the real world with pick-and-place.
arXiv Detail & Related papers (2023-12-07T18:51:19Z) - ROAM: Robust and Object-Aware Motion Generation Using Neural Pose Descriptors [73.26004792375556]
This paper shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object.
We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object.
We demonstrate substantial improvements in 3D virtual character motion and interaction quality and robustness to scenarios with unseen objects.
arXiv Detail & Related papers (2023-08-24T17:59:51Z) - Visual Reinforcement Learning with Self-Supervised 3D Representations [15.991546692872841]
We present a unified framework for self-supervised learning of 3D representations for motor control.
Our method enjoys improved sample efficiency in simulated manipulation tasks compared to 2D representation learning methods.
arXiv Detail & Related papers (2022-10-13T17:59:55Z) - Object Manipulation via Visual Target Localization [64.05939029132394]
Training agents to manipulate objects poses many challenges.
We propose an approach that explores the environment in search for target objects, computes their 3D coordinates once they are located, and then continues to estimate their 3D locations even when the objects are not visible.
Our evaluations show a massive 3x improvement in success rate over a model that has access to the same sensory suite.
arXiv Detail & Related papers (2022-03-15T17:59:01Z) - SEAL: Self-supervised Embodied Active Learning using Exploration and 3D Consistency [122.18108118190334]
We present a framework called Self-supervised Embodied Active Learning (SEAL).
It utilizes perception models trained on internet images to learn an active exploration policy.
We build and utilize 3D semantic maps to learn both action and perception in a completely self-supervised manner.
arXiv Detail & Related papers (2021-12-02T06:26:38Z) - 3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations.
A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z) - Out of the Box: Embodied Navigation in the Real World [45.97756658635314]
We show how to transfer knowledge acquired in simulation into the real world.
We deploy our models on a LoCoBot equipped with a single Intel RealSense camera.
Our experiments indicate that it is possible to achieve satisfying results when deploying the obtained model in the real world.
arXiv Detail & Related papers (2021-05-12T18:00:14Z) - Hindsight for Foresight: Unsupervised Structured Dynamics Models from Physical Interaction [24.72947291987545]
A key challenge for an agent learning to interact with the world is to reason about the physical properties of objects.
We propose a novel approach for modeling the dynamics of a robot's interactions directly from unlabeled 3D point clouds and images.
arXiv Detail & Related papers (2020-08-02T11:04:49Z) - Improving Target-driven Visual Navigation with Attention on 3D Spatial Relationships [52.72020203771489]
We investigate target-driven visual navigation using deep reinforcement learning (DRL) in 3D indoor scenes.
Our proposed method combines visual features and 3D spatial representations to learn navigation policy.
Our experiments, performed in AI2-THOR, show that our model outperforms the baselines in both SR and SPL metrics.
arXiv Detail & Related papers (2020-04-29T08:46:38Z)