AnyPlace: Learning Generalized Object Placement for Robot Manipulation
- URL: http://arxiv.org/abs/2502.04531v1
- Date: Thu, 06 Feb 2025 22:04:13 GMT
- Title: AnyPlace: Learning Generalized Object Placement for Robot Manipulation
- Authors: Yuchi Zhao, Miroslav Bogdanovic, Chengyuan Luo, Steven Tohme, Kourosh Darvish, Alán Aspuru-Guzik, Florian Shkurti, Animesh Garg
- Abstract summary: We propose AnyPlace, a two-stage method trained entirely on synthetic data.
Our key insight is that by leveraging a Vision-Language Model to identify rough placement locations, we can focus only on the regions relevant for local placement.
For training, we generate a fully synthetic dataset of randomly generated objects in different placement configurations.
In real-world experiments, we show how our approach directly transfers models trained purely on synthetic data to the real world.
- Score: 37.725807003481904
- Abstract: Object placement in robotic tasks is inherently challenging due to the diversity of object geometries and placement configurations. To address this, we propose AnyPlace, a two-stage method trained entirely on synthetic data, capable of predicting a wide range of feasible placement poses for real-world tasks. Our key insight is that by leveraging a Vision-Language Model (VLM) to identify rough placement locations, we focus only on the relevant regions for local placement, which enables us to train the low-level placement-pose-prediction model to capture diverse placements efficiently. For training, we generate a fully synthetic dataset of randomly generated objects in different placement configurations (insertion, stacking, hanging) and train local placement-prediction models. We conduct extensive evaluations in simulation, demonstrating that our method outperforms baselines in terms of success rate, coverage of possible placement modes, and precision. In real-world experiments, we show how our approach directly transfers models trained purely on synthetic data to the real world, where it successfully performs placements in scenarios where other models struggle -- such as with varying object geometries, diverse placement modes, and achieving high precision for fine placement. More at: https://any-place.github.io.
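To make the two-stage decomposition in the abstract concrete, here is a minimal sketch of how such a pipeline could be wired together. Every name in it (query_vlm_for_region, LocalPoseModel, plan_placement) is a hypothetical stand-in under assumed interfaces, not the authors' actual code.

```python
# Minimal sketch of a two-stage placement pipeline in the spirit of the
# abstract; all names and interfaces here are illustrative assumptions.
import numpy as np

def query_vlm_for_region(rgb_image, prompt):
    """Stage 1: ask a Vision-Language Model for a rough placement location,
    returned as a 2D box (x0, y0, x1, y1) in pixel coordinates."""
    raise NotImplementedError("stand-in for a VLM query")

def crop_to_region(points, pixels, box):
    """Keep only the 3D points whose image projection falls inside the box,
    so the downstream pose model never reasons about the full scene."""
    x0, y0, x1, y1 = box
    mask = ((pixels[:, 0] >= x0) & (pixels[:, 0] <= x1) &
            (pixels[:, 1] >= y0) & (pixels[:, 1] <= y1))
    return points[mask]

class LocalPoseModel:
    """Stage 2: a placement-pose predictor trained on synthetic
    insertion/stacking/hanging data; returns candidate 6-DoF placement
    poses as 4x4 homogeneous matrices."""
    def sample_poses(self, object_points, region_points, n=32):
        raise NotImplementedError("stand-in for the learned model")

def plan_placement(rgb, points, pixels, object_points, prompt, model):
    box = query_vlm_for_region(rgb, prompt)       # coarse: where, roughly?
    region = crop_to_region(points, pixels, box)  # focus on that region only
    return model.sample_poses(object_points, region)  # fine: 6-DoF poses
```

The design point is that the VLM only answers a coarse "where, roughly?" question, while the locally trained model, seeing only the cropped region, can spend its capacity on precise, multi-modal pose prediction.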
Related papers
- Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.
Our findings are synthesized in Flex (Fly-lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.
We demonstrate the effectiveness of this approach on quadrotor fly-to-target tasks, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information [68.10033984296247]
This paper explores the domain of active localization, emphasizing the importance of viewpoint selection to enhance localization accuracy.
Our contributions include a data-driven approach with a simple architecture designed for real-time operation, a self-supervised training method, and the ability to consistently integrate our map into a planning framework tailored for real-world robotics applications.
arXiv Detail & Related papers (2024-07-22T12:32:09Z)
- Pre-training Contextual Location Embeddings in Personal Trajectories via Efficient Hierarchical Location Representations [30.493743596793212]
Pre-training location embeddings from human mobility data has become a popular approach for location-based services.
Previous studies have handled fewer than ten thousand distinct locations, which is insufficient for real-world applications.
We propose a Geo-Tokenizer, designed to efficiently reduce the number of locations to be trained by representing a location as a combination of several grids at different scales (a rough sketch of this idea follows the entry).
arXiv Detail & Related papers (2023-10-02T14:40:24Z)
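The multi-scale grid idea above can be illustrated in a few lines. The sketch below assumes three grid resolutions chosen purely for illustration; the paper's actual Geo-Tokenizer design and cell sizes may differ.

```python
# Illustrative multi-scale grid tokenization: a (lat, lon) coordinate becomes
# a short coarse-to-fine sequence of cell tokens, so the vocabulary is the
# union of a few small grids rather than one token per distinct location.
# Cell sizes are assumptions; the paper's Geo-Tokenizer may differ.

def grid_token(lat, lon, cell_deg, scale_id):
    """Map a coordinate to a discrete cell id at one grid resolution."""
    row = int((lat + 90.0) // cell_deg)
    col = int((lon + 180.0) // cell_deg)
    cols_per_row = int(round(360.0 / cell_deg))
    return f"s{scale_id}:{row * cols_per_row + col}"

def geo_tokenize(lat, lon, cell_sizes_deg=(1.0, 0.1, 0.01)):
    """Coarse cells are shared by many nearby locations, which is what
    shrinks the effective vocabulary the model has to learn."""
    return [grid_token(lat, lon, c, i) for i, c in enumerate(cell_sizes_deg)]

print(geo_tokenize(43.6629, -79.3957))  # three tokens, one per resolution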
- One-Shot Domain Adaptive and Generalizable Semantic Segmentation with Class-Aware Cross-Domain Transformers [96.51828911883456]
Unsupervised sim-to-real domain adaptation (UDA) for semantic segmentation aims to improve the real-world test performance of a model trained on simulated data.
Traditional UDA often assumes that there are abundant unlabeled real-world data samples available during training for the adaptation.
We explore the one-shot unsupervised sim-to-real domain adaptation (OSUDA) and generalization problem, where only one real-world data sample is available.
arXiv Detail & Related papers (2022-12-14T15:54:15Z)
- MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis [78.26022688167133]
We present a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis.
The proposed dataset contains 100,000 images and 25 different object types.
We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance.
arXiv Detail & Related papers (2021-12-29T17:23:24Z)
- Predicting Stable Configurations for Semantic Placement of Novel Objects [37.18437299513799]
Our goal is to enable robots to repose previously unseen objects according to learned semantic relationships in novel environments.
We build our models and training from the ground up to be tightly integrated with our proposed planning algorithm for semantic placement of unknown objects.
Our approach enables motion planning for semantic rearrangement of unknown objects in scenes with varying geometry from only RGB-D sensing.
arXiv Detail & Related papers (2021-08-26T23:05:05Z)
- PyraPose: Feature Pyramids for Fast and Accurate Object Pose Estimation under Domain Shift [26.037061005620263]
We argue that patch-based approaches, instead of encoder-decoder networks, are more suited for synthetic-to-real transfer.
We present a novel approach based on a specialized feature pyramid network that computes multi-scale features for generating pose hypotheses (a generic sketch of the feature-pyramid idea follows the entry).
Our single-shot pose estimation approach is evaluated on multiple standard datasets and outperforms the state of the art by up to 35%.
arXiv Detail & Related papers (2020-10-30T08:26:22Z)
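As a rough illustration of the feature-pyramid idea in the PyraPose entry (not the authors' published architecture), the PyTorch sketch below fuses multi-scale backbone features top-down and attaches a dense head that emits a pose hypothesis at every spatial location; the channel sizes and the 7-D pose parameterization are assumptions.

```python
# Generic FPN-style multi-scale fusion with a dense prediction head, as a
# stand-in for "pose hypotheses from multi-scale features". Channel sizes
# and the 7-D head are assumptions, not PyraPose's published design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFeaturePyramid(nn.Module):
    def __init__(self, in_channels=(64, 128, 256), out_channels=128, pose_dim=7):
        super().__init__()
        # 1x1 convs project each backbone stage to a common channel width
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        # a shared 3x3 head emits a pose hypothesis (an assumed quaternion +
        # translation parameterization) at every location of every level
        self.head = nn.Conv2d(out_channels, pose_dim, 3, padding=1)

    def forward(self, feats):  # feats ordered high-res -> low-res
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        # top-down pathway: upsample coarser features and merge them in
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [self.head(p) for p in laterals]  # dense hypotheses per level

fpn = TinyFeaturePyramid()
feats = [torch.randn(1, 64, 64, 64),
         torch.randn(1, 128, 32, 32),
         torch.randn(1, 256, 16, 16)]
hypotheses = fpn(feats)  # three maps of per-pixel 7-D pose hypotheses
```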
- DASGIL: Domain Adaptation for Semantic and Geometric-aware Image-based Localization [27.294822556484345]
Long-term visual localization under changing environments is a challenging problem in autonomous driving and mobile robotics.
We propose a novel multi-task architecture that fuses geometric and semantic information into a multi-scale latent embedding for visual place recognition (a rough sketch follows this list).
arXiv Detail & Related papers (2020-10-01T17:44:25Z)
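To make the fusion idea in the DASGIL entry concrete, here is a minimal sketch of a shared encoder whose geometric (depth) and semantic (segmentation) heads shape the features, with multi-scale latents pooled into a single place-recognition descriptor; every layer size is an assumption, not the published design.

```python
# Minimal multi-task sketch: one shared encoder, a depth head and a
# segmentation head provide geometric/semantic supervision, and the
# multi-scale latents are pooled into a single retrieval descriptor.
# All layer sizes are illustrative assumptions, not DASGIL's design.
import torch
import torch.nn as nn

class MultiTaskEmbedder(nn.Module):
    def __init__(self, num_classes=19):
        super().__init__()
        # shared encoder; two stages yield two scales of latent features
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.depth_head = nn.Conv2d(64, 1, 1)          # geometric supervision
        self.seg_head = nn.Conv2d(64, num_classes, 1)  # semantic supervision
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, img):
        f1 = self.stage1(img)
        f2 = self.stage2(f1)
        # fuse both scales into one global descriptor for place retrieval
        descriptor = torch.cat([self.pool(f1).flatten(1),
                                self.pool(f2).flatten(1)], dim=1)
        return self.depth_head(f2), self.seg_head(f2), descriptor

model = MultiTaskEmbedder()
depth, seg, desc = model(torch.randn(1, 3, 128, 128))  # desc: (1, 96)
```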
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.