Robo-ABC: Affordance Generalization Beyond Categories via Semantic
Correspondence for Robot Manipulation
- URL: http://arxiv.org/abs/2401.07487v1
- Date: Mon, 15 Jan 2024 06:02:30 GMT
- Title: Robo-ABC: Affordance Generalization Beyond Categories via Semantic
Correspondence for Robot Manipulation
- Authors: Yuanchen Ju, Kaizhe Hu, Guowei Zhang, Gu Zhang, Mingrun Jiang, Huazhe
Xu
- Abstract summary: We present Robo-ABC, a framework for robotic manipulation that generalizes to out-of-distribution scenes.
We show that Robo-ABC improves the accuracy of visual affordance retrieval by a large margin (31.6%) over end-to-end affordance models.
Robo-ABC achieved a success rate of 85.7%, proving its capacity for real-world tasks.
- Score: 20.69293648286978
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Enabling robotic manipulation that generalizes to out-of-distribution scenes
is a crucial step toward open-world embodied intelligence. For human beings,
this ability is rooted in the understanding of semantic correspondence among
objects, which naturally transfers the interaction experience of familiar
objects to novel ones. Although robots lack such a reservoir of interaction
experience, the vast availability of human videos on the Internet may serve as
a valuable resource, from which we extract an affordance memory including the
contact points. Inspired by the natural way humans think, we propose Robo-ABC:
when confronted with unfamiliar objects that require generalization, the robot
can acquire affordance by retrieving objects that share visual or semantic
similarities from the affordance memory. The next step is to map the contact
points of the retrieved objects to the new object. While establishing this
correspondence may present formidable challenges at first glance, recent
research finds it naturally arises from pre-trained diffusion models, enabling
affordance mapping even across disparate object categories. Through the
Robo-ABC framework, robots may generalize to manipulate out-of-category objects
in a zero-shot manner without any manual annotation, additional training, part
segmentation, pre-coded knowledge, or viewpoint restrictions. Quantitatively,
Robo-ABC improves the accuracy of visual affordance retrieval by a large
margin of 31.6% compared to state-of-the-art (SOTA) end-to-end
affordance models. We also conduct real-world experiments of cross-category
object-grasping tasks. Robo-ABC achieved a success rate of 85.7%, proving its
capacity for real-world tasks.
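
As a rough illustration of the pipeline described in the abstract, the sketch below retrieves the most similar record from an affordance memory of (global embedding, dense feature map, contact point) entries and transfers the stored contact point to a novel object by nearest-neighbor feature matching. This is a minimal sketch under assumptions, not the authors' implementation: all names (retrieve_nearest, transfer_contact_point, the random tensors) are hypothetical, the paper mines its memory from human videos, and the dense features come from a pre-trained diffusion model rather than the random arrays used here.

```python
# Hedged sketch of the Robo-ABC idea (hypothetical names throughout):
# retrieve a similar object from an "affordance memory" of
# (global embedding, dense feature map, contact point) records, then map
# its contact point onto the novel object via dense feature matching.
# Random arrays stand in for real embeddings and diffusion features so the
# example runs on its own.
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def retrieve_nearest(memory, query_embedding):
    """Return the memory record whose global embedding best matches the query."""
    return max(memory, key=lambda rec: cosine_similarity(rec["embedding"], query_embedding))


def transfer_contact_point(source_features, contact_uv, target_features):
    """Map a contact pixel from the retrieved object to the novel object.

    Both feature maps are (H, W, C) dense descriptors; the target location
    whose descriptor is closest to the source descriptor at the contact
    point is proposed as the new contact point.
    """
    src_vec = source_features[contact_uv]                 # (C,) descriptor at contact
    h, w, c = target_features.shape
    flat = target_features.reshape(-1, c)
    sims = flat @ src_vec / (
        np.linalg.norm(flat, axis=1) * np.linalg.norm(src_vec) + 1e-8
    )
    return divmod(int(np.argmax(sims)), w)                # (row, col) on the new object


# Toy usage with random tensors standing in for real embeddings/features.
rng = np.random.default_rng(0)
memory = [
    {"embedding": rng.normal(size=512),
     "features": rng.normal(size=(16, 16, 64)),
     "contact_uv": (4, 7)},
    {"embedding": rng.normal(size=512),
     "features": rng.normal(size=(16, 16, 64)),
     "contact_uv": (10, 3)},
]
record = retrieve_nearest(memory, rng.normal(size=512))
print(transfer_contact_point(record["features"], record["contact_uv"],
                             rng.normal(size=(16, 16, 64))))
```

The cross-category behavior claimed in the abstract hinges on the quality of the dense descriptors; swapping the random arrays for real diffusion-model features is what would make the nearest-neighbor correspondence meaningful across object categories.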
Related papers
- Compositional Zero-Shot Learning for Attribute-Based Object Reference in
Human-Robot Interaction [0.0]
Language-enabled robots must be able to comprehend referring expressions to identify a particular object from visual perception.
Visual observations of an object may not be available when it is referred to, and the number of objects and attributes may also be unbounded in open worlds.
We implement an attribute-based zero-shot learning method that uses a list of attributes to perform referring expression comprehension in open worlds.
arXiv Detail & Related papers (2023-12-21T08:29:41Z)
- Teaching Unknown Objects by Leveraging Human Gaze and Augmented Reality in Human-Robot Interaction [3.1473798197405953]
This dissertation aims to teach a robot unknown objects in the context of Human-Robot Interaction (HRI).
The combination of eye tracking and Augmented Reality created a powerful synergy that empowered the human teacher to communicate with the robot.
The robot's object detection capabilities exhibited comparable performance to state-of-the-art object detectors trained on extensive datasets.
arXiv Detail & Related papers (2023-12-12T11:34:43Z)
- Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z)
- Open-World Object Manipulation using Pre-trained Vision-Language Models [72.87306011500084]
For robots to follow instructions from people, they must be able to connect the rich semantic information in human vocabulary to their sensory observations and actions.
We develop a simple approach, called MOO, which leverages a pre-trained vision-language model to extract object-identifying information.
In a variety of experiments on a real mobile manipulator, we find that MOO generalizes zero-shot to a wide range of novel object categories and environments.
arXiv Detail & Related papers (2023-03-02T01:55:10Z)
- Learning Reward Functions for Robotic Manipulation by Observing Humans [92.30657414416527]
We use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies.
The learned rewards are based on distances to a goal in an embedding space learned using a time-contrastive objective; a minimal sketch of this reward appears after this list.
arXiv Detail & Related papers (2022-11-16T16:26:48Z)
- DemoGrasp: Few-Shot Learning for Robotic Grasping with Human Demonstration [42.19014385637538]
We propose to teach a robot how to grasp an object with a simple and short human demonstration.
We first present a small sequence of RGB-D images displaying a human-object interaction.
This sequence is then leveraged to build associated hand and object meshes that represent the interaction.
arXiv Detail & Related papers (2021-12-06T08:17:12Z)
- INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for object blocking relationship (OBR) detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z)
- Simultaneous Multi-View Object Recognition and Grasping in Open-Ended Domains [0.0]
We propose a deep learning architecture with augmented memory capacities to handle open-ended object recognition and grasping simultaneously.
We demonstrate the ability of our approach to grasp never-seen-before objects and to rapidly learn new object categories using very few examples on-site in both simulation and real-world settings.
arXiv Detail & Related papers (2021-06-03T14:12:11Z)
- Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human Videos [59.58105314783289]
Domain-agnostic Video Discriminator (DVD) learns multitask reward functions by training a discriminator to classify whether two videos are performing the same task.
DVD can generalize by virtue of learning from a small amount of robot data with a broad dataset of human videos.
DVD can be combined with visual model predictive control to solve robotic manipulation tasks on a real WidowX200 robot in an unseen environment from a single human demo.
arXiv Detail & Related papers (2021-03-31T05:25:05Z)
- Joint Inference of States, Robot Knowledge, and Human (False-)Beliefs [90.20235972293801]
Aiming to understand how human (false-)belief, a core socio-cognitive ability, would affect human interactions with robots, this paper proposes a graphical model to represent object states, robot knowledge, and human (false-)beliefs.
An inference algorithm is derived to fuse the individual parse graphs (pg) from all robots across multiple views into a joint pg, affording more effective reasoning that overcomes errors originating from a single view.
arXiv Detail & Related papers (2020-04-25T23:02:04Z)
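
For the "Learning Reward Functions for Robotic Manipulation by Observing Humans" entry above, the reward construction can be pictured with a minimal sketch, assuming an encoder has already been trained (with a time-contrastive objective in that paper): the reward for a state image is the negative distance to the goal image in embedding space. The embed function below is a stand-in random projection, not the paper's network; all names are hypothetical.

```python
# Minimal sketch of a distance-to-goal reward in a learned embedding space,
# as summarized in the "Learning Reward Functions ... by Observing Humans"
# entry. `embed` is a placeholder random projection, not the paper's
# time-contrastive encoder; it only keeps the example self-contained.
import numpy as np

rng = np.random.default_rng(0)
PROJECTION = rng.normal(size=(3 * 64 * 64, 32))  # stand-in for a trained encoder


def embed(image: np.ndarray) -> np.ndarray:
    """Map an image of shape (3, 64, 64) to a 32-d embedding (placeholder)."""
    return image.reshape(-1) @ PROJECTION


def reward(state_image: np.ndarray, goal_image: np.ndarray) -> float:
    """Reward = negative distance to the goal image in embedding space."""
    return -float(np.linalg.norm(embed(state_image) - embed(goal_image)))


# A state identical to the goal gets the maximal (zero) reward.
goal = rng.normal(size=(3, 64, 64))
state = goal + 0.1 * rng.normal(size=(3, 64, 64))
print(reward(state, goal), reward(goal, goal))
```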