Learning 6-DoF Object Poses to Grasp Category-level Objects by Language Instructions
- URL: http://arxiv.org/abs/2205.04028v1
- Date: Mon, 9 May 2022 04:25:14 GMT
- Title: Learning 6-DoF Object Poses to Grasp Category-level Objects by Language Instructions
- Authors: Chilam Cheang, Haitao Lin, Yanwei Fu, Xiangyang Xue
- Abstract summary: This paper studies the task of grasping arbitrary objects from known categories by following free-form language instructions.
We bring these disciplines together on this open challenge, which is essential to human-robot interaction.
We propose a language-guided 6-DoF category-level object localization model to achieve robotic grasping by comprehending human intention.
- Score: 74.63313641583602
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper studies the task of grasping arbitrary objects from known
categories by following free-form language instructions. This task demands techniques
from computer vision, natural language processing, and robotics. We bring these
disciplines together on this open challenge, which is essential to human-robot interaction.
Critically, the key challenge lies in inferring the category of objects from
linguistic instructions and accurately estimating the 6-DoF pose of unseen objects
from known classes. In contrast, previous works focus on inferring the pose of object
candidates at the instance level, which significantly limits their applicability in
real-world scenarios. In this paper, we propose a language-guided 6-DoF
category-level object localization model to achieve robotic grasping by
comprehending human intention. To this end, we
propose a novel two-stage method. Specifically, the first stage grounds the target in
the RGB image through language descriptions of object names, attributes, and spatial
relations. The second stage extracts and segments point clouds from the cropped depth
image and estimates the full 6-DoF object pose at the category level. In this manner,
our approach can locate the specified object by following human instructions and
estimate the full 6-DoF pose of an unseen instance from a known category that was not
used to train the model. Extensive experimental results show that our method is
competitive with the state-of-the-art language-conditioned grasping method.
Importantly, we deploy
our approach on a physical robot to validate the usability of our framework in
real-world applications. Please refer to the supplementary material for demo videos
of our robot experiments.
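As a rough illustration of the geometry behind the second stage, the sketch below shows how a cropped depth image can be back-projected into a camera-frame point cloud with the pinhole camera model before segmentation and category-level pose estimation. This is a minimal sketch, not the authors' implementation: the camera intrinsics, the bounding box standing in for the language-grounded region, and the depth values are all hypothetical placeholders.

```python
# Minimal sketch (not the paper's code): back-project a depth crop into a
# 3-D point cloud with the pinhole model, as a stand-in for the step between
# language grounding (stage one) and category-level pose estimation (stage two).
import numpy as np

def depth_crop_to_point_cloud(depth, box, fx, fy, cx, cy, depth_scale=0.001):
    """Back-project depth pixels inside box = (x0, y0, x1, y1) to camera-frame points."""
    x0, y0, x1, y1 = box
    z = depth[y0:y1, x0:x1].astype(np.float32) * depth_scale   # raw depth -> metres
    us, vs = np.meshgrid(np.arange(x0, x1), np.arange(y0, y1)) # pixel coordinates of the crop
    valid = z > 0                                              # drop missing depth readings
    x = (us - cx) * z / fx                                     # pinhole back-projection
    y = (vs - cy) * z / fy
    return np.stack([x[valid], y[valid], z[valid]], axis=-1)   # (N, 3) point cloud

if __name__ == "__main__":
    # Hypothetical 640x480 depth frame (millimetres) and a box that would come
    # from the language-grounding stage in the paper's pipeline.
    depth = np.full((480, 640), 800, dtype=np.uint16)
    box = (200, 150, 280, 230)
    cloud = depth_crop_to_point_cloud(depth, box, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
    print(cloud.shape)  # e.g. (6400, 3); this cloud would feed segmentation + pose estimation
```

In the described pipeline, the resulting point cloud would then be segmented and passed to the category-level pose estimator; the print statement above simply stands in for that downstream step.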
Related papers
- In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding.
Our model requires no additional training yet shows great performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z) - Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance [13.246380364455494]
We present a new approach for language-driven 6-DoF grasp detection in cluttered point clouds.
The proposed negative prompt strategy directs the detection process toward the desired object while steering away from unwanted ones.
Our method enables an end-to-end framework where humans can command the robot to grasp desired objects in a cluttered scene using natural language.
arXiv Detail & Related papers (2024-07-18T18:24:51Z) - AffordanceLLM: Grounding Affordance from Vision Language Models [36.97072698640563]
Affordance grounding refers to the task of finding the area of an object with which one can interact.
Much of the required knowledge is hidden and lies beyond the image content and the supervised labels of a limited training set.
We attempt to improve the generalization capability of current affordance grounding by taking advantage of rich world, abstract, and human-object-interaction knowledge.
arXiv Detail & Related papers (2024-01-12T03:21:02Z) - RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic
Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z) - INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z) - Language Grounding with 3D Objects [60.67796160959387]
We introduce a novel reasoning task that targets both visual and non-visual language about 3D objects.
We introduce several CLIP-based models for distinguishing objects.
We find that adding view estimation to language grounding models improves accuracy both on SNARE and when identifying objects referred to in language on a robot platform.
arXiv Detail & Related papers (2021-07-26T23:35:58Z) - LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z) - Simultaneous Multi-View Object Recognition and Grasping in Open-Ended
Domains [0.0]
We propose a deep learning architecture with augmented memory capacities to handle open-ended object recognition and grasping simultaneously.
We demonstrate the ability of our approach to grasp never-seen-before objects and to rapidly learn new object categories using very few examples on-site in both simulation and real-world settings.
arXiv Detail & Related papers (2021-06-03T14:12:11Z)