Robot Object Retrieval with Contextual Natural Language Queries
- URL: http://arxiv.org/abs/2006.13253v1
- Date: Tue, 23 Jun 2020 18:13:40 GMT
- Title: Robot Object Retrieval with Contextual Natural Language Queries
- Authors: Thao Nguyen, Nakul Gopalan, Roma Patel, Matt Corsaro, Ellie Pavlick,
Stefanie Tellex
- Abstract summary: We develop a model to retrieve objects based on descriptions of their usage.
Our model directly predicts an object's appearance from the object's use specified by a verb phrase.
Based on contextual information present in the language commands, our model can generalize to unseen object classes and unknown nouns.
- Score: 26.88600852700681
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural language object retrieval is a highly useful yet challenging task for
robots in human-centric environments. Previous work has primarily focused on
commands specifying the desired object's type such as "scissors" and/or visual
attributes such as "red," thus limiting the robot to only known object classes.
We develop a model to retrieve objects based on descriptions of their usage.
The model takes in a language command containing a verb, for example "Hand me
something to cut," and RGB images of candidate objects and selects the object
that best satisfies the task specified by the verb. Our model directly predicts
an object's appearance from the object's use specified by a verb phrase. We do
not need to explicitly specify an object's class label. Our approach allows us
to predict high level concepts like an object's utility based on the language
query. Based on contextual information present in the language commands, our
model can generalize to unseen object classes and unknown nouns in the
commands. Our model correctly selects objects out of sets of five candidates to
fulfill natural language commands, and achieves an average accuracy of 62.3% on
a held-out test set of unseen ImageNet object classes and 53.0% on unseen
object classes and unknown nouns. Our model also achieves an average accuracy
of 54.7% on unseen YCB object classes, which have a different image
distribution from ImageNet objects. We demonstrate our model on a KUKA LBR iiwa
robot arm, enabling the robot to retrieve objects based on natural language
descriptions of their usage. We also present a new dataset of 655 verb-object
pairs denoting object usage over 50 verbs and 216 object classes.
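As a rough illustration of the selection step described in the abstract, the sketch below scores each candidate object's image embedding against an embedding derived from the verb phrase and returns the highest-scoring candidate. The encoder choice, embedding dimensionality, and cosine-similarity scoring here are assumptions for illustration only, not the authors' actual architecture.

```python
# Minimal sketch of verb-conditioned object retrieval (not the authors' code).
# Assumes hypothetical encoders that map a verb phrase (e.g. "something to cut")
# and each candidate RGB image into a shared embedding space; the best-matching
# candidate is then selected by similarity.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_object(verb_embedding: np.ndarray,
                    candidate_embeddings: list) -> int:
    """Return the index of the candidate object that best fits the verb phrase."""
    scores = [cosine_similarity(verb_embedding, img) for img in candidate_embeddings]
    return int(np.argmax(scores))

# Toy usage: five candidate objects with 4-dimensional placeholder embeddings.
verb_emb = np.array([0.9, 0.1, 0.0, 0.2])            # hypothetical verb-phrase embedding
candidates = [np.random.rand(4) for _ in range(5)]    # hypothetical image embeddings
print("Selected candidate index:", retrieve_object(verb_emb, candidates))
```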
Related papers
- Skill Generalization with Verbs [20.90116318432194]
It is imperative that robots can understand natural language commands issued by humans.
We propose a method for generalizing manipulation skills to novel objects using verbs.
We show that our model can generate trajectories that are usable for executing five verb commands applied to novel instances of two different object categories on a real robot.
arXiv Detail & Related papers (2024-10-18T02:12:18Z)
- Which One? Leveraging Context Between Objects and Multiple Views for Language Grounding [77.26626173589746]
We present the Multi-view Approach to Grounding in Context (MAGiC)
It selects an object referent based on language that distinguishes between two similar objects.
It improves over the state-of-the-art model on the SNARE object reference task with a relative error reduction of 12.9%.
arXiv Detail & Related papers (2023-11-12T00:21:58Z)
- ShapeShift: Superquadric-based Object Pose Estimation for Robotic Grasping [85.38689479346276]
Current techniques heavily rely on a reference 3D object, limiting their generalizability and making it expensive to expand to new object categories.
This paper proposes ShapeShift, a superquadric-based framework for object pose estimation that predicts the object's pose relative to a primitive shape which is fitted to the object.
arXiv Detail & Related papers (2023-04-10T20:55:41Z)
- DoUnseen: Tuning-Free Class-Adaptive Object Detection of Unseen Objects for Robotic Grasping [1.6317061277457001]
We develop an object detector that requires no fine-tuning and can add any object as a class just by capturing a few images of the object.
We evaluate our class-adaptive object detector on unseen datasets and compare it to a trained Mask R-CNN on those datasets.
arXiv Detail & Related papers (2023-04-06T02:45:39Z)
- Learning 6-DoF Object Poses to Grasp Category-level Objects by Language Instructions [74.63313641583602]
This paper studies the task of grasping any object from known categories using free-form language instructions.
We bring these disciplines together on this open challenge, which is essential to human-robot interaction.
We propose a language-guided 6-DoF category-level object localization model to achieve robotic grasping by comprehending human intention.
arXiv Detail & Related papers (2022-05-09T04:25:14Z)
- Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [80.29069988090912]
We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction.
We propose to leverage offline robot datasets with crowd-sourced natural language labels.
We find that our approach outperforms both goal-image specifications and language conditioned imitation techniques by more than 25%.
arXiv Detail & Related papers (2021-09-02T17:42:13Z)
- Language Grounding with 3D Objects [60.67796160959387]
We introduce a novel reasoning task that targets both visual and non-visual language about 3D objects.
We introduce several CLIP-based models for distinguishing objects.
We find that adding view estimation to language grounding models improves accuracy both on SNARE and when identifying objects referred to in language on a robot platform.
arXiv Detail & Related papers (2021-07-26T23:35:58Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
- Few-shot Object Grounding and Mapping for Natural Language Robot Instruction Following [15.896892723068932]
We study the problem of learning a robot policy to follow natural language instructions that can be easily extended to reason about new objects.
We introduce a few-shot language-conditioned object grounding method trained from augmented reality data.
We present a learned map representation that encodes object locations and their instructed use, and construct it from our few-shot grounding output.
arXiv Detail & Related papers (2020-11-14T20:35:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.