Compositional Zero-Shot Learning for Attribute-Based Object Reference in
Human-Robot Interaction
- URL: http://arxiv.org/abs/2312.13655v1
- Date: Thu, 21 Dec 2023 08:29:41 GMT
- Title: Compositional Zero-Shot Learning for Attribute-Based Object Reference in
Human-Robot Interaction
- Authors: Peng Gao (1), Ahmed Jaafar (1), Brian Reily (2), Christopher Reardon
(3), Hao Zhang (1) ((1) University of Massachusetts Amherst, (2) DEVCOM Army
Research Laboratory, (3) University of Denver)
- Abstract summary: Language-enabled robots must be able to comprehend referring expressions to identify a particular object from visual perception.
Visual observations of an object may not be available when it is referred to, and the number of objects and attributes may also be unbounded in open worlds.
We implement an attribute-based zero-shot learning method that uses a list of attributes to perform referring expression comprehension in open worlds.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language-enabled robots have been widely studied over the past years to
enable natural human-robot interaction and teaming in various real-world
applications. Language-enabled robots must be able to comprehend referring
expressions to identify a particular object from visual perception using a set
of referring attributes extracted from natural language. However, visual
observations of an object may not be available when it is referred to, and the
number of objects and attributes may also be unbounded in open worlds. To
address these challenges, we implement an attribute-based compositional zero-shot
learning method that uses a list of attributes to perform referring expression
comprehension in open worlds. We evaluate the approach on two datasets,
MIT-States and Clothing 16K. The preliminary experimental
results show that our implemented approach allows a robot to correctly identify
the objects referred to by human commands.
Related papers
- Context-Aware Command Understanding for Tabletop Scenarios [1.7082212774297747]
This paper presents a novel hybrid algorithm designed to interpret natural human commands in tabletop scenarios.
By integrating multiple sources of information, including speech, gestures, and scene context, the system extracts actionable instructions for a robot.
We discuss the strengths and limitations of the system, with particular focus on how it handles multimodal command interpretation.
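The entry above names its input modalities (speech, gestures, scene context) without implementation detail. The following is a heavily simplified, hypothetical sketch of fusing such cues into an actionable instruction; the data structures and the resolution rule are assumptions for illustration only.
```python
from dataclasses import dataclass

@dataclass
class SceneObject:
    name: str
    position: tuple  # (x, y) on the tabletop, illustrative units

def resolve_command(transcript, pointed_at, scene):
    """Toy fusion rule: prefer an object named in speech; otherwise fall
    back to the object closest to the pointing gesture."""
    for obj in scene:
        if obj.name in transcript.lower():
            return {"action": "pick", "target": obj.name}
    if pointed_at is not None:
        closest = min(scene, key=lambda o: (o.position[0] - pointed_at[0]) ** 2
                                           + (o.position[1] - pointed_at[1]) ** 2)
        return {"action": "pick", "target": closest.name}
    return {"action": "ask_clarification"}

scene = [SceneObject("mug", (0.2, 0.1)), SceneObject("screwdriver", (0.6, 0.4))]
print(resolve_command("hand me that", pointed_at=(0.58, 0.35), scene=scene))
```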
arXiv Detail & Related papers (2024-10-08T20:46:39Z)
- Learning Object Properties Using Robot Proprioception via Differentiable Robot-Object Interaction [52.12746368727368]
Differentiable simulation has become a powerful tool for system identification.
Our approach calibrates object properties by using information from the robot, without relying on data from the object itself.
We demonstrate the effectiveness of our method on a low-cost robotic platform.
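The entry above describes identifying object properties from robot-side signals via differentiable simulation. The sketch below shows the general idea of gradient-based system identification with a toy differentiable dynamics model; the one-dimensional pushing model, the loss, and the recovered parameter (mass) are illustrative assumptions, not the paper's setup.
```python
import torch

def simulate(force, mass, dt=0.01, steps=100):
    """Toy differentiable 'simulation': a robot pushes a block with a known
    force; the resulting displacement depends on the unknown mass."""
    vel = torch.zeros(())
    pos = torch.zeros(())
    for _ in range(steps):
        acc = force / mass
        vel = vel + acc * dt
        pos = pos + vel * dt
    return pos

true_mass = torch.tensor(2.5)
force = torch.tensor(1.0)
observed_pos = simulate(force, true_mass)  # stands in for proprioceptive data

# Recover the mass by gradient descent through the differentiable simulator.
mass_hat = torch.tensor(1.0, requires_grad=True)
opt = torch.optim.Adam([mass_hat], lr=0.05)
for _ in range(300):
    opt.zero_grad()
    loss = (simulate(force, mass_hat) - observed_pos) ** 2
    loss.backward()
    opt.step()

print(f"estimated mass: {mass_hat.item():.3f} (true: {true_mass.item()})")
```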
arXiv Detail & Related papers (2024-10-04T20:48:38Z)
- Robo-ABC: Affordance Generalization Beyond Categories via Semantic Correspondence for Robot Manipulation [20.69293648286978]
We present Robo-ABC, a framework for robotic manipulation that generalizes to out-of-distribution scenes.
We show that Robo-ABC significantly enhances the accuracy of visual affordance retrieval.
Robo-ABC achieved a success rate of 85.7%, demonstrating its capacity for real-world tasks.
arXiv Detail & Related papers (2024-01-15T06:02:30Z)
- Teaching Unknown Objects by Leveraging Human Gaze and Augmented Reality in Human-Robot Interaction [3.1473798197405953]
This dissertation aims to teach a robot unknown objects in the context of Human-Robot Interaction (HRI).
The combination of eye tracking and Augmented Reality created a powerful synergy that empowered the human teacher to communicate with the robot.
The robot's object detection capabilities exhibited comparable performance to state-of-the-art object detectors trained on extensive datasets.
arXiv Detail & Related papers (2023-12-12T11:34:43Z)
- Open-World Object Manipulation using Pre-trained Vision-Language Models [72.87306011500084]
For robots to follow instructions from people, they must be able to connect the rich semantic information in human vocabulary to their sensory observations and actions.
We develop a simple approach that leverages a pre-trained vision-language model to extract object-identifying information.
In a variety of experiments on a real mobile manipulator, we find that MOO generalizes zero-shot to a wide range of novel object categories and environments.
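MOO's use of a pre-trained vision-language model is summarized only briefly above. A minimal sketch of the general recipe, scoring candidate image crops against the object phrase from an instruction with CLIP, is given below; the model choice, the source of crop proposals, and the example phrase are assumptions, not details from the paper.
```python
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def identify_target(crops, object_phrase):
    """Score candidate object crops (PIL images) against the phrase
    extracted from the instruction; return the index of the best match."""
    images = torch.stack([preprocess(c) for c in crops]).to(device)
    text = clip.tokenize([object_phrase]).to(device)
    with torch.no_grad():
        image_feats = model.encode_image(images)
        text_feats = model.encode_text(text)
        image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
        scores = (image_feats @ text_feats.T).squeeze(-1)
    return int(scores.argmax())

# Usage (hypothetical): crops would come from an off-the-shelf region proposer.
# best = identify_target([Image.open("crop0.png"), Image.open("crop1.png")],
#                        "the pink stuffed whale")
```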
arXiv Detail & Related papers (2023-03-02T01:55:10Z)
- Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach [0.0]
We present a neurosymbolic architecture for coupling language-guided visual reasoning with robot manipulation.
A non-expert human user can prompt the robot using unconstrained natural language, providing a referring expression (REF), a question (VQA) or a grasp action instruction.
We generate a 3D vision-and-language synthetic dataset of tabletop scenes in a simulation environment to train our approach and perform extensive evaluations in both synthetic and real-world scenes.
arXiv Detail & Related papers (2022-10-03T12:21:45Z)
- INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for object blocking relationship (OBR) detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
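The POMDP over learned modules is only named above; the sketch below shows the kind of Bayesian belief update such a system maintains over which object the user means, given a noisy answer to a clarifying question. The two-question scenario, the candidate list, and the observation model are illustrative assumptions.
```python
import numpy as np

# Belief over which detected object the user refers to, initialized from a
# (hypothetical) visual grounding module's scores.
candidates = ["red mug", "blue mug", "notebook"]
belief = np.array([0.5, 0.4, 0.1])

def update_belief(belief, asked_idx, answer_yes, p_correct=0.9):
    """Bayes update after asking 'Do you mean the <candidates[asked_idx]>?'
    and hearing a noisy yes/no answer."""
    likelihood = np.full(len(belief), 1 - p_correct)
    likelihood[asked_idx] = p_correct
    if not answer_yes:
        likelihood = 1 - likelihood
    posterior = belief * likelihood
    return posterior / posterior.sum()

# The user says "no" for the red mug, then "yes" for the blue mug.
belief = update_belief(belief, asked_idx=0, answer_yes=False)
belief = update_belief(belief, asked_idx=1, answer_yes=True)
print(dict(zip(candidates, belief.round(3))))
```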
arXiv Detail & Related papers (2021-08-25T07:35:21Z)
- Language Grounding with 3D Objects [60.67796160959387]
We introduce a novel reasoning task that targets both visual and non-visual language about 3D objects.
We introduce several CLIP-based models for distinguishing objects.
We find that adding view estimation to language grounding models improves accuracy both on the SNARE benchmark and when identifying objects referred to in language on a robot platform.
arXiv Detail & Related papers (2021-07-26T23:35:58Z)
- Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
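The contrastive-feature framework above is described only at a high level. The sketch below shows a generic InfoNCE-style contrastive loss between expression embeddings and object-instance embeddings, one standard way to realize such training; the batch construction and the placeholder features are assumptions, not the authors' architecture.
```python
import torch
import torch.nn.functional as F

def contrastive_loss(expr_feats, obj_feats, temperature=0.07):
    """InfoNCE over a batch where expr_feats[i] matches obj_feats[i];
    all other pairs in the batch act as negatives."""
    expr = F.normalize(expr_feats, dim=-1)
    obj = F.normalize(obj_feats, dim=-1)
    logits = expr @ obj.T / temperature      # (B, B) similarity matrix
    targets = torch.arange(expr.size(0))     # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Usage with placeholder features standing in for encoder outputs.
expr_feats = torch.randn(8, 256)   # e.g., referring-expression embeddings
obj_feats = torch.randn(8, 256)    # e.g., object-instance embeddings
print(contrastive_loss(expr_feats, obj_feats).item())
```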
arXiv Detail & Related papers (2021-04-20T17:56:24Z)
- Joint Inference of States, Robot Knowledge, and Human (False-)Beliefs [90.20235972293801]
Aiming to understand how human (false-)beliefs, a core socio-cognitive ability, affect human interactions with robots, this paper proposes a graphical model to represent object states, robot knowledge, and human (false-)beliefs.
An inference algorithm is derived to fuse individual parse graphs (pg) from all robots across multiple views into a joint parse graph, which affords more effective reasoning capability and overcomes errors originating from a single view.
arXiv Detail & Related papers (2020-04-25T23:02:04Z)