INVIGORATE: Interactive Visual Grounding and Grasping in Clutter
- URL: http://arxiv.org/abs/2108.11092v2
- Date: Mon, 8 Jan 2024 02:22:44 GMT
- Title: INVIGORATE: Interactive Visual Grounding and Grasping in Clutter
- Authors: Hanbo Zhang, Yunfan Lu, Cunjun Yu, David Hsu, Xuguang Lan, Nanning
Zheng
- Abstract summary: INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
- Score: 56.00554240240515
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents INVIGORATE, a robot system that interacts with
humans through natural language and grasps a specified object in clutter. The
objects may occlude, obstruct, or even stack on top of one another. INVIGORATE
must address several challenges: (i) inferring the target object among other
occluding objects from input language expressions and RGB images, (ii) inferring
object blocking relationships (OBRs) from the images, and (iii) synthesizing a
multi-step plan that asks questions to disambiguate the target object and then
grasps it successfully.
We train separate neural networks for object detection, for visual grounding,
for question generation, and for OBR detection and grasping. They allow for
unrestricted object categories and language expressions, subject to the
training datasets. However, errors in visual perception and ambiguity in human
language are inevitable and negatively impact the robot's performance. To
overcome these uncertainties, we build a partially observable Markov decision
process (POMDP) that integrates the learned neural network modules. Through
approximate POMDP planning, the robot tracks the history of observations and
asks disambiguation questions in order to achieve a near-optimal sequence of
actions that identify and grasp the target object. INVIGORATE combines the
benefits of model-based POMDP planning and data-driven deep learning.
Preliminary experiments with INVIGORATE on a Fetch robot show significant
benefits of this integrated approach to object grasping in clutter with natural
language interactions. A demonstration video is available at
https://youtu.be/zYakh80SGcU.
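To make the decision-theoretic core concrete, here is a minimal Python sketch of the belief tracking the abstract describes: a belief over candidate target objects is initialized from visual-grounding scores and updated by Bayes' rule after each yes/no disambiguation question. The answer-reliability parameter `p_correct` and the grasp threshold are hypothetical illustration values, and the greedy one-step policy below merely stands in for INVIGORATE's approximate POMDP planning.

```python
import numpy as np

def init_belief(grounding_scores):
    # Normalize raw visual-grounding scores into a belief over candidates.
    scores = np.asarray(grounding_scores, dtype=float)
    return scores / scores.sum()

def update_belief(belief, asked_idx, answer_yes, p_correct=0.9):
    # Bayes update after asking "Do you mean object asked_idx?" and hearing
    # yes/no; p_correct models how reliable answers are (assumed value).
    likelihood = np.full_like(belief, 1.0 - p_correct)
    likelihood[asked_idx] = p_correct
    if not answer_yes:
        likelihood = 1.0 - likelihood
    posterior = belief * likelihood
    return posterior / posterior.sum()

def choose_action(belief, grasp_threshold=0.85):
    # Greedy policy: grasp once the belief is peaked enough, else ask
    # about the current most likely candidate.
    target = int(np.argmax(belief))
    return ("grasp" if belief[target] >= grasp_threshold else "ask", target)

b = init_belief([0.45, 0.40, 0.15])   # three detected objects, ambiguous
print(choose_action(b))               # ('ask', 0): too uncertain to grasp
b = update_belief(b, asked_idx=0, answer_yes=False)
print(choose_action(b))               # ('ask', 1): belief shifts to object 1
```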
Related papers
- Polaris: Open-ended Interactive Robotic Manipulation via Syn2Real Visual Grounding and Large Language Models [53.22792173053473]
We introduce an interactive robotic manipulation framework called Polaris.
Polaris integrates perception and interaction by utilizing GPT-4 alongside grounded vision models.
We propose a novel Synthetic-to-Real (Syn2Real) pose estimation pipeline.
arXiv Detail & Related papers (2024-08-15T06:40:38Z)
- Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance [13.246380364455494]
We present a new approach for language-driven 6-DoF grasp detection in cluttered point clouds.
The proposed negative prompt strategy directs the detection process toward the desired object while steering away from unwanted ones.
Our method enables an end-to-end framework where humans can command the robot to grasp desired objects in a cluttered scene using natural language.
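As a toy illustration of this idea (not the paper's diffusion-based detector), grasp candidates can be scored by their similarity to an embedding of the target expression minus their similarity to embeddings of negative prompts describing unwanted objects; the stand-in embeddings and the weight `neg_weight` below are assumptions for illustration.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def guided_scores(candidate_feats, pos_emb, neg_embs, neg_weight=0.5):
    # The positive prompt pulls a candidate's score up; the closest
    # negative prompt pushes it down.
    scores = []
    for feat in candidate_feats:
        pos = cosine(feat, pos_emb)
        neg = max((cosine(feat, n) for n in neg_embs), default=0.0)
        scores.append(pos - neg_weight * neg)
    return scores

# Random stand-in embeddings; a real system would use a learned
# text/geometry encoder over the point cloud.
rng = np.random.default_rng(0)
feats = [rng.standard_normal(16) for _ in range(4)]  # 4 grasp candidates
target = rng.standard_normal(16)                     # e.g. "the mug"
negatives = [rng.standard_normal(16)]                # e.g. "the bowl"
best = int(np.argmax(guided_scores(feats, target, negatives)))
```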
arXiv Detail & Related papers (2024-07-18T18:24:51Z)
- Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction (MPI).
Experimental results demonstrate that MPI improves on the previous state of the art by 10% to 64% on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents that perform open-vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- Controllable Human-Object Interaction Synthesis [77.56877961681462]
We propose Controllable Human-Object Interaction Synthesis (CHOIS) to generate synchronized object motion and human motion in 3D scenes.
Here, language descriptions inform style and intent, and waypoints, which can be effectively extracted from high-level planning, ground the motion in the scene.
Our module seamlessly integrates with a path planning module, enabling the generation of long-term interactions in 3D environments.
arXiv Detail & Related papers (2023-12-06T21:14:20Z)
- PROGrasp: Pragmatic Human-Robot Communication for Object Grasping [22.182690439449278]
Interactive Object Grasping (IOG) is the task of identifying and grasping the desired object via human-robot natural language interaction.
Inspired by pragmatics, we introduce a new IOG task, Pragmatic-IOG, and the corresponding dataset, Intention-oriented Multi-modal Dialogue (IM-Dial).
PROGrasp performs Pragmatic-IOG by incorporating modules for visual grounding, question asking, object grasping, and, most importantly, answer interpretation for pragmatic inference (a schematic loop is sketched below).
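The following is a hedged sketch of how such modules might compose into one interaction loop: the interfaces `ground`, `ask_question`, `get_user_answer`, `interpret_answer`, and `grasp` are hypothetical stand-ins for the learned components named in the abstract, not PROGrasp's actual API.

```python
def interactive_grasp(image, instruction, modules, max_turns=3):
    # Ground the instruction, ask clarifying questions while the target
    # is ambiguous, reinterpret each answer pragmatically, then grasp.
    candidates = modules.ground(image, instruction)   # ranked object hypotheses
    for _ in range(max_turns):
        if len(candidates) <= 1:
            break                                     # target resolved
        question = modules.ask_question(candidates)   # e.g. "The red one on the left?"
        answer = modules.get_user_answer(question)    # free-form user reply
        # Pragmatic inference: map the answer back onto the hypothesis set.
        candidates = modules.interpret_answer(answer, candidates)
    return modules.grasp(image, candidates[0])
```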
arXiv Detail & Related papers (2023-09-14T14:45:47Z)
- Weakly-Supervised HOI Detection from Interaction Labels Only and Language/Vision-Language Priors [36.75629570208193]
Human-object interaction (HOI) detection aims to extract interacting human-object pairs and their interaction categories from a given natural image.
In this paper, we tackle HOI detection with the weakest supervision setting in the literature, using only image-level interaction labels.
We first propose an approach to prune non-interacting human and object proposals to increase the quality of positive pairs within the bag, exploiting the grounding capability of the vision-language model.
Second, we use a large language model to query which interactions are possible between a human and a given object category, in order to force the model not to put emphasis on implausible interactions.
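A minimal sketch of how such a language prior could gate interaction scores is shown below; the verb list, prompt wording, and `query_llm` helper are hypothetical stand-ins, not the paper's actual setup.

```python
VERBS = ["ride", "eat", "hold", "wash"]

def plausible_verbs(object_category, query_llm):
    # Ask the LLM which interactions make sense for this object.
    # query_llm is a stand-in for any text-completion interface.
    prompt = (f"Which of the actions {VERBS} can a person plausibly perform "
              f"on a {object_category}? Answer with a comma-separated list.")
    answer = query_llm(prompt)
    return {v for v in VERBS if v in answer}

def mask_scores(scores, object_category, query_llm):
    # Zero out interaction scores that the language prior rules out.
    allowed = plausible_verbs(object_category, query_llm)
    return {verb: (s if verb in allowed else 0.0) for verb, s in scores.items()}
```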
arXiv Detail & Related papers (2023-03-09T19:08:02Z)
- Learning 6-DoF Fine-grained Grasp Detection Based on Part Affordance Grounding [42.04502185508723]
We propose a new large Language-guided SHape grAsPing datasEt to promote 3D part-level affordance and grasping ability learning.
From the perspective of robotic cognition, we design a two-stage fine-grained robotic grasping framework (named LangPartGPD).
Our method combines the advantages of human-robot collaboration and large language models (LLMs).
Results show our method achieves competitive performance in 3D geometry fine-grained grounding, object affordance inference, and 3D part-aware grasping tasks.
arXiv Detail & Related papers (2023-01-27T07:00:54Z)
- TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work, including state-of-the-art methods designed specifically for trajectory or pose forecasting.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
- Few-Shot Visual Grounding for Natural Human-Robot Interaction [0.0]
We propose a software architecture that segments a target object, indicated verbally by a human user, from a crowded scene.
At the core of our system, we employ a multi-modal deep neural network for visual grounding.
We evaluate the performance of the proposed model on real RGB-D data collected from public scene datasets.
arXiv Detail & Related papers (2021-03-17T15:24:02Z)