Differentiable Parsing and Visual Grounding of Verbal Instructions for
Object Placement
- URL: http://arxiv.org/abs/2210.00215v1
- Date: Sat, 1 Oct 2022 07:36:51 GMT
- Title: Differentiable Parsing and Visual Grounding of Verbal Instructions for
Object Placement
- Authors: Zirui Zhao, Wee Sun Lee, David Hsu
- Abstract summary: We introduce ParaGon, a PARsing And visual GrOuNding framework for language-conditioned object placement.
It parses language instructions into relations between objects and grounds those objects in visual scenes.
ParaGon encodes all of those procedures into neural networks for end-to-end training.
- Score: 26.74189486483276
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Grounding spatial relations in natural language for object placement raises
ambiguity and compositionality issues. To address these issues, we introduce
ParaGon, a PARsing And visual GrOuNding framework for language-conditioned
object placement. It parses language instructions into relations between
objects and grounds those objects in visual scenes. A particle-based GNN then
conducts relational reasoning between grounded objects for placement
generation. ParaGon encodes all of those procedures into neural networks for
end-to-end training, avoiding the need to annotate parsing or object-reference
grounding labels. Our approach inherently integrates parsing-based methods into
a probabilistic, data-driven framework. It is data-efficient and generalizable
for learning compositional instructions, robust to noisy language inputs, and
adapts to the uncertainty of ambiguous instructions.
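The abstract sketches a three-stage pipeline: parse the instruction into pairwise object relations, ground the mentioned objects in the image, and run a particle-based GNN over the grounded objects to generate a placement. As a rough illustration only, here is a minimal PyTorch-style sketch of such a pipeline; every class and method name (SoftParser, VisualGrounder, ParticleGNN, ParaGonSketch) is a hypothetical placeholder, not ParaGon's actual code or API.

```python
# Minimal sketch of a ParaGon-style pipeline under assumed interfaces (not the
# authors' code): parse the instruction into soft relations, softly ground the
# mentioned objects in the image, then run a particle-based GNN to propose
# placement particles.
import torch
import torch.nn as nn


class SoftParser(nn.Module):
    """Differentiable 'parser': predicts relation probabilities per token
    instead of emitting a discrete parse tree."""
    def __init__(self, d_model=128, n_relations=8):
        super().__init__()
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.rel_head = nn.Linear(d_model, n_relations)

    def forward(self, token_emb):                    # (B, T, d)
        h, _ = self.encoder(token_emb)
        return self.rel_head(h).softmax(-1)          # (B, T, n_relations)


class VisualGrounder(nn.Module):
    """Soft word-to-object assignment by matching token and object features."""
    def forward(self, token_emb, object_feats):      # (B, T, d), (B, N, d)
        scores = torch.einsum("btd,bnd->btn", token_emb, object_feats)
        return scores.softmax(-1)                    # (B, T, N)


class ParticleGNN(nn.Module):
    """One round of message passing over grounded objects; each node emits a
    set of 2-D particles approximating its placement distribution."""
    def __init__(self, d_model=128, n_particles=32):
        super().__init__()
        self.n_particles = n_particles
        self.msg = nn.Linear(2 * d_model, d_model)
        self.readout = nn.Linear(d_model, 2 * n_particles)

    def forward(self, node_feats):                   # (B, N, d)
        B, N, d = node_feats.shape
        pairs = torch.cat([node_feats.unsqueeze(2).expand(B, N, N, d),
                           node_feats.unsqueeze(1).expand(B, N, N, d)], dim=-1)
        updated = node_feats + self.msg(pairs).mean(dim=2)
        return self.readout(updated).view(B, N, self.n_particles, 2)


class ParaGonSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.parser = SoftParser()
        self.grounder = VisualGrounder()
        self.gnn = ParticleGNN()

    def forward(self, token_emb, object_feats):
        rel_probs = self.parser(token_emb)                    # soft relations
        assign = self.grounder(token_emb, object_feats)       # soft grounding
        node_feats = torch.einsum("btn,btd->bnd", assign, token_emb)
        particles = self.gnn(node_feats + object_feats)       # placements
        # in the full model rel_probs would condition the GNN's edges
        return particles, rel_probs


if __name__ == "__main__":
    model = ParaGonSketch()
    particles, rels = model(torch.randn(2, 6, 128), torch.randn(2, 4, 128))
    print(particles.shape)   # torch.Size([2, 4, 32, 2])
```

The design choice this sketch tries to mirror is that both parsing and grounding produce soft, differentiable assignments rather than discrete decisions, which is what makes end-to-end training without parse or grounding labels possible.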
Related papers
- Energy-based Models are Zero-Shot Planners for Compositional Scene
Rearrangement [19.494104738436892]
We show that our framework can execute compositional instructions zero-shot in simulation and in the real world.
It outperforms language-to-action reactive policies and Large Language Model planners by a large margin, especially for long instructions that involve compositions of multiple concepts.
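The title points to the mechanism: each spatial concept in the instruction becomes an energy function over object poses, and a compositional instruction is handled by summing those energies and minimizing the total. The following toy example is purely illustrative (hand-written energies, not the paper's learned models):

```python
# Purely illustrative: compose two hand-written energy terms for the
# instruction "put A left of B and near C" and optimize A's pose by gradient
# descent on the summed energy. Not the paper's learned energy models.
import torch

def left_of(pose_a, pose_b, margin=0.1):
    # zero energy once A is at least `margin` to the left of B along x
    return torch.relu(pose_a[0] - pose_b[0] + margin)

def near(pose_a, pose_c, radius=0.2):
    # zero energy once A is within `radius` of C
    return torch.relu(torch.norm(pose_a - pose_c) - radius)

pose_b = torch.tensor([0.5, 0.5])
pose_c = torch.tensor([0.2, 0.8])
pose_a = torch.zeros(2, requires_grad=True)          # pose being planned

optimizer = torch.optim.SGD([pose_a], lr=0.1)
for _ in range(200):
    optimizer.zero_grad()
    energy = left_of(pose_a, pose_b) + near(pose_a, pose_c)   # composition = sum
    energy.backward()
    optimizer.step()

print(pose_a.detach())   # a pose satisfying both relations, if one exists
```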
arXiv Detail & Related papers (2023-04-27T17:55:13Z) - Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z) - Language Conditioned Spatial Relation Reasoning for 3D Object Grounding [87.03299519917019]
Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations.
We propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.
arXiv Detail & Related papers (2022-11-17T16:42:39Z) - Identifying concept libraries from language about object structure [56.83719358616503]
We leverage natural language descriptions for a diverse set of 2K procedurally generated objects to identify the parts people use.
We formalize our problem as search over a space of program libraries that contain different part concepts.
By combining naturalistic language at scale with structured program representations, we discover a fundamental information-theoretic tradeoff governing the part concepts people name.
arXiv Detail & Related papers (2022-05-11T17:49:25Z) - INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for object blocking relationship (OBR) detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
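The integration idea, schematically: the learned modules feed a belief over which detected object the user means, and the POMDP policy decides between asking a clarifying question and grasping. Below is an assumption-laden sketch of such a loop; all module calls (detect, ground, ask_user, answer_likelihood, grasp) are invented stand-ins, not INVIGORATE's interfaces.

```python
# Schematic POMDP-style decision loop that integrates learned perception
# modules: keep a belief over which detected object is the referred target,
# ask a clarifying question while uncertain, grasp once confident.
import numpy as np

def interactive_grasp(image, instruction, detect, ground, ask_user,
                      answer_likelihood, grasp,
                      confidence=0.85, max_questions=3):
    objects = detect(image)                          # learned object detector
    belief = np.asarray(ground(instruction, objects), dtype=float)
    belief /= belief.sum()                           # belief over target object

    for _ in range(max_questions):
        if belief.max() >= confidence:
            break                                    # confident enough to act
        query = int(belief.argmax())                 # ask about the top candidate
        answer = ask_user(objects[query])            # e.g. "do you mean this cup?"
        # Bayesian update with the learned answer/observation model
        belief *= np.asarray(answer_likelihood(answer, objects, query), dtype=float)
        belief /= belief.sum()

    return grasp(objects[int(belief.argmax())])      # learned grasping module
```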
arXiv Detail & Related papers (2021-08-25T07:35:21Z) - Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z) - ClawCraneNet: Leveraging Object-level Relation for Text-based Video
Segmentation [47.7867284770227]
Text-based video segmentation is a challenging task that segments the objects referred to by natural language in videos.
We introduce a novel top-down approach that imitates how humans segment an object with language guidance.
Our method outperforms state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-03-19T09:31:08Z) - Composing Pick-and-Place Tasks By Grounding Language [41.075844857146805]
We present a robot system that follows unconstrained language instructions to pick and place arbitrary objects.
Our approach infers objects and their relationships from input images and language expressions.
Results obtained using a real-world PR2 robot demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2021-02-16T11:29:09Z) - Few-shot Object Grounding and Mapping for Natural Language Robot
Instruction Following [15.896892723068932]
We study the problem of learning a robot policy to follow natural language instructions that can be easily extended to reason about new objects.
We introduce a few-shot language-conditioned object grounding method trained from augmented reality data.
We present a learned map representation that encodes object locations and their instructed use, and construct it from our few-shot grounding output.
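As a purely illustrative toy version of such a map (the paper's representation is learned, not hand-built): write each grounded detection's instructed role and confidence into a spatial grid keyed by its location.

```python
# Toy object-centric map built from grounding outputs. Purely illustrative;
# the paper's map is a learned neural representation.
import numpy as np

def build_map(groundings, grid_size=32, n_roles=4):
    """groundings: iterable of (x, y, role_id, score), with x, y in [0, 1]."""
    grid = np.zeros((grid_size, grid_size, n_roles))
    for x, y, role, score in groundings:
        i, j = int(y * (grid_size - 1)), int(x * (grid_size - 1))
        grid[i, j, role] = max(grid[i, j, role], score)
    return grid

# e.g. "put the mug next to the plate": role 0 = object to move, role 1 = anchor
world_map = build_map([(0.20, 0.70, 0, 0.9),   # mug
                       (0.60, 0.40, 1, 0.8)])  # plate
```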
arXiv Detail & Related papers (2020-11-14T20:35:20Z)