YouRefIt: Embodied Reference Understanding with Language and Gesture
- URL: http://arxiv.org/abs/2109.03413v1
- Date: Wed, 8 Sep 2021 03:27:32 GMT
- Title: YouRefIt: Embodied Reference Understanding with Language and Gesture
- Authors: Yixin Chen, Qing Li, Deqian Kong, Yik Lun Kei, Song-Chun Zhu, Tao Gao,
Yixin Zhu, Siyuan Huang
- Abstract summary: We study the understanding of embodied reference.
One agent uses both language and gesture to refer to an object to another agent in a shared physical environment.
The crowd-sourced YouRefIt dataset contains 4,195 unique reference clips in 432 indoor scenes.
- Score: 95.93218436323481
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the understanding of embodied reference: One agent uses both
language and gesture to refer to an object to another agent in a shared
physical environment. Of note, this new visual task requires understanding
multimodal cues with perspective-taking to identify which object is being
referred to. To tackle this problem, we introduce YouRefIt, a new crowd-sourced
dataset of embodied reference collected in various physical scenes; the dataset
contains 4,195 unique reference clips in 432 indoor scenes. To the best of our
knowledge, this is the first embodied reference dataset that allows us to study
referring expressions in daily physical scenes to understand referential
behavior, human communication, and human-robot interaction. We further devise
two benchmarks for image-based and video-based embodied reference
understanding. Comprehensive baselines and extensive experiments provide the
first machine-perception results on how referring expressions and gestures
affect embodied reference understanding. Our results provide essential
evidence that gestural cues are as critical as language cues in understanding
embodied reference.
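As a rough illustration of the task (not the paper's model), the sketch below combines a hypothetical per-object language-grounding score with a pointing-gesture alignment score and picks the highest-scoring candidate; all inputs (box centers, hand position, pointing direction) are made up for the example.

```python
import numpy as np

def pick_referent(box_centers, language_scores, hand_xy, point_dir, alpha=0.5):
    """Combine language grounding with pointing-gesture alignment.

    box_centers     : (N, 2) candidate object centers in image coordinates
    language_scores : (N,) grounding scores from a language model, in [0, 1]
    hand_xy         : (2,) position of the pointing hand
    point_dir       : (2,) pointing direction (normalized below)
    alpha           : weight between language and gesture cues
    """
    d = point_dir / (np.linalg.norm(point_dir) + 1e-8)
    vecs = box_centers - hand_xy                        # hand-to-object vectors
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-8
    gesture_scores = (vecs @ d + 1.0) / 2.0             # cosine, rescaled to [0, 1]
    combined = alpha * language_scores + (1 - alpha) * gesture_scores
    return int(np.argmax(combined))

# Hypothetical example: two candidate objects; language alone is nearly tied,
# the pointing gesture disambiguates.
centers = np.array([[100.0, 200.0], [400.0, 210.0]])
lang = np.array([0.55, 0.50])
print(pick_referent(centers, lang, hand_xy=np.array([380.0, 300.0]),
                    point_dir=np.array([0.2, -0.98])))   # -> 1
```

With language alone the two candidates are nearly tied; the gesture term breaks the tie, which is the kind of complementarity the abstract argues for.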
Related papers
- J-CRe3: A Japanese Conversation Dataset for Real-world Reference Resolution [22.911318874589448]
In real-world reference resolution, a system must ground the verbal information that appears in user interactions to the visual information observed in egocentric views.
We propose a multimodal reference resolution task and construct a Japanese Conversation dataset for Real-world Reference Resolution (J-CRe3).
Our dataset contains egocentric video and dialogue audio of real-world conversations between two people acting as a master and an assistant robot at home.
arXiv Detail & Related papers (2024-03-28T09:32:43Z)
- Referring Multi-Object Tracking [78.63827591797124]
We propose a new and general referring understanding task, termed referring multi-object tracking (RMOT).
Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking.
To the best of our knowledge, this is the first work to predict an arbitrary number of referent objects in videos.
arXiv Detail & Related papers (2023-03-06T18:50:06Z)
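A minimal sketch of the idea summarized in the RMOT entry above: each tracked object is scored against the referring expression, and every track above a threshold is kept, so zero, one, or many referents can be returned. The embeddings and threshold here are hypothetical placeholders, not the paper's actual model.

```python
import numpy as np

def select_referred_tracks(track_embs, expr_emb, threshold=0.6):
    """Keep every track whose embedding matches the referring expression.

    track_embs : (N, D) per-track visual embeddings (assumed precomputed)
    expr_emb   : (D,) embedding of the referring expression
    threshold  : cosine-similarity cutoff; zero or many tracks may pass
    """
    t = track_embs / (np.linalg.norm(track_embs, axis=1, keepdims=True) + 1e-8)
    e = expr_emb / (np.linalg.norm(expr_emb) + 1e-8)
    sims = t @ e
    return np.flatnonzero(sims >= threshold)

# Hypothetical usage with random embeddings.
rng = np.random.default_rng(0)
tracks = rng.normal(size=(5, 128))
query = tracks[2] + 0.1 * rng.normal(size=128)   # expression matches track 2
print(select_referred_tracks(tracks, query))     # -> [2]
```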
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) a Position-Aware Module (PAM), which provides position information for all objects related to the natural language description, and 2) a Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
- Phrase-Based Affordance Detection via Cyclic Bilateral Interaction [17.022853987801877]
We explore affordance perception from a vision-language perspective and consider the challenging problem of phrase-based affordance detection.
We propose a cyclic bilateral consistency enhancement network (CBCE-Net) to align language and vision features progressively.
Specifically, CBCE-Net consists of a mutually guided vision-language module that progressively updates the shared features of vision and language, and a cyclic interaction module (CIM) that facilitates the perception of possible interactions with objects in a cyclic manner.
arXiv Detail & Related papers (2022-02-24T13:02:27Z)
- Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z)
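The entry above describes learning contrastive features at the image and object-instance levels; the sketch below shows a generic InfoNCE-style objective over expression/instance pairs as one plausible instantiation (an assumption for illustration, not necessarily the authors' exact loss).

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(expr_feats, obj_feats, temperature=0.07):
    """Symmetric InfoNCE loss: the i-th expression should match the i-th object
    instance and repel all other instances in the batch (and vice versa).

    expr_feats : (B, D) referring-expression embeddings
    obj_feats  : (B, D) object-instance embeddings
    """
    expr = F.normalize(expr_feats, dim=-1)
    obj = F.normalize(obj_feats, dim=-1)
    logits = expr @ obj.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(expr.size(0), device=expr.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Hypothetical usage with random features.
loss = instance_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(float(loss))
```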
- ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation [47.7867284770227]
Text-based video segmentation is a challenging task that segments out the natural language referred objects in videos.
We introduce a novel top-down approach that imitates how humans segment an object with language guidance.
Our method outperforms state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-03-19T09:31:08Z)
- Improving Machine Reading Comprehension with Contextualized Commonsense Knowledge [62.46091695615262]
We aim to extract commonsense knowledge to improve machine reading comprehension.
We propose to represent relations implicitly by situating structured knowledge in a context.
We employ a teacher-student paradigm to inject multiple types of contextualized knowledge into a student machine reader.
arXiv Detail & Related papers (2020-09-12T17:20:01Z)
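The entry above mentions a teacher-student paradigm for injecting contextualized knowledge into a machine reader; the sketch below is a generic soft-label distillation loss of the kind such setups commonly use, included as an assumption-laden illustration rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels, T=2.0, alpha=0.5):
    """Blend the usual hard-label loss with a soft loss that pushes the student's
    answer distribution toward the teacher's (temperature-scaled KL divergence)."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft + (1 - alpha) * hard

# Hypothetical usage: 4 questions, 5 answer options each.
s, t = torch.randn(4, 5, requires_grad=True), torch.randn(4, 5)
print(float(distillation_loss(s, t, torch.tensor([0, 3, 1, 4]))))
```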
- COBE: Contextualized Object Embeddings from Narrated Instructional Video [52.73710465010274]
We propose a new framework for learning Contextualized OBject Embeddings from automatically-transcribed narrations of instructional videos.
We leverage the semantic and compositional structure of language by training a visual detector to predict a contextualized word embedding of the object and its associated narration.
Our experiments show that our detector learns to predict a rich variety of contextual object information, and that it is highly effective in the settings of few-shot and zero-shot learning.
arXiv Detail & Related papers (2020-07-14T19:04:08Z)
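A minimal sketch of the training signal described in the COBE entry above, under the assumption that region features come from a detector backbone and target embeddings come from a pretrained language model run on the narration; both inputs are random stand-ins here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextualEmbeddingHead(nn.Module):
    """Map a detected object's region feature to the contextualized word
    embedding of its mention in the narration."""
    def __init__(self, feat_dim=1024, emb_dim=768):
        super().__init__()
        self.proj = nn.Linear(feat_dim, emb_dim)

    def forward(self, region_feats):
        return F.normalize(self.proj(region_feats), dim=-1)

def embedding_loss(pred_emb, narration_emb):
    # Pull the predicted embedding toward the narration's contextual embedding.
    return (1.0 - F.cosine_similarity(pred_emb, narration_emb, dim=-1)).mean()

# Hypothetical usage: 16 detected regions with 1024-d features, 768-d targets.
head = ContextualEmbeddingHead()
print(float(embedding_loss(head(torch.randn(16, 1024)), torch.randn(16, 768))))
```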
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.