InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot
Interactions
- URL: http://arxiv.org/abs/2310.12147v1
- Date: Wed, 18 Oct 2023 17:57:05 GMT
- Title: InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot
Interactions
- Authors: Hanbo Zhang and Jie Xu and Yuchen Mo and Tao Kong
- Abstract summary: We present a large-scale dataset, invig, for interactive visual grounding under language ambiguity.
Our dataset comprises over 520K images accompanied by open-ended goal-oriented disambiguation dialogues.
To the best of our knowledge, the invig dataset is the first large-scale dataset for resolving open-ended interactive visual grounding.
- Score: 23.296139146133573
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ambiguity is ubiquitous in human communication. Previous approaches in
Human-Robot Interaction (HRI) have often relied on predefined interaction
templates, leading to reduced performance in realistic and open-ended
scenarios. To address these issues, we present a large-scale dataset, \invig,
for interactive visual grounding under language ambiguity. Our dataset
comprises over 520K images accompanied by open-ended goal-oriented
disambiguation dialogues, encompassing millions of object instances and
corresponding question-answer pairs. Leveraging the \invig dataset, we conduct
extensive studies and propose a set of baseline solutions for end-to-end
interactive visual disambiguation and grounding, achieving a 45.6\% success
rate during validation. To the best of our knowledge, the \invig dataset is the
first large-scale dataset for resolving open-ended interactive visual
grounding, presenting a practical yet highly challenging benchmark for
ambiguity-aware HRI. Codes and datasets are available at:
\href{https://openivg.github.io}{https://openivg.github.io}.
Related papers
- Ask, Pose, Unite: Scaling Data Acquisition for Close Interactions with Vision Language Models [5.541130887628606]
Social dynamics in close human interactions pose significant challenges for Human Mesh Estimation (HME)
We introduce a novel data generation method that utilizes Large Vision Language Models (LVLMs) to annotate contact maps which guide test-time optimization to produce paired image and pseudo-ground truth meshes.
This methodology not only alleviates the annotation burden but also enables the assembly of a comprehensive dataset specifically tailored for close interactions in HME.
arXiv Detail & Related papers (2024-10-01T01:14:24Z) - Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z) - PRESTO: A Multilingual Dataset for Parsing Realistic Task-Oriented
Dialogs [39.58414649004708]
PRESTO is a dataset of over 550K contextual multilingual conversations between humans and virtual assistants.
It contains challenges that occur in real-world NLU tasks such as disfluencies, code-switching, and revisions.
Our mT5 model based baselines demonstrate that the conversational phenomenon present in PRESTO are challenging to model.
arXiv Detail & Related papers (2023-03-15T21:51:13Z) - Exploiting Scene Graphs for Human-Object Interaction Detection [81.49184987430333]
Human-Object Interaction (HOI) detection is a fundamental visual task aiming at localizing and recognizing interactions between humans and objects.
We propose a novel method to exploit this information, through the scene graph, for the Human-Object Interaction (SG2HOI) detection task.
Our method, SG2HOI, incorporates the SG information in two ways: (1) we embed a scene graph into a global context clue, serving as the scene-specific environmental context; and (2) we build a relation-aware message-passing module to gather relationships from objects' neighborhood and transfer them into interactions.
arXiv Detail & Related papers (2021-08-19T09:40:50Z) - LIGHTEN: Learning Interactions with Graph and Hierarchical TEmporal
Networks for HOI in videos [13.25502885135043]
Analyzing the interactions between humans and objects from a video includes identification of relationships between humans and the objects present in the video.
We present a hierarchical approach, LIGHTEN, to learn visual features to effectively capture truth at multiple granularities in a video.
We achieve state-of-the-art results in human-object interaction detection (88.9% and 92.6%) and anticipation tasks of CAD-120 and competitive results on image based HOI detection in V-COCO.
arXiv Detail & Related papers (2020-12-17T05:44:07Z) - ConsNet: Learning Consistency Graph for Zero-Shot Human-Object
Interaction Detection [101.56529337489417]
We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of human, action, object> in images.
We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs.
Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into visual-semantic joint embedding space and obtains detection results by measuring their similarities.
arXiv Detail & Related papers (2020-08-14T09:11:18Z) - A Graph-based Interactive Reasoning for Human-Object Interaction
Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs.
We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet.
Our framework is end-to-end trainable and free from costly annotations like human pose.
arXiv Detail & Related papers (2020-07-14T09:29:03Z) - ORD: Object Relationship Discovery for Visual Dialogue Generation [60.471670447176656]
We propose an object relationship discovery (ORD) framework to preserve the object interactions for visual dialogue generation.
A hierarchical graph convolutional network (HierGCN) is proposed to retain the object nodes and neighbour relationships locally, and then refines the object-object connections globally.
Experiments have proved that the proposed method can significantly improve the quality of dialogue by utilising the contextual information of visual relationships.
arXiv Detail & Related papers (2020-06-15T12:25:40Z) - Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the inter-action.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
arXiv Detail & Related papers (2020-03-31T08:42:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.