DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded
Dialogue
- URL: http://arxiv.org/abs/2101.00151v1
- Date: Fri, 1 Jan 2021 03:20:22 GMT
- Title: DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded
Dialogue
- Authors: Hung Le and Chinnadhurai Sankar and Seungwhan Moon and Ahmad Beirami
and Alborz Geramifard and Satwik Kottur
- Abstract summary: A video-grounded dialogue system is required to understand both dialogue and video.
Existing benchmarks do not have enough annotations to help analyze dialogue systems.
We present a diagnostic dataset that can test a range of reasoning abilities on videos and dialogues.
- Score: 30.930757279692163
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A video-grounded dialogue system is required to understand both dialogue,
which contains semantic dependencies from turn to turn, and video, which
contains visual cues of spatial and temporal scene variations. Building such
dialogue systems is a challenging problem involving complex multimodal and
temporal inputs, and studying them independently is hard with existing
datasets. Existing benchmarks do not have enough annotations to help analyze
dialogue systems and understand their linguistic and visual reasoning
capability and limitations in isolation. These benchmarks are also not
explicitly designed to minimize biases that models can exploit without actual
reasoning. To address these limitations, in this paper, we present a diagnostic
dataset that can test a range of reasoning abilities on videos and dialogues.
The dataset is designed to contain minimal biases and has detailed annotations
for the different types of reasoning each question requires, including
cross-turn video interval tracking and dialogue object tracking. We use our
dataset to analyze several dialogue system approaches, providing interesting
insights into their abilities and limitations. In total, the dataset contains
$10$ instances of $10$-round dialogues for each of $\sim11k$ synthetic videos,
resulting in more than $100k$ dialogues and $1M$ question-answer pairs. Our
code and dataset will be made public.
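The stated totals follow from simple multiplication; a quick sanity check of the abstract's numbers, assuming roughly 11k videos (the exact count is not given here):

```python
# Sanity-check the dataset-size arithmetic from the abstract.
# Assumption: ~11,000 synthetic videos ("~11k" in the abstract).
videos = 11_000           # ~11k synthetic videos
dialogues_per_video = 10  # 10 dialogue instances per video
rounds_per_dialogue = 10  # 10 rounds, i.e. one QA pair per round

dialogues = videos * dialogues_per_video    # 110,000 -> "more than 100k"
qa_pairs = dialogues * rounds_per_dialogue  # 1,100,000 -> "~1M"
print(dialogues, qa_pairs)
```

With ~11k videos the counts come out to 110k dialogues and 1.1M QA pairs, consistent with the "more than 100k dialogues and 1M question-answer pairs" claimed.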
Related papers
- Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos [3.0758169771529693]
We introduce a dataset comprising $2,017$ videos with $5,986$ human-annotated dialogues consisting of $40,954$ interleaved dialogue turns. A conversational model must both recognize pertinent visual details over time and answer questions where the required information is not necessarily present in the visual information.
arXiv Detail & Related papers (2025-06-11T17:23:35Z)
- DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI [92.29874802394167]
DialogStudio is the largest and most diverse collection of dialogue datasets.
Our collection encompasses data from open-domain dialogues, task-oriented dialogues, natural language understanding, conversational recommendation, dialogue summarization, and knowledge-grounded dialogues.
arXiv Detail & Related papers (2023-07-19T17:57:53Z)
- VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions [47.94531693056304]
Video-grounded dialogue understanding is a challenging problem that requires a machine to perceive, parse, and reason over situated semantics.
We present the Video-grounded Scene&Topic AwaRe (VSTAR) dialogue dataset, a large-scale video-grounded dialogue understanding dataset based on 395 TV series.
arXiv Detail & Related papers (2023-05-30T05:40:37Z)
- CGoDial: A Large-Scale Benchmark for Chinese Goal-oriented Dialog Evaluation [75.60156479374416]
CGoDial is a new challenging and comprehensive Chinese benchmark for Goal-oriented Dialog evaluation.
It contains 96,763 dialog sessions and 574,949 dialog turns in total, covering three datasets with different knowledge sources.
To bridge the gap between academic benchmarks and spoken dialog scenarios, we either collect data from real conversations or add spoken features to existing datasets via crowd-sourcing.
arXiv Detail & Related papers (2022-11-21T16:21:41Z)
- Multimodal Dialogue State Tracking [97.25466640240619]
The Video-Dialogue Transformer Network (VDTN) combines both object-level and segment-level features and learns contextual dependencies between videos and dialogues to generate multimodal dialogue states.
arXiv Detail & Related papers (2022-06-16T03:18:42Z)
- HybriDialogue: An Information-Seeking Dialogue Dataset Grounded on Tabular and Textual Data [87.67278915655712]
We present a new dialogue dataset, HybriDialogue, which consists of crowdsourced natural conversations grounded on both Wikipedia text and tables.
The conversations are created through the decomposition of complex multi-hop questions into simple, realistic multi-turn dialogue interactions.
arXiv Detail & Related papers (2022-04-28T00:52:16Z)
- $C^3$: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues [97.25466640240619]
Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses relevant to both the dialogue and video context.
Most existing approaches employ deep learning models and have achieved remarkable performance, given the relatively small datasets available.
We propose Compositional Counterfactual Contrastive Learning, a novel approach that develops contrastive training between factual and counterfactual samples in video-grounded dialogues.
arXiv Detail & Related papers (2021-06-16T16:05:27Z)
- OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts [35.57757367869986]
We release OpenViDial, a large-scale multi-module dialogue dataset.
OpenViDial contains a total number of 1.1 million dialogue turns.
We propose a family of encoder-decoder models leveraging both textual and visual contexts.
arXiv Detail & Related papers (2020-12-30T03:02:50Z)
- RiSAWOZ: A Large-Scale Multi-Domain Wizard-of-Oz Dataset with Rich Semantic Annotations for Task-Oriented Dialogue Modeling [35.75880078666584]
RiSAWOZ is a large-scale multi-domain Chinese Wizard-of-Oz dataset with rich semantic annotations.
It contains 11.2K human-to-human (H2H) multi-turn semantically annotated dialogues, with more than 150K utterances spanning over 12 domains.
arXiv Detail & Related papers (2020-10-17T08:18:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.