DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded
Dialogue
- URL: http://arxiv.org/abs/2101.00151v1
- Date: Fri, 1 Jan 2021 03:20:22 GMT
- Title: DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded
Dialogue
- Authors: Hung Le and Chinnadhurai Sankar and Seungwhan Moon and Ahmad Beirami
and Alborz Geramifard and Satwik Kottur
- Abstract summary: A video-grounded dialogue system is required to understand both dialogue and video.
Existing benchmarks do not have enough annotations to help analyze dialogue systems.
We present a diagnostic dataset that can test a range of reasoning abilities on videos and dialogues.
- Score: 30.930757279692163
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A video-grounded dialogue system is required to understand both dialogue,
which contains semantic dependencies from turn to turn, and video, which
contains visual cues of spatial and temporal scene variations. Building such
dialogue systems is a challenging problem involving complex multimodal and
temporal inputs, and studying them independently is hard with existing
datasets. Existing benchmarks do not have enough annotations to help analyze
dialogue systems and understand their linguistic and visual reasoning
capability and limitations in isolation. These benchmarks are also not
explicitly designed to minimize biases that models can exploit without actual
reasoning. To address these limitations, in this paper, we present a diagnostic
dataset that can test a range of reasoning abilities on videos and dialogues.
The dataset is designed to contain minimal biases and has detailed annotations
for the different types of reasoning each question requires, including
cross-turn video interval tracking and dialogue object tracking. We use our
dataset to analyze several dialogue system approaches, providing interesting
insights into their abilities and limitations. In total, the dataset contains
$10$ instances of $10$-round dialogues for each of $\sim11k$ synthetic videos,
resulting in more than $100k$ dialogues and $1M$ question-answer pairs. Our
code and dataset will be made public.
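The stated totals follow from simple multiplication; a quick sanity check of the abstract's numbers, assuming roughly 11k videos (the exact count is not given here):

```python
# Sanity-check the dataset-size arithmetic from the abstract.
# Assumption: ~11,000 synthetic videos ("~11k" in the abstract).
videos = 11_000           # ~11k synthetic videos
dialogues_per_video = 10  # 10 dialogue instances per video
rounds_per_dialogue = 10  # 10 rounds, i.e. one QA pair per round

dialogues = videos * dialogues_per_video    # 110,000 -> "more than 100k"
qa_pairs = dialogues * rounds_per_dialogue  # 1,100,000 -> "~1M"
print(dialogues, qa_pairs)
```

With ~11k videos the counts come out to 110k dialogues and 1.1M QA pairs, consistent with the "more than 100k dialogues and 1M question-answer pairs" claimed.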
Related papers
- Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos [3.0758169771529693]
We introduce a dataset comprising $2,017$ videos with $5,986$ human-annotated dialogues consisting of $40,954$ interleaved dialogue turns. A conversational model must both recognize pertinent visual details over time and answer questions where the required information is not necessarily present in the visual information.
arXiv Detail & Related papers (2025-06-11T17:23:35Z)
- DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI [92.29874802394167]
DialogStudio is the largest and most diverse collection of dialogue datasets.
Our collection encompasses data from open-domain dialogues, task-oriented dialogues, natural language understanding, conversational recommendation, dialogue summarization, and knowledge-grounded dialogues.
arXiv Detail & Related papers (2023-07-19T17:57:53Z)
- VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions [47.94531693056304]
Video-grounded dialogue understanding is a challenging problem that requires a machine to perceive, parse, and reason over situated semantics.
We present the Video-grounded Scene&Topic AwaRe (VSTAR) dialogue dataset, a large-scale video-grounded dialogue understanding dataset based on 395 TV series.
arXiv Detail & Related papers (2023-05-30T05:40:37Z)
- CGoDial: A Large-Scale Benchmark for Chinese Goal-oriented Dialog Evaluation [75.60156479374416]
CGoDial is a new challenging and comprehensive Chinese benchmark for Goal-oriented Dialog evaluation.
It contains 96,763 dialog sessions and 574,949 dialog turns in total, covering three datasets with different knowledge sources.
To bridge the gap between academic benchmarks and spoken dialog scenarios, we either collect data from real conversations or add spoken features to existing datasets via crowd-sourcing.
arXiv Detail & Related papers (2022-11-21T16:21:41Z)
- Multimodal Dialogue State Tracking [97.25466640240619]
The Video-Dialogue Transformer Network (VDTN) combines both object-level and segment-level features and learns contextual dependencies between videos and dialogues to generate multimodal dialogue states.
arXiv Detail & Related papers (2022-06-16T03:18:42Z)
- HybriDialogue: An Information-Seeking Dialogue Dataset Grounded on Tabular and Textual Data [87.67278915655712]
We present a new dialogue dataset, HybriDialogue, which consists of crowdsourced natural conversations grounded on both Wikipedia text and tables.
The conversations are created through the decomposition of complex multi-hop questions into simple, realistic multi-turn dialogue interactions.
arXiv Detail & Related papers (2022-04-28T00:52:16Z)
- $C^3$: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues [97.25466640240619]
Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses relevant to both the dialogue and video context.
Most existing approaches employ deep learning models and have achieved remarkable performance, given the relatively small datasets available.
We propose Compositional Counterfactual Contrastive Learning, a novel approach that develops contrastive training between factual and counterfactual samples in video-grounded dialogues.
arXiv Detail & Related papers (2021-06-16T16:05:27Z)
- OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts [35.57757367869986]
We release OpenViDial, a large-scale multi-module dialogue dataset.
OpenViDial contains a total number of 1.1 million dialogue turns.
We propose a family of encoder-decoder models leveraging both textual and visual contexts.
arXiv Detail & Related papers (2020-12-30T03:02:50Z)
- RiSAWOZ: A Large-Scale Multi-Domain Wizard-of-Oz Dataset with Rich Semantic Annotations for Task-Oriented Dialogue Modeling [35.75880078666584]
RiSAWOZ is a large-scale multi-domain Chinese Wizard-of-Oz dataset with rich semantic annotations.
It contains 11.2K human-to-human (H2H) multi-turn semantically annotated dialogues, with more than 150K utterances spanning over 12 domains.
arXiv Detail & Related papers (2020-10-17T08:18:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.