OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual
Contexts
- URL: http://arxiv.org/abs/2012.15015v1
- Date: Wed, 30 Dec 2020 03:02:50 GMT
- Title: OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual
Contexts
- Authors: Yuxian Meng, Shuhe Wang, Qinghong Han, Xiaofei Sun, Fei Wu, Rui Yan
and Jiwei Li
- Abstract summary: We release OpenViDial, a large-scale multi-modal dialogue dataset.
OpenViDial contains a total of 1.1 million dialogue turns.
We propose a family of encoder-decoder models leveraging both textual and visual contexts.
- Score: 35.57757367869986
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When humans converse, what a speaker will say next depends
significantly on what they see. Unfortunately, existing dialogue models
generate utterances based only on preceding textual contexts, and visual
contexts are rarely considered. This is due to the lack of a large-scale
multi-modal dialogue dataset with utterances paired with visual contexts. In
this paper, we release OpenViDial, a large-scale multi-modal dialogue
dataset. The dialogue turns and visual contexts are extracted from movies and
TV series, where each dialogue turn is paired with the corresponding visual
context in which it takes place. OpenViDial contains a total of 1.1 million
dialogue turns, and thus 1.1 million visual contexts stored as images. Based
on this dataset, we propose a family of encoder-decoder models leveraging
both textual and visual contexts, from coarse-grained image features
extracted by CNNs to fine-grained object features extracted by Faster R-CNN.
We observe that visual information significantly improves dialogue generation
quality, verifying the necessity of integrating multi-modal features for
dialogue learning. Our work marks an important step towards large-scale
multi-modal dialogue learning.
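As a rough, illustrative sketch of the coarse-grained variant described in the
abstract, the PyTorch snippet below projects a single pooled CNN image feature
into the model dimension and prepends it to the encoded dialogue history so
the decoder can attend to visual context at every step. All module names,
layer sizes, and the fusion scheme are assumptions for illustration, not the
authors' released implementation.

```python
# Illustrative sketch (assumed architecture, not the authors' code): a
# coarse-grained visual dialogue model that fuses one pooled CNN image
# feature with the encoded dialogue history via the decoder's cross-attention.
import torch
import torch.nn as nn

class VisualDialogueModel(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, img_feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.img_proj = nn.Linear(img_feat_dim, d_model)  # CNN feature -> model dim
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, history_ids, img_feat, target_ids):
        # Encode the textual dialogue history.
        memory = self.encoder(self.embed(history_ids))
        # Prepend the projected image feature as one extra "memory token",
        # so every decoding step can attend to the visual context.
        img_token = self.img_proj(img_feat).unsqueeze(1)      # (B, 1, d_model)
        memory = torch.cat([img_token, memory], dim=1)
        # Causal mask for autoregressive generation of the next utterance.
        t = target_ids.size(1)
        mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        hidden = self.decoder(self.embed(target_ids), memory, tgt_mask=mask)
        return self.out(hidden)                               # next-token logits

# Toy usage: a batch of 2 dialogues with, e.g., one pooled ResNet feature each.
model = VisualDialogueModel()
logits = model(torch.randint(0, 32000, (2, 20)),
               torch.randn(2, 2048),
               torch.randint(0, 32000, (2, 12)))
```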
Related papers
- Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation [55.043492250775294]
We introduce a novel Face-to-Face spoken dialogue model.
It processes audio-visual speech from the user and generates audio-visual speech in response.
We also introduce MultiDialog, the first large-scale multimodal spoken dialogue corpus.
arXiv Detail & Related papers (2024-06-12T04:48:36Z)
- Multi-turn Dialogue Comprehension from a Topic-aware Perspective [70.37126956655985]
This paper proposes to model multi-turn dialogues from a topic-aware perspective.
We use a dialogue segmentation algorithm to split a dialogue passage into topic-concentrated fragments in an unsupervised way.
We also present a novel model, Topic-Aware Dual-Attention Matching (TADAM) Network, which takes topic segments as processing elements.
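The summary above does not specify the segmentation algorithm, so the
following is only a generic illustration of unsupervised topic segmentation:
cut the dialogue wherever consecutive utterance embeddings diverge (a
TextTiling-style heuristic). The `embed` encoder and the threshold are
hypothetical.

```python
# Generic unsupervised topic segmentation (TextTiling-style heuristic),
# NOT the paper's algorithm: start a new fragment where neighbouring
# utterance embeddings diverge. `embed` is a hypothetical sentence encoder.
import numpy as np

def segment_dialogue(utterances, embed, sim_threshold=0.3):
    """Split utterance strings into topic-concentrated fragments."""
    vecs = np.stack([embed(u) for u in utterances])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalise rows
    segments, current = [], [utterances[0]]
    for i in range(1, len(utterances)):
        cosine = float(vecs[i - 1] @ vecs[i])   # similarity of adjacent turns
        if cosine < sim_threshold:              # likely topic shift
            segments.append(current)
            current = []
        current.append(utterances[i])
    segments.append(current)
    return segments
```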
arXiv Detail & Related papers (2023-09-18T11:03:55Z)
- DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI [92.29874802394167]
DialogStudio is the largest and most diverse collection of dialogue datasets.
Our collection encompasses data from open-domain dialogues, task-oriented dialogues, natural language understanding, conversational recommendation, dialogue summarization, and knowledge-grounded dialogues.
arXiv Detail & Related papers (2023-07-19T17:57:53Z)
- VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions [47.94531693056304]
Video-grounded dialogue understanding is a challenging problem that requires a machine to perceive, parse, and reason over situated semantics.
We present the Video-grounded Scene&Topic AwaRe (VSTAR) dialogue dataset, a large-scale video-grounded dialogue understanding dataset based on 395 TV series.
arXiv Detail & Related papers (2023-05-30T05:40:37Z)
- IMAD: IMage-Augmented multi-modal Dialogue [0.043847653914745384]
This paper presents a novel perspective on multi-modal dialogue systems, which interprets the image in the context of the dialogue.
We propose a two-stage approach to automatically construct a multi-modal dialogue dataset.
In the first stage, we utilize text-to-image similarity and sentence similarity to identify which utterances could be replaced with an image.
In the second stage, we replace those utterances by selecting a subset of relevant images and filtering them with a visual question answering model.
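A schematic version of that two-stage construction might look like the sketch
below; `text_image_sim` and `vqa_consistent` are hypothetical stand-ins (e.g.
a CLIP-style scorer and a VQA filter), and the first-stage check is collapsed
to a single similarity test for brevity.

```python
# Schematic two-stage construction in the spirit of the summary above.
# `text_image_sim` and `vqa_consistent` are hypothetical stand-in models,
# not the ones used in the paper.
def build_multimodal_dialogue(dialogue, image_pool,
                              text_image_sim, vqa_consistent,
                              sim_threshold=0.25):
    result = []
    for utterance in dialogue:
        # Stage 1: find the best-matching image; skip weakly "visual" turns.
        best_img = max(image_pool, key=lambda img: text_image_sim(utterance, img))
        if text_image_sim(utterance, best_img) < sim_threshold:
            result.append(("text", utterance))
            continue
        # Stage 2: keep the image only if a VQA model agrees it conveys
        # roughly the same content as the replaced utterance.
        if vqa_consistent(best_img, utterance):
            result.append(("image", best_img))
        else:
            result.append(("text", utterance))
    return result
```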
arXiv Detail & Related papers (2023-05-17T18:38:10Z)
- Multimodal Dialogue State Tracking [97.25466640240619]
The Video-Dialogue Transformer Network (VDTN) combines object-level and segment-level video features and learns contextual dependencies between videos and dialogues to generate multimodal dialogue states.
arXiv Detail & Related papers (2022-06-16T03:18:42Z)
- OpenViDial 2.0: A Larger-Scale, Open-Domain Dialogue Generation Dataset with Visual Contexts [20.37658842432543]
We release OpenViDial 2.0, a larger-scale open-domain multi-modal dialogue dataset.
OpenViDial 2.0 contains a total of 5.6 million dialogue turns extracted from movies and TV series.
arXiv Detail & Related papers (2021-09-27T02:10:29Z)
- MMChat: Multi-Modal Chat Dataset on Social Media [8.904627457711683]
MMChat is a large-scale multi-modal dialogue corpus (32.4M raw dialogues and 120.84K filtered dialogues).
Unlike previous corpora that are crowd-sourced or collected from fictitious movies, MMChat contains image-grounded dialogues collected from real conversations on social media.
We develop a benchmark model for dialogue generation that adapts an attention-routing mechanism over image features.
arXiv Detail & Related papers (2021-08-16T15:27:49Z)
- Modeling Text-visual Mutual Dependency for Multi-modal Dialog Generation [35.45552689723718]
We propose frameworks to resolve a specific case of real-world multi-modal dialogue generation.
Specifically, we propose to model the mutual dependency between textual and visual features.
We observe significant performance boosts over vanilla models when this mutual dependency is modeled.
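One standard way to realise such mutual dependency is
maximum-mutual-information reranking, which scores each candidate reply by
the forward likelihood plus a backward term asking how well the reply
predicts its (textual and visual) context. The sketch below assumes two
hypothetical log-probability scorers; it illustrates the general idea, not
the paper's implementation.

```python
# MMI-style reranking sketch (one common way to model mutual dependency;
# assumed for illustration, not taken from the paper). Both scorers are
# hypothetical callables returning log-probabilities.
def mmi_rerank(candidates, context, log_p_reply_given_ctx,
               log_p_ctx_given_reply, lam=0.5):
    def score(reply):
        forward = log_p_reply_given_ctx(reply, context)
        backward = log_p_ctx_given_reply(context, reply)  # reply must "explain" context
        return forward + lam * backward
    return max(candidates, key=score)
```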
arXiv Detail & Related papers (2021-05-30T07:20:28Z)
- Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues [73.04906599884868]
We propose a novel framework, Reasoning Paths in Dialogue Context (PDC).
The PDC model discovers information flows among dialogue turns through a semantic graph constructed from lexical components in each question and answer.
Our model sequentially processes both visual and textual information through this reasoning path and the propagated features are used to generate the answer.
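As a loose illustration of that idea, the snippet below builds a path by
hopping backwards through turns that share content words with the question,
propagating the accumulated vocabulary as it goes; this greedy
lexical-overlap walk is an assumed simplification, not the PDC model itself.

```python
# Loose illustration (assumed simplification, not the PDC implementation):
# walk backwards through the dialogue, collecting turns that share content
# words with the question, and propagate the vocabulary along the path.
def build_reasoning_path(turns, question,
                         stopwords=frozenset({"the", "a", "an", "is", "to"})):
    def content_words(text):
        return {w for w in text.lower().split() if w not in stopwords}

    frontier = content_words(question)
    path = []
    for i in range(len(turns) - 1, -1, -1):       # most recent turn first
        overlap = content_words(turns[i]) & frontier
        if overlap:
            path.append(i)
            frontier |= content_words(turns[i])   # information flows along path
    return list(reversed(path))                   # chronological reasoning path
```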
arXiv Detail & Related papers (2021-03-01T07:39:26Z)