InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large
Multimodal and Language Models
- URL: http://arxiv.org/abs/2312.13503v1
- Date: Thu, 21 Dec 2023 00:44:45 GMT
- Title: InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large
Multimodal and Language Models
- Authors: Bingbing Wen, Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Bill Howe,
Lijuan Wang
- Abstract summary: We build a visual dialogue dataset, named InfoVisDial, which provides rich informative answers in each round.
For effective data collection, the key idea is to bridge a large-scale multimodal model (e.g., GIT) and a language model (e.g., GPT-3).
- Score: 123.1441379479263
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we build a visual dialogue dataset, named InfoVisDial, which
provides rich informative answers in each round even with external knowledge
related to the visual content. Different from existing datasets where the
answer is compact and short, InfoVisDial contains long free-form answers with
rich information in each round of dialogue. For effective data collection, the
key idea is to bridge a large-scale multimodal model (e.g., GIT) and a
language model (e.g., GPT-3). GIT can describe the image content even with
scene text, while GPT-3 can generate informative dialogue based on the image
description and appropriate prompting techniques. With such an automatic
pipeline, we can readily generate informative visual dialogue data at scale.
We then ask human annotators to rate the generated dialogues to filter out
low-quality conversations. Human analyses show that InfoVisDial covers informative and
diverse dialogue topics: $54.4\%$ of the dialogue rounds are related to image
scene texts, and $36.7\%$ require external knowledge. Each round's answer is
also long and open-ended: $87.3\%$ of answers are unique with an average length
of $8.9$, compared with $27.37\%$ and $2.9$ in VisDial. Finally, we propose a
strong baseline by adapting the GIT model to the visual dialogue task and
fine-tuning it on InfoVisDial. We hope our work can motivate more effort in
this direction.
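The abstract describes the bridging pipeline only at a high level; below is a minimal sketch of how such a GIT-to-GPT-3 pipeline could be wired up. The checkpoint names, the prompt wording, and the legacy OpenAI completion call are illustrative assumptions, not the authors' released code.

```python
# Sketch of the described bridging pipeline: GIT captions the image
# (including scene text), then a GPT-3-style model expands the caption
# into a multi-round informative dialogue. Checkpoints, prompt text, and
# the OpenAI call are assumptions for illustration, not the paper's code.
from PIL import Image
import openai  # legacy (<1.0) completions client, used as a stand-in for GPT-3
from transformers import AutoProcessor, AutoModelForCausalLM

# 1) Describe the image with a scene-text-aware GIT checkpoint.
processor = AutoProcessor.from_pretrained("microsoft/git-large-textcaps")
git_model = AutoModelForCausalLM.from_pretrained("microsoft/git-large-textcaps")

def describe_image(path: str) -> str:
    image = Image.open(path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    ids = git_model.generate(pixel_values=pixel_values, max_length=50)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]

# 2) Prompt a GPT-3-style model to write an informative dialogue
#    grounded in the caption (the prompt text below is a guess).
PROMPT = (
    "Image description: {caption}\n"
    "Write a five-round question-answer dialogue about this image. "
    "Answers should be long, free-form, and may draw on external knowledge.\n"
    "Q1:"
)

def generate_dialogue(caption: str) -> str:
    response = openai.Completion.create(
        model="text-davinci-003",  # assumed GPT-3 engine
        prompt=PROMPT.format(caption=caption),
        max_tokens=512,
        temperature=0.7,
    )
    return response["choices"][0]["text"]

if __name__ == "__main__":
    caption = describe_image("example.jpg")  # hypothetical input image
    print(generate_dialogue(caption))
    # In the paper, generated dialogues are then rated by human annotators
    # so that low-quality conversations can be filtered out.
```

For the proposed baseline, a natural formulation would feed the image together with the concatenated dialogue history and current question into GIT and train it to generate the answer, though the abstract does not spell out the exact fine-tuning setup.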
Related papers
- Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos [3.0758169771529693]
We introduce a dataset comprised of $2,017$ videos with $5,986$ human-annotated dialogues consisting of $40,954$ interleaved dialogue turns.
A conversational model must both recognize pertinent visual details over time and answer questions where the required information is not necessarily present in the visual information.
arXiv Detail & Related papers (2025-06-11T17:23:35Z) - DialogStudio: Towards Richest and Most Diverse Unified Dataset
Collection for Conversational AI [92.29874802394167]
DialogStudio is the largest and most diverse collection of dialogue datasets.
Our collection encompasses data from open-domain dialogues, task-oriented dialogues, natural language understanding, conversational recommendation, dialogue summarization, and knowledge-grounded dialogues.
arXiv Detail & Related papers (2023-07-19T17:57:53Z) - A Unified Framework for Slot based Response Generation in a Multimodal
Dialogue System [25.17100881568308]
Natural Language Understanding (NLU) and Natural Language Generation (NLG) are the two critical components of every conversational system.
We propose an end-to-end framework with the capability to extract necessary slot values from the utterance.
We employ a multimodal hierarchical encoder using pre-trained DialoGPT to provide a stronger context for both tasks.
arXiv Detail & Related papers (2023-05-27T10:06:03Z) - TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real
World [97.58623810402563]
We introduce a new video-based multi-modal dialogue dataset, called TikTalk.
We collect 38K videos from a popular video-sharing platform, along with 367K conversations posted by users beneath them.
Users engage in spontaneous conversations based on their multi-modal experiences from watching videos, which helps recreate real-world chitchat context.
arXiv Detail & Related papers (2023-01-14T10:18:22Z) - Dialog Inpainting: Turning Documents into Dialogs [12.131506050808207]
We produce two datasets totalling 19 million diverse information-seeking dialogs.
Human raters judge the answer adequacy and conversationality of WikiDialog to be as good or better than existing manually-collected datasets.
arXiv Detail & Related papers (2022-05-18T16:58:50Z) - DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded
Dialogue [30.930757279692163]
A video-grounded dialogue system is required to understand both dialogue and video.
Existing benchmarks do not have enough annotations to help analyze dialogue systems.
We present a diagnostic dataset that can test a range of reasoning abilities on videos and dialogues.
arXiv Detail & Related papers (2021-01-01T03:20:22Z) - OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual
Contexts [35.57757367869986]
We release OpenViDial, a large-scale multi-module dialogue dataset.
OpenViDial contains a total number of 1.1 million dialogue turns.
We propose a family of encoder-decoder models leveraging both textual and visual contexts.
arXiv Detail & Related papers (2020-12-30T03:02:50Z) - History for Visual Dialog: Do we really need it? [55.642625058602924]
We show that co-attention models which explicitly encode dialog history outperform models that don't.
We also expose shortcomings of the crowd-sourcing dataset collection procedure.
arXiv Detail & Related papers (2020-05-08T14:58:09Z) - VD-BERT: A Unified Vision and Dialog Transformer with BERT [161.0016161052714]
We propose VD-BERT, a simple yet effective framework of unified vision-dialog Transformer.
We adapt BERT for the effective fusion of vision and dialog contents via visually grounded training.
Our model yields new state of the art, achieving the top position in both single-model and ensemble settings.
arXiv Detail & Related papers (2020-04-28T04:08:46Z) - Modality-Balanced Models for Visual Dialogue [102.35406085738325]
The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response to the dialogue.
We show that previous joint-modality (history and image) models over-rely on and are more prone to memorizing the dialogue history.
We present methods for this integration of the two models, via ensemble and consensus dropout fusion with shared parameters.
arXiv Detail & Related papers (2020-01-17T14:57:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.