IMAD: IMage-Augmented multi-modal Dialogue
- URL: http://arxiv.org/abs/2305.10512v2
- Date: Sat, 16 Dec 2023 10:18:21 GMT
- Title: IMAD: IMage-Augmented multi-modal Dialogue
- Authors: Viktor Moskvoretskii, Anton Frolov, Denis Kuznetsov
- Abstract summary: This paper presents a novel perspective on multi-modal dialogue systems, which interprets the image in the context of the dialogue.
We propose a two-stage approach to automatically construct a multi-modal dialogue dataset.
In the first stage, we utilize text-to-image similarity and sentence similarity to identify which utterances could be replaced with an image.
In the second stage, we replace those utterances by selecting a subset of relevant images and filtering them with a visual question answering model.
- Score: 0.043847653914745384
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Currently, dialogue systems have achieved high performance in processing
text-based communication. However, they have not yet effectively incorporated
visual information, which poses a significant challenge. Furthermore, existing
models that incorporate images in dialogue generation focus on discussing the
image itself. Our proposed approach presents a novel perspective on multi-modal
dialogue systems, which interprets the image in the context of the dialogue. By
doing so, we aim to expand the capabilities of current dialogue systems and
transition them from single modality (text) to multi-modality. However, there
is a lack of validated English datasets that contain both images and dialogue
contexts for this task. Thus, we propose a two-stage approach to automatically
construct a multi-modal dialogue dataset. In the first stage, we utilize
text-to-image similarity and sentence similarity to identify which utterances
could be replaced with an image. In the second stage, we replace those
utterances by selecting a subset of relevant images and filtering them with a
visual question answering model. We used this approach, along with additional
labeling, to create the IMage Augmented multi-modal Dialogue dataset (IMAD),
which can serve as a validated dataset for this task. Furthermore, we propose a
baseline model trained on this dataset, which outperforms a model trained on the
same data without images, as well as BlenderBot.
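A minimal sketch of how such a two-stage construction pipeline could look, assuming CLIP for text-to-image similarity, a sentence-transformer for sentence similarity, and BLIP for the visual question answering filter; the model checkpoints, thresholds, and answer-overlap heuristic are illustrative assumptions rather than the exact configuration used for IMAD.

```python
# Hypothetical sketch of the two-stage construction described above.
# Model checkpoints, thresholds, and the answer-overlap heuristic are
# illustrative assumptions, not the configuration reported for IMAD.
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import (BlipForQuestionAnswering, BlipProcessor,
                          CLIPModel, CLIPProcessor)

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
sent_enc = SentenceTransformer("all-MiniLM-L6-v2")
vqa = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
vqa_proc = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")


def stage1_is_replaceable(utterance, context, image_pool,
                          t2i_thr=25.0, ctx_thr=0.4):
    """Stage 1: flag an utterance as replaceable if it is visually groundable
    (high CLIP text-to-image score against a pool of PIL images) while still
    being related to the dialogue context (sentence similarity)."""
    inputs = clip_proc(text=[utterance], images=image_pool,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        t2i = clip(**inputs).logits_per_text  # shape: (1, len(image_pool))
    emb = sent_enc.encode([utterance, " ".join(context)], convert_to_tensor=True)
    ctx_sim = util.cos_sim(emb[0], emb[1]).item()
    return t2i.max().item() > t2i_thr and ctx_sim > ctx_thr


def stage2_select_image(utterance, candidates,
                        question="What is shown in the image?"):
    """Stage 2: rank candidate images by CLIP similarity to the utterance,
    then keep the best image only if a VQA model's answer overlaps with the
    utterance (a crude stand-in for the paper's VQA-based filtering)."""
    inputs = clip_proc(text=[utterance], images=candidates,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = clip(**inputs).logits_per_text[0]
    best = candidates[int(scores.argmax())]
    out = vqa.generate(**vqa_proc(best, question, return_tensors="pt"))
    answer = vqa_proc.decode(out[0], skip_special_tokens=True)
    return best if any(w in utterance.lower() for w in answer.lower().split()) else None
```

Under these assumptions, an utterance that passes stage 1 would be replaced by the image returned from stage 2, or left as text if the filter rejects every candidate.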
Related papers
- BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation [21.052101309555464]
Multimodal Dialogue Response Generation (MDRG) is a recently proposed task where the model needs to generate responses in texts, images, or a blend of both.
Previous work relies on the text modality as an intermediary step for both the image input and output of the model rather than adopting an end-to-end approach.
We propose BI-MDRG that bridges the response generation path such that the image history information is utilized for enhanced relevance of text responses to the image content.
arXiv Detail & Related papers (2024-08-12T05:22:42Z)
- MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets [29.737965533532577]
Multimodal Augmented Generative Images Dialogues (MAGID) is a framework to augment text-only dialogues with diverse and high-quality images.
Our results show that MAGID is comparable to or better than baselines, with significant improvements in human evaluation.
arXiv Detail & Related papers (2024-03-05T18:31:28Z)
- DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
arXiv Detail & Related papers (2024-01-02T07:40:12Z)
- Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models [60.81438804824749]
Multimodal instruction-following models extend the capabilities of large language models by integrating both text and images.
Existing models such as MiniGPT-4 and LLaVA face challenges in maintaining dialogue coherence in scenarios involving multiple images.
We introduce SparklesDialogue, the first machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions.
We then present SparklesChat, a multimodal instruction-following model for open-ended dialogues across multiple images.
arXiv Detail & Related papers (2023-08-31T05:15:27Z)
- DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset [18.449076451976236]
In this paper, we propose an automated pipeline to construct a multi-modal dialogue dataset.
In our pipeline, to guarantee the coherence between images and dialogue, we prompt GPT-4 to infer potential image-sharing moments.
Through this pipeline, we introduce DialogCC, a high-quality and diverse multi-modal dialogue dataset.
arXiv Detail & Related papers (2022-12-08T07:29:07Z)
- Manual-Guided Dialogue for Flexible Conversational Agents [84.46598430403886]
How to build and use dialogue data efficiently, and how to deploy models across different domains at scale, are critical issues in building a task-oriented dialogue system.
We propose a novel manual-guided dialogue scheme, where the agent learns the tasks from both dialogue and manuals.
Our proposed scheme reduces the dependence of dialogue models on fine-grained domain ontology, and makes them more flexible to adapt to various domains.
arXiv Detail & Related papers (2022-08-16T08:21:12Z)
- Constructing Multi-Modal Dialogue Dataset by Replacing Text with Semantically Relevant Images [17.076424447172297]
This paper proposes a 45k multi-modal dialogue dataset created with minimal human intervention.
Our method to create such a dataset consists of (1) preparing and pre-processing text dialogue datasets, (2) creating image-mixed dialogues by using a text-to-image replacement technique, and (3) employing a contextual-similarity-based filtering step.
arXiv Detail & Related papers (2021-07-19T08:44:11Z)
- OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts [35.57757367869986]
We release OpenViDial, a large-scale multi-modal dialogue dataset.
OpenViDial contains a total number of 1.1 million dialogue turns.
We propose a family of encoder-decoder models leveraging both textual and visual contexts.
arXiv Detail & Related papers (2020-12-30T03:02:50Z)
- VD-BERT: A Unified Vision and Dialog Transformer with BERT [161.0016161052714]
We propose VD-BERT, a simple yet effective framework of unified vision-dialog Transformer.
We adapt BERT for the effective fusion of vision and dialog contents via visually grounded training.
Our model sets a new state of the art, achieving the top position in both single-model and ensemble settings.
arXiv Detail & Related papers (2020-04-28T04:08:46Z)
- Paraphrase Augmented Task-Oriented Dialog Generation [68.1790912977053]
We propose a paraphrase augmented response generation (PARG) framework that jointly trains a paraphrase model and a response generation model.
We also design a method to automatically construct a paraphrase training dataset based on dialog state and dialog act labels.
arXiv Detail & Related papers (2020-04-16T05:12:36Z)
- Modality-Balanced Models for Visual Dialogue [102.35406085738325]
The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response to the dialogue.
We show that previous joint-modality (history and image) models over-rely on and are more prone to memorizing the dialogue history.
We present methods for integrating the two models via ensemble and consensus dropout fusion with shared parameters.
arXiv Detail & Related papers (2020-01-17T14:57:12Z)