MAGID: An Automated Pipeline for Generating Synthetic Multi-modal
Datasets
- URL: http://arxiv.org/abs/2403.03194v1
- Date: Tue, 5 Mar 2024 18:31:28 GMT
- Title: MAGID: An Automated Pipeline for Generating Synthetic Multi-modal
Datasets
- Authors: Hossein Aboutalebi, Hwanjun Song, Yusheng Xie, Arshit Gupta, Justin
Sun, Hang Su, Igor Shalyminov, Nikolaos Pappas, Siffi Singh, Saab Mansour
- Abstract summary: Development of multimodal interactive systems is hindered by the lack of rich, multimodal (text, images) conversational data.
We introduce Multimodal Augmented Generative Images Dialogues (MAGID) to augment text-only dialogues with diverse and high-quality images.
- Score: 30.72744231027204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Development of multimodal interactive systems is hindered by the lack of
rich, multimodal (text, images) conversational data, which is needed in large
quantities for LLMs. Previous approaches augment textual dialogues with
retrieved images, posing privacy, diversity, and quality constraints. In this
work, we introduce \textbf{M}ultimodal \textbf{A}ugmented \textbf{G}enerative
\textbf{I}mages \textbf{D}ialogues (MAGID), a framework to augment text-only
dialogues with diverse and high-quality images. A diffusion model is then
applied to craft corresponding images, ensuring alignment with the
identified utterances. Finally, MAGID incorporates a feedback loop between
an image description generation module (a textual LLM) and image quality
modules (addressing aesthetics, image-text matching, and safety), which work
in tandem to generate high-quality, multi-modal dialogues. We compare MAGID to
other SOTA baselines on three dialogue datasets, using automated and human
evaluation. Our results show that MAGID is comparable to or better than
baselines, with significant improvements in human evaluation, especially
against retrieval baselines where the image database is small.
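The feedback loop described in the abstract can be sketched as a simple generate-then-verify cycle. Everything below is a toy stand-in, not MAGID's actual implementation: a real pipeline would call a textual LLM to draft descriptions, a diffusion model to render images, and learned aesthetics/matching/safety scorers; the functions, thresholds, and scores here are invented for illustration.

```python
# A minimal sketch of a MAGID-style feedback loop between an image
# description generator and image quality modules. All module bodies are
# hypothetical stand-ins for an LLM, a diffusion model, and learned scorers.

def draft_description(utterance: str, feedback: list) -> str:
    """Toy 'LLM': drafts an image description, revising on critique."""
    desc = f"a photo illustrating: {utterance}"
    if feedback:  # pretend the LLM addresses each failing quality aspect
        desc += " (revised for " + ", ".join(feedback) + ")"
    return desc

def score_image(description: str) -> dict:
    """Toy quality modules; real ones would score the rendered image."""
    revised = "revised" in description
    return {
        "aesthetics": 0.9 if revised else 0.4,
        "image_text_match": 0.8,
        "safety": 1.0,
    }

def feedback_loop(utterance: str, threshold: float = 0.6, max_rounds: int = 3):
    """Regenerate until all quality modules pass, or give up."""
    feedback = []
    for round_ in range(1, max_rounds + 1):
        desc = draft_description(utterance, feedback)
        failing = [k for k, v in score_image(desc).items() if v < threshold]
        if not failing:
            return desc, round_
        feedback = failing  # route failing aspects back to the describer
    return None, max_rounds  # give up: keep this dialogue turn text-only

desc, rounds = feedback_loop("two friends hiking at sunrise")
```

The "give up" branch matters in practice: an augmentation pipeline that cannot produce a safe, well-matched image should leave the dialogue text-only rather than attach a low-quality one.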
Related papers
- SEED-Story: Multimodal Long Story Generation with Large Language Model [66.37077224696242]
SEED-Story is a novel method that leverages a Multimodal Large Language Model (MLLM) to generate extended multimodal stories.
We propose multimodal attention sink mechanism to enable the generation of stories with up to 25 sequences (only 10 for training) in a highly efficient autoregressive manner.
We present a large-scale and high-resolution dataset named StoryStream for training our model and quantitatively evaluating the task of multimodal story generation in various aspects.
arXiv Detail & Related papers (2024-07-11T17:21:03Z)
- CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation [20.106207598099363]
We introduce CoMM, a high-quality dataset designed to enhance the coherence, consistency, and alignment of generated multimodal content.
CoMM harnesses raw data from diverse sources, focusing on instructional content and visual storytelling.
Various quality evaluation metrics are designed to demonstrate the high quality of the filtered dataset.
arXiv Detail & Related papers (2024-06-15T01:27:58Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- Unified Text-to-Image Generation and Retrieval [96.72318842152148]
We propose a unified framework in the context of Multimodal Large Language Models (MLLMs)
We first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method to perform retrieval in a training-free manner.
We then unify generation and retrieval in an autoregressive generation way and propose an autonomous decision module to choose the best-matched one between generated and retrieved images.
arXiv Detail & Related papers (2024-06-09T15:00:28Z)
- FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
- ZRIGF: An Innovative Multimodal Framework for Zero-Resource Image-Grounded Dialogue Generation [17.310200022696016]
ZRIGF implements a two-stage learning strategy, comprising contrastive pre-training and generative pre-training.
Comprehensive experiments conducted on both text-based and image-grounded dialogue datasets demonstrate ZRIGF's efficacy in generating contextually pertinent and informative responses.
arXiv Detail & Related papers (2023-08-01T09:28:36Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- IMAD: IMage-Augmented multi-modal Dialogue [0.043847653914745384]
This paper presents a novel perspective on multi-modal dialogue systems, which interprets the image in the context of the dialogue.
We propose a two-stage approach to automatically construct a multi-modal dialogue dataset.
In the first stage, we utilize text-to-image similarity and sentence similarity to identify which utterances could be replaced with an image.
In the second stage, we replace those utterances by selecting a subset of relevant images and filtering them with a visual question answering model.
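The two-stage IMAD-style construction can be sketched end to end with toy components. The word-overlap cosine below is a deliberately crude stand-in for a real text-to-image similarity model (e.g. CLIP), and `vqa_ok` stands in for a visual question answering filter; the function names, threshold, and data are all invented for illustration.

```python
from collections import Counter
from math import sqrt

# Toy sketch of a two-stage multi-modal dialogue construction:
# stage 1 picks utterances replaceable by an image via similarity,
# stage 2 filters candidate images with a (stubbed) VQA-style check.

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity, a crude proxy for a CLIP score."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = sqrt(sum(v * v for v in va.values()))
    nb = sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def vqa_ok(caption: str, utterance: str) -> bool:
    """Stage-2 stand-in: a real filter would query a VQA model."""
    return any(w in caption.lower() for w in utterance.lower().split())

def build_multimodal_dialogue(dialogue, image_captions, sim_threshold=0.3):
    """Replace utterances with images when similarity and VQA checks pass."""
    result = []
    for utt in dialogue:
        # Stage 1: best candidate image (represented by its caption).
        best = max(image_captions, key=lambda cap: cosine(utt, cap))
        if cosine(utt, best) >= sim_threshold and vqa_ok(best, utt):
            result.append(("image", best))  # utterance replaced by an image
        else:
            result.append(("text", utt))    # kept as plain text
    return result

dialogue = ["hello how are you today", "look at my new golden retriever puppy"]
captions = ["a golden retriever puppy on grass", "a red sports car"]
result = build_multimodal_dialogue(dialogue, captions)
```

In this toy run, the greeting stays textual (no caption resembles it) while the pet utterance is replaced by the matching image, mirroring the replace-then-filter structure described above.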
arXiv Detail & Related papers (2023-05-17T18:38:10Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics from both aspects of input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset [18.449076451976236]
In this paper, we propose an automated pipeline to construct a multi-modal dialogue dataset.
In our pipeline, to guarantee the coherence between images and dialogue, we prompt GPT-4 to infer potential image-sharing moments.
Through this pipeline, we introduce DialogCC, a high-quality and diverse multi-modal dialogue dataset.
arXiv Detail & Related papers (2022-12-08T07:29:07Z)
- Constructing Multi-Modal Dialogue Dataset by Replacing Text with Semantically Relevant Images [17.076424447172297]
This paper proposes a 45k multi-modal dialogue dataset created with minimal human intervention.
Our method to create such a dataset consists of (1) preparing and pre-processing text dialogue datasets, (2) creating image-mixed dialogues by using a text-to-image replacement technique, and (3) employing a contextual-similarity-based filtering step.
arXiv Detail & Related papers (2021-07-19T08:44:11Z)
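Step (3) of the pipeline above, contextual-similarity filtering, is the part that distinguishes it from per-utterance replacement: a candidate image should fit the surrounding dialogue, not just the single replaced turn. The sketch below illustrates that idea with Jaccard word overlap as a toy similarity function; the window size, threshold, and all data are invented, not values from the paper.

```python
# Toy sketch of contextual-similarity filtering for an image-mixed
# dialogue: keep the replacement only if the image's caption also matches
# the utterances around the replaced position.

def jaccard(a: str, b: str) -> float:
    """Word-set overlap, a crude stand-in for a learned similarity model."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def context_filter(dialogue, idx, caption, sim, ctx_window=1, threshold=0.2):
    """True if the caption is similar enough to the surrounding context."""
    lo, hi = max(0, idx - ctx_window), min(len(dialogue), idx + ctx_window + 1)
    context = " ".join(u for i, u in enumerate(dialogue)
                       if lo <= i < hi and i != idx)
    return sim(context, caption) >= threshold

dialogue = ["did you get a new pet", "yes I did", "it is so cute"]
# Candidate images for position 1, represented by their captions:
keep = context_filter(dialogue, 1, "a cute new pet dog", jaccard)
drop = context_filter(dialogue, 1, "a red sports car", jaccard)
```

Here the pet photo survives the filter because neighboring turns mention the pet, while the unrelated car image is rejected even though it might score well against some isolated utterance.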
This list is automatically generated from the titles and abstracts of the papers in this site.