MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets
- URL: http://arxiv.org/abs/2403.03194v2
- Date: Thu, 03 Oct 2024 00:32:22 GMT
- Title: MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets
- Authors: Hossein Aboutalebi, Hwanjun Song, Yusheng Xie, Arshit Gupta, Justin Sun, Hang Su, Igor Shalyminov, Nikolaos Pappas, Siffi Singh, Saab Mansour,
- Abstract summary: Multimodal Augmented Generative Images Dialogues (MAGID) is a framework to augment text-only dialogues with diverse and high-quality images.
Our results show that MAGID is comparable to or better than baselines, with significant improvements in human evaluation.
- Score: 29.737965533532577
- Abstract: Development of multimodal interactive systems is hindered by the lack of rich, multimodal (text, images) conversational data, which is needed in large quantities for LLMs. Previous approaches augment textual dialogues with retrieved images, posing privacy, diversity, and quality constraints. In this work, we introduce Multimodal Augmented Generative Images Dialogues (MAGID), a framework to augment text-only dialogues with diverse and high-quality images. Subsequently, a diffusion model is applied to craft corresponding images, ensuring alignment with the identified text. Finally, MAGID incorporates an innovative feedback loop between an image description generation module (textual LLM) and image quality modules (addressing aesthetics, image-text matching, and safety), that work in tandem to generate high-quality and multi-modal dialogues. We compare MAGID to other SOTA baselines on three dialogue datasets, using automated and human evaluation. Our results show that MAGID is comparable to or better than baselines, with significant improvements in human evaluation, especially against retrieval baselines where the image database is small.
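The feedback loop between the description module and the quality modules, as described in the abstract, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `augment_utterance`, `QualityScores`, and `passes`, the thresholds, the retry limit, and the toy stand-ins for the LLM, diffusion model, and scorers are all hypothetical. The abstract does not say what happens when no image passes; returning `None` is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class QualityScores:
    """Hypothetical container for the three quality checks named in the abstract."""
    aesthetics: float
    image_text_match: float
    safe: bool


@dataclass
class Candidate:
    description: str
    image: str  # placeholder; a real pipeline would hold image data here
    scores: QualityScores


def passes(scores: QualityScores, min_aesthetics: float = 0.5, min_match: float = 0.5) -> bool:
    # Thresholds are illustrative assumptions, not values from the paper.
    return (
        scores.safe
        and scores.aesthetics >= min_aesthetics
        and scores.image_text_match >= min_match
    )


def augment_utterance(
    utterance: str,
    describe: Callable[[str, Optional[str]], str],   # stand-in for the textual LLM
    generate: Callable[[str], str],                  # stand-in for the diffusion model
    score: Callable[[str, str], QualityScores],      # stand-in for the quality modules
    max_rounds: int = 3,
) -> Optional[Candidate]:
    """Retry description -> image -> scoring until the image passes or rounds run out."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        description = describe(utterance, feedback)
        image = generate(description)
        scores = score(image, utterance)
        if passes(scores):
            return Candidate(description, image, scores)
        # Feed the failure back to the description module for the next attempt.
        feedback = f"description {description!r} was rejected: {scores}"
    return None  # assumption: give up and keep the utterance text-only


# Toy stand-ins so the sketch runs without any model.
def toy_describe(utterance: str, feedback: Optional[str]) -> str:
    return utterance if feedback is None else f"a detailed photo of {utterance}"


def toy_generate(description: str) -> str:
    return f"<image:{description}>"


def toy_score(image: str, utterance: str) -> QualityScores:
    detailed = "detailed" in image
    return QualityScores(aesthetics=0.9 if detailed else 0.2, image_text_match=0.8, safe=True)


result = augment_utterance("my new puppy", toy_describe, toy_generate, toy_score)
print(result.description)
```

With the toy functions, the first attempt fails the aesthetics threshold and the second attempt, which receives feedback, passes. Passing the three modules in as callables keeps the loop independent of any particular LLM, diffusion model, or scorer.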
Related papers
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model for handling vision-language tasks involving multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning examples, tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes.
ARMADA is a novel multimodal data generation framework that extracts knowledge-grounded attributes from symbolic KBs for semantically consistent yet distinctive image-text pair generation.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
- BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation [21.052101309555464]
Multimodal Dialogue Response Generation (MDRG) is a recently proposed task where the model needs to generate responses in texts, images, or a blend of both.
Previous work relies on the text modality as an intermediary step for both the image input and output of the model rather than adopting an end-to-end approach.
We propose BI-MDRG that bridges the response generation path such that the image history information is utilized for enhanced relevance of text responses to the image content.
arXiv Detail & Related papers (2024-08-12T05:22:42Z)
- SEED-Story: Multimodal Long Story Generation with Large Language Model [66.37077224696242]
SEED-Story is a novel method that leverages a Multimodal Large Language Model (MLLM) to generate extended multimodal stories.
We propose a multimodal attention sink mechanism to enable the generation of stories with up to 25 sequences (only 10 for training) in a highly efficient autoregressive manner.
We present a large-scale and high-resolution dataset named StoryStream for training our model and quantitatively evaluating the task of multimodal story generation in various aspects.
arXiv Detail & Related papers (2024-07-11T17:21:03Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation [46.085482021301516]
We propose DialogGen to align off-the-shelf MLLMs and T2I models to build a Multi-modal Interactive Dialogue System.
It is composed of drawing prompt alignment, careful training data curation, and error correction.
Our experiments on DialogGen and user study demonstrate the effectiveness of DialogGen compared with other State-of-the-Art models.
arXiv Detail & Related papers (2024-03-13T18:00:01Z)
- ZRIGF: An Innovative Multimodal Framework for Zero-Resource Image-Grounded Dialogue Generation [17.310200022696016]
ZRIGF implements a two-stage learning strategy, comprising contrastive pre-training and generative pre-training.
Comprehensive experiments conducted on both text-based and image-grounded dialogue datasets demonstrate ZRIGF's efficacy in generating contextually pertinent and informative responses.
arXiv Detail & Related papers (2023-08-01T09:28:36Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- IMAD: IMage-Augmented multi-modal Dialogue [0.043847653914745384]
This paper presents a novel perspective on multi-modal dialogue systems, which interprets the image in the context of the dialogue.
We propose a two-stage approach to automatically construct a multi-modal dialogue dataset.
In the first stage, we utilize text-to-image similarity and sentence similarity to identify which utterances could be replaced with an image.
In the second stage, we replace those utterances by selecting a subset of relevant images and filtering them with a visual question answering model.
arXiv Detail & Related papers (2023-05-17T18:38:10Z)
- DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset [18.449076451976236]
In this paper, we propose an automated pipeline to construct a multi-modal dialogue dataset.
In our pipeline, to guarantee the coherence between images and dialogue, we prompt GPT-4 to infer potential image-sharing moments.
Through this pipeline, we introduce DialogCC, a high-quality and diverse multi-modal dialogue dataset.
arXiv Detail & Related papers (2022-12-08T07:29:07Z)