MAGID: An Automated Pipeline for Generating Synthetic Multi-modal
Datasets
- URL: http://arxiv.org/abs/2403.03194v1
- Date: Tue, 5 Mar 2024 18:31:28 GMT
- Title: MAGID: An Automated Pipeline for Generating Synthetic Multi-modal
Datasets
- Authors: Hossein Aboutalebi, Hwanjun Song, Yusheng Xie, Arshit Gupta, Justin
Sun, Hang Su, Igor Shalyminov, Nikolaos Pappas, Siffi Singh, Saab Mansour
- Abstract summary: Development of multimodal interactive systems is hindered by the lack of rich, multimodal (text, images) conversational data.
We introduce Multimodal Augmented Generative Images Dialogues (MAGID) to augment text-only dialogues with diverse and high-quality images.
- Score: 30.72744231027204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Development of multimodal interactive systems is hindered by the lack of
rich, multimodal (text, images) conversational data, which is needed in large
quantities for LLMs. Previous approaches augment textual dialogues with
retrieved images, posing privacy, diversity, and quality constraints. In this
work, we introduce \textbf{M}ultimodal \textbf{A}ugmented \textbf{G}enerative
\textbf{I}mages \textbf{D}ialogues (MAGID), a framework to augment text-only
dialogues with diverse and high-quality images. A diffusion model is then
applied to craft corresponding images, ensuring alignment with the
identified utterances. Finally, MAGID incorporates a feedback loop between
an image description generation module (a textual LLM) and image quality
modules (addressing aesthetics, image-text matching, and safety), which work
in tandem to generate high-quality, multi-modal dialogues. We compare MAGID to
other SOTA baselines on three dialogue datasets, using automated and human
evaluation. Our results show that MAGID is comparable to or better than
baselines, with significant improvements in human evaluation, especially
against retrieval baselines where the image database is small.
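The feedback loop described in the abstract can be sketched as a simple generate-then-verify cycle. Everything below is a toy stand-in, not MAGID's actual implementation: a real pipeline would call a textual LLM to draft descriptions, a diffusion model to render images, and learned aesthetics/matching/safety scorers; the functions, thresholds, and scores here are invented for illustration.

```python
# A minimal sketch of a MAGID-style feedback loop between an image
# description generator and image quality modules. All module bodies are
# hypothetical stand-ins for an LLM, a diffusion model, and learned scorers.

def draft_description(utterance: str, feedback: list) -> str:
    """Toy 'LLM': drafts an image description, revising on critique."""
    desc = f"a photo illustrating: {utterance}"
    if feedback:  # pretend the LLM addresses each failing quality aspect
        desc += " (revised for " + ", ".join(feedback) + ")"
    return desc

def score_image(description: str) -> dict:
    """Toy quality modules; real ones would score the rendered image."""
    revised = "revised" in description
    return {
        "aesthetics": 0.9 if revised else 0.4,
        "image_text_match": 0.8,
        "safety": 1.0,
    }

def feedback_loop(utterance: str, threshold: float = 0.6, max_rounds: int = 3):
    """Regenerate until all quality modules pass, or give up."""
    feedback = []
    for round_ in range(1, max_rounds + 1):
        desc = draft_description(utterance, feedback)
        failing = [k for k, v in score_image(desc).items() if v < threshold]
        if not failing:
            return desc, round_
        feedback = failing  # route failing aspects back to the describer
    return None, max_rounds  # give up: keep this dialogue turn text-only

desc, rounds = feedback_loop("two friends hiking at sunrise")
```

The "give up" branch matters in practice: an augmentation pipeline that cannot produce a safe, well-matched image should leave the dialogue text-only rather than attach a low-quality one.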
Related papers
- SEED-Story: Multimodal Long Story Generation with Large Language Model [66.37077224696242]
SEED-Story is a novel method that leverages a Multimodal Large Language Model (MLLM) to generate extended multimodal stories.
We propose multimodal attention sink mechanism to enable the generation of stories with up to 25 sequences (only 10 for training) in a highly efficient autoregressive manner.
We present a large-scale and high-resolution dataset named StoryStream for training our model and quantitatively evaluating the task of multimodal story generation in various aspects.
arXiv Detail & Related papers (2024-07-11T17:21:03Z)
- CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation [20.106207598099363]
We introduce CoMM, a high-quality dataset designed to enhance the coherence, consistency, and alignment of generated multimodal content.
CoMM harnesses raw data from diverse sources, focusing on instructional content and visual storytelling.
Various quality evaluation metrics are designed to demonstrate the high quality of the filtered dataset.
arXiv Detail & Related papers (2024-06-15T01:27:58Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- Unified Text-to-Image Generation and Retrieval [96.72318842152148]
We propose a unified framework in the context of Multimodal Large Language Models (MLLMs)
We first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method to perform retrieval in a training-free manner.
We then unify generation and retrieval in an autoregressive generation way and propose an autonomous decision module to choose the best-matched one between generated and retrieved images.
arXiv Detail & Related papers (2024-06-09T15:00:28Z)
- FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
- ZRIGF: An Innovative Multimodal Framework for Zero-Resource Image-Grounded Dialogue Generation [17.310200022696016]
ZRIGF implements a two-stage learning strategy, comprising contrastive pre-training and generative pre-training.
Comprehensive experiments conducted on both text-based and image-grounded dialogue datasets demonstrate ZRIGF's efficacy in generating contextually pertinent and informative responses.
arXiv Detail & Related papers (2023-08-01T09:28:36Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- IMAD: IMage-Augmented multi-modal Dialogue [0.043847653914745384]
This paper presents a novel perspective on multi-modal dialogue systems, which interprets the image in the context of the dialogue.
We propose a two-stage approach to automatically construct a multi-modal dialogue dataset.
In the first stage, we utilize text-to-image similarity and sentence similarity to identify which utterances could be replaced with an image.
In the second stage, we replace those utterances by selecting a subset of relevant images and filtering them with a visual question answering model.
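The two-stage IMAD-style construction can be sketched end to end with toy components. The word-overlap cosine below is a deliberately crude stand-in for a real text-to-image similarity model (e.g. CLIP), and `vqa_ok` stands in for a visual question answering filter; the function names, threshold, and data are all invented for illustration.

```python
from collections import Counter
from math import sqrt

# Toy sketch of a two-stage multi-modal dialogue construction:
# stage 1 picks utterances replaceable by an image via similarity,
# stage 2 filters candidate images with a (stubbed) VQA-style check.

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity, a crude proxy for a CLIP score."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = sqrt(sum(v * v for v in va.values()))
    nb = sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def vqa_ok(caption: str, utterance: str) -> bool:
    """Stage-2 stand-in: a real filter would query a VQA model."""
    return any(w in caption.lower() for w in utterance.lower().split())

def build_multimodal_dialogue(dialogue, image_captions, sim_threshold=0.3):
    """Replace utterances with images when similarity and VQA checks pass."""
    result = []
    for utt in dialogue:
        # Stage 1: best candidate image (represented by its caption).
        best = max(image_captions, key=lambda cap: cosine(utt, cap))
        if cosine(utt, best) >= sim_threshold and vqa_ok(best, utt):
            result.append(("image", best))  # utterance replaced by an image
        else:
            result.append(("text", utt))    # kept as plain text
    return result

dialogue = ["hello how are you today", "look at my new golden retriever puppy"]
captions = ["a golden retriever puppy on grass", "a red sports car"]
result = build_multimodal_dialogue(dialogue, captions)
```

In this toy run, the greeting stays textual (no caption resembles it) while the pet utterance is replaced by the matching image, mirroring the replace-then-filter structure described above.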
arXiv Detail & Related papers (2023-05-17T18:38:10Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics from both aspects of input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset [18.449076451976236]
In this paper, we propose an automated pipeline to construct a multi-modal dialogue dataset.
In our pipeline, to guarantee the coherence between images and dialogue, we prompt GPT-4 to infer potential image-sharing moments.
Through this pipeline, we introduce DialogCC, a high-quality and diverse multi-modal dialogue dataset.
arXiv Detail & Related papers (2022-12-08T07:29:07Z)
- Constructing Multi-Modal Dialogue Dataset by Replacing Text with Semantically Relevant Images [17.076424447172297]
This paper proposes a 45k multi-modal dialogue dataset created with minimal human intervention.
Our method to create such a dataset consists of (1) preparing and pre-processing text dialogue datasets, (2) creating image-mixed dialogues by using a text-to-image replacement technique, and (3) employing a contextual-similarity-based filtering step.
arXiv Detail & Related papers (2021-07-19T08:44:11Z)
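Step (3) of the pipeline above, contextual-similarity filtering, is the part that distinguishes it from per-utterance replacement: a candidate image should fit the surrounding dialogue, not just the single replaced turn. The sketch below illustrates that idea with Jaccard word overlap as a toy similarity function; the window size, threshold, and all data are invented, not values from the paper.

```python
# Toy sketch of contextual-similarity filtering for an image-mixed
# dialogue: keep the replacement only if the image's caption also matches
# the utterances around the replaced position.

def jaccard(a: str, b: str) -> float:
    """Word-set overlap, a crude stand-in for a learned similarity model."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def context_filter(dialogue, idx, caption, sim, ctx_window=1, threshold=0.2):
    """True if the caption is similar enough to the surrounding context."""
    lo, hi = max(0, idx - ctx_window), min(len(dialogue), idx + ctx_window + 1)
    context = " ".join(u for i, u in enumerate(dialogue)
                       if lo <= i < hi and i != idx)
    return sim(context, caption) >= threshold

dialogue = ["did you get a new pet", "yes I did", "it is so cute"]
# Candidate images for position 1, represented by their captions:
keep = context_filter(dialogue, 1, "a cute new pet dog", jaccard)
drop = context_filter(dialogue, 1, "a red sports car", jaccard)
```

Here the pet photo survives the filter because neighboring turns mention the pet, while the unrelated car image is rejected even though it might score well against some isolated utterance.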
This list is automatically generated from the titles and abstracts of the papers in this site.