MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
- URL: http://arxiv.org/abs/2310.02239v3
- Date: Fri, 15 Mar 2024 21:54:08 GMT
- Title: MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
- Authors: Kaizhi Zheng, Xuehai He, Xin Eric Wang
- Abstract summary: We introduce a novel interleaved vision-and-language generation method, centered around the concept of generative vokens.
Our method is marked by a unique two-stage training strategy for description-free multimodal generation.
Our model, MiniGPT-5, exhibits substantial improvement over the baseline models on multimodal generation datasets.
- Score: 22.802963850131306
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The effectiveness of Multimodal Large Language Models (MLLMs) demonstrates a profound capability in multimodal understanding. However, the simultaneous generation of images with coherent texts is still underdeveloped. Addressing this, we introduce a novel interleaved vision-and-language generation method, centered around the concept of "generative vokens". These vokens serve as pivotal elements contributing to coherent image-text outputs. Our method is marked by a unique two-stage training strategy for description-free multimodal generation, which does not necessitate extensive descriptions of images. We integrate classifier-free guidance to enhance the alignment of generated images and texts, ensuring more seamless and contextually relevant multimodal interactions. Our model, MiniGPT-5, exhibits substantial improvement over the baseline models on multimodal generation datasets, including MMDialog and VIST. Human evaluation shows MiniGPT-5 is better than the baseline model in more than 56% of cases for multimodal generation, highlighting its efficacy across diverse benchmarks.
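As a rough illustration of the pipeline the abstract describes, the sketch below shows one way a "generative voken" setup can be wired: special voken tokens are appended to the LLM output, their hidden states are mapped through a small feature mapper into an image-conditioning space, and classifier-free guidance blends conditional and unconditional denoiser predictions. All module names, dimensions, and the toy denoiser are assumptions for illustration, not MiniGPT-5's actual implementation.

```python
# Hypothetical sketch of the "generative voken" idea described in the abstract.
# All names, dimensions, and the toy denoiser are illustrative assumptions,
# not MiniGPT-5's actual code.
import torch
import torch.nn as nn

LLM_DIM, COND_DIM, NUM_VOKENS = 4096, 768, 8  # assumed sizes

class VokenFeatureMapper(nn.Module):
    """Maps LLM hidden states at voken positions into an image-conditioning space."""
    def __init__(self, llm_dim: int, cond_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(llm_dim, cond_dim), nn.GELU(), nn.Linear(cond_dim, cond_dim)
        )

    def forward(self, voken_hidden: torch.Tensor) -> torch.Tensor:
        # voken_hidden: (batch, num_vokens, llm_dim) -> (batch, num_vokens, cond_dim)
        return self.mlp(voken_hidden)

def classifier_free_guidance(denoiser, x_t, t, cond, uncond, scale: float = 7.5):
    """Blend conditional and unconditional predictions: e_u + s * (e_c - e_u)."""
    eps_cond = denoiser(x_t, t, cond)
    eps_uncond = denoiser(x_t, t, uncond)
    return eps_uncond + scale * (eps_cond - eps_uncond)

if __name__ == "__main__":
    batch = 2
    # Stand-in for the LLM's hidden states at the appended voken positions.
    voken_hidden = torch.randn(batch, NUM_VOKENS, LLM_DIM)
    mapper = VokenFeatureMapper(LLM_DIM, COND_DIM)
    cond = mapper(voken_hidden)        # conditioning features for the image decoder
    uncond = torch.zeros_like(cond)    # "null" condition used for guidance

    # Toy stand-in for a diffusion denoiser conditioned on the voken features.
    def toy_denoiser(x_t, t, c):
        return x_t * 0.1 + c.mean(dim=(1, 2), keepdim=True).view(-1, 1, 1, 1)

    x_t = torch.randn(batch, 4, 64, 64)  # noisy image latent
    t = torch.tensor([500, 500])         # diffusion timestep
    eps = classifier_free_guidance(toy_denoiser, x_t, t, cond, uncond)
    print(eps.shape)  # torch.Size([2, 4, 64, 64])
```

The guidance scale trades conditioning fidelity against diversity; the exact feature-mapper architecture and guidance settings used by MiniGPT-5 are not specified in this summary.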
Related papers
- MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model [49.931663904599205]
MaVEn is an innovative framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning.
We show that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.
arXiv Detail & Related papers (2024-08-22T11:57:16Z)
- Shapley Value-based Contrastive Alignment for Multimodal Information Extraction [17.04865437165252]
We introduce a new paradigm of Image-Context-Text interaction.
We propose a novel Shapley Value-based Contrastive Alignment (Shap-CA) method.
Our method significantly outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2024-07-25T08:15:43Z)
- Harmonizing Visual Text Comprehension and Generation [31.605599298507293]
We present TextHarmony, a unified and versatile multimodal generative model proficient in comprehending and generating visual text.
We propose Slide-LoRA, which aggregates modality-specific and modality-agnostic LoRA experts, partially decoupling the multimodal generation space.
Comprehensive experiments across various benchmarks demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2024-07-23T10:11:56Z)
- Hierarchical Multi-modal Transformer for Cross-modal Long Document Classification [74.45521856327001]
How to classify long documents containing hierarchically structured text and embedded images is a new problem.
We propose a novel approach called Hierarchical Multi-modal Transformer (HMT) for cross-modal long document classification.
Our approach uses a multi-modal transformer and a dynamic multi-scale multi-modal transformer to model the complex relationships among image features, section features, and sentence features.
arXiv Detail & Related papers (2024-07-14T07:12:25Z)
- SEED-Story: Multimodal Long Story Generation with Large Language Model [66.37077224696242]
SEED-Story is a novel method that leverages a Multimodal Large Language Model (MLLM) to generate extended multimodal stories.
We propose a multimodal attention sink mechanism to enable the generation of stories with up to 25 sequences (only 10 used during training) in a highly efficient autoregressive manner.
We present a large-scale and high-resolution dataset named StoryStream for training our model and quantitatively evaluating the task of multimodal story generation in various aspects.
arXiv Detail & Related papers (2024-07-11T17:21:03Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation [46.085482021301516]
We propose DialogGen to align off-the-shelf MLLMs and T2I models to build a Multi-modal Interactive Dialogue System.
It is composed of drawing prompt alignment, careful training data curation, and error correction.
Our experiments and a user study demonstrate the effectiveness of DialogGen compared with other state-of-the-art models.
arXiv Detail & Related papers (2024-03-13T18:00:01Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks.
Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix.
Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
- MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning [23.45678557013005]
We propose a jointly masked multimodal modeling method to learn fine-grained multimodal representations.
Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover.
Our model achieves state-of-the-art performance on various downstream vision-language tasks, including image-text retrieval, visual question answering, visual reasoning, and weakly-supervised visual grounding.
arXiv Detail & Related papers (2022-10-09T06:31:15Z)
- Fusion Models for Improved Visual Captioning [18.016295296424413]
This paper proposes a generic multimodal model fusion framework for caption generation and emendation.
We employ the same fusion strategies to integrate a pretrained Masked Language Model (MLM) with a visual captioning model, viz. Show, Attend, and Tell.
Our caption emendation experiments on three benchmark image captioning datasets, viz. Flickr8k, Flickr30k, and MSCOCO, show improvements over the baseline.
arXiv Detail & Related papers (2020-10-28T21:55:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.