OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation
- URL: http://arxiv.org/abs/2310.07749v2
- Date: Fri, 3 Nov 2023 17:30:41 GMT
- Title: OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation
- Authors: Jie An, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng
Liu, Lijuan Wang, Jiebo Luo
- Abstract summary: We propose a new interleaved generation framework, named OpenLEAF, based on prompting large language models (LLMs) and pre-trained text-to-image (T2I) models.
For model assessment, we first propose to use large multi-modal models (LMMs) to evaluate the entity and style consistencies of open-domain interleaved image-text sequences.
- Score: 151.57313182844936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work investigates a challenging task named open-domain interleaved
image-text generation, which generates interleaved texts and images following
an input query. We propose a new interleaved generation framework based on
prompting large language models (LLMs) and pre-trained text-to-image (T2I)
models, named OpenLEAF. In OpenLEAF, the LLM generates textual descriptions,
coordinates T2I models, creates visual prompts for generating images, and
incorporates global contexts into the T2I models. This global context improves
the entity and style consistencies of images in the interleaved generation. For
model assessment, we first propose to use large multi-modal models (LMMs) to
evaluate the entity and style consistencies of open-domain interleaved
image-text sequences. According to the LMM evaluation on our constructed
evaluation set, the proposed interleaved generation framework can generate
high-quality image-text content for various domains and applications, such as
how-to question answering, storytelling, graphical story rewriting, and
webpage/poster generation tasks. Moreover, we validate the effectiveness of the
proposed LMM evaluation technique with human assessment. We hope our proposed
framework, benchmark, and LMM evaluation could help establish the intriguing
interleaved image-text generation task.
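As a concrete illustration of the pipeline the abstract describes, here is a minimal Python sketch of an OpenLEAF-style generation loop plus an LMM-based consistency check. The `call_llm`, `call_lmm`, and `call_t2i` helpers are hypothetical stand-ins for arbitrary model backends, not APIs from the paper.

```python
# Minimal sketch of an OpenLEAF-style interleaved generation loop.
# call_llm / call_lmm / call_t2i are hypothetical stand-ins for any
# LLM, LMM, and text-to-image backends; they are NOT the paper's API.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat LLM")

def call_lmm(prompt: str, images: list) -> str:
    raise NotImplementedError("plug in any large multi-modal model")

def call_t2i(prompt: str):
    raise NotImplementedError("plug in any text-to-image model")

def generate_interleaved(query: str, num_steps: int = 4):
    """Alternate text and image generation, threading a global context
    through every visual prompt so entities and style stay consistent."""
    # 1. Derive a global context: recurring entities plus a shared style.
    global_context = call_llm(
        "List the recurring entities and a consistent visual style "
        f"to reuse in every image for this request: {query}"
    )
    sequence = []
    for step in range(num_steps):
        # 2. The LLM writes the next text segment, conditioned on history.
        text = call_llm(
            f"Request: {query}\nStory so far: "
            f"{[t for t, _ in sequence]}\nWrite segment {step + 1}."
        )
        # 3. The LLM rewrites the segment as a visual prompt, injecting
        #    the global context so the T2I model keeps entities/style aligned.
        visual_prompt = call_llm(
            "Rewrite as an image prompt under these constraints:\n"
            f"{global_context}\nSegment: {text}"
        )
        sequence.append((text, call_t2i(visual_prompt)))
    return sequence

def lmm_consistency_check(sequence):
    """Sketch of the LMM-as-judge evaluation: ask a multi-modal model
    to rate entity and style consistency across the image sequence."""
    images = [img for _, img in sequence]
    return call_lmm(
        "Rate the entity consistency and style consistency (1-5) of "
        "these images from one interleaved sequence.", images
    )
```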
Related papers
- Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment [53.45813302866466]
We present ISG, a comprehensive evaluation framework for interleaved text-and-image generation.
ISG evaluates responses on four levels of granularity: holistic, structural, block-level, and image-specific.
In conjunction with ISG, we introduce a benchmark, ISG-Bench, encompassing 1,150 samples across 8 categories and 21 subcategories.
arXiv Detail & Related papers (2024-11-26T07:55:57Z)
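As a loose illustration of scoring at the four granularity levels named above, here is a tiny Python sketch; the level names come from the summary, while the scorer callables are placeholder assumptions, not ISG's actual metrics.

```python
# Sketch of aggregating per-level scores over ISG's four granularity
# levels. Level names are from the summary above; the scorers are
# placeholder assumptions, not ISG's actual metrics.

LEVELS = ("holistic", "structural", "block-level", "image-specific")

def evaluate_response(response, scorers: dict) -> dict:
    """Score one interleaved response at each granularity level.
    `scorers` maps a level name to a callable returning a float."""
    return {level: scorers[level](response) for level in LEVELS}

# Usage with dummy scorers that always return 0.0:
dummy = {level: (lambda resp: 0.0) for level in LEVELS}
print(evaluate_response("example interleaved response", dummy))
```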
- Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models.
We use GPT4V to bridge the gap between the reference image and the text input for the T2I model.
We also present ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z)
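The summary above outlines a simple loop: caption the reference image, regenerate it with the model under test, and compare the two images. A minimal sketch follows, assuming hypothetical `describe_image` and `image_similarity` helpers (e.g. a GPT-4V-style captioner and a perceptual similarity), not the paper's API.

```python
# Sketch of the Image Regeneration evaluation loop. describe_image
# and image_similarity are hypothetical stand-ins (e.g. a GPT-4V-style
# captioner and a CLIP/perceptual similarity), not the paper's API.

def describe_image(image) -> str:
    raise NotImplementedError("plug in an LMM captioner, e.g. GPT-4V")

def image_similarity(a, b) -> float:
    raise NotImplementedError("plug in a perceptual/CLIP similarity")

def regeneration_score(reference_image, t2i_model) -> float:
    """Higher score = the T2I model reproduces the reference image
    more faithfully from its own textual description."""
    prompt = describe_image(reference_image)  # bridge image -> text
    regenerated = t2i_model(prompt)           # model under evaluation
    return image_similarity(reference_image, regenerated)
```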
- Unified Text-to-Image Generation and Retrieval [96.72318842152148]
We propose a unified framework in the context of Multimodal Large Language Models (MLLMs).
We first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method that performs retrieval in a training-free manner.
We then unify generation and retrieval in an autoregressive manner and propose an autonomous decision module to choose the best match between the generated and retrieved images.
arXiv Detail & Related papers (2024-06-09T15:00:28Z)
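The decision step described above reduces to pooling the generated and retrieved candidates and keeping the best-scoring one; a minimal sketch, with all callables as hypothetical stand-ins rather than the paper's implementation:

```python
# Sketch of the "generate, retrieve, then decide" step: pool one
# generated image with retrieved ones and keep the best match.
# generate / retrieve / score are hypothetical stand-ins.

def choose_image(prompt, generate, retrieve, score):
    """Return the candidate that best matches `prompt`.
    generate(prompt) -> image; retrieve(prompt) -> list of images;
    score(prompt, image) -> float, higher is better."""
    candidates = [generate(prompt)] + list(retrieve(prompt))
    return max(candidates, key=lambda image: score(prompt, image))
```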
- Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach [33.231639257323536]
In this paper, we address dialogue-form context queries within the interactive text-to-image retrieval task.
By reformulating the dialogue-form context, we eliminate the necessity of fine-tuning a retrieval model on existing visual dialogue data.
We construct the LLM questioner to generate non-redundant questions about the attributes of the target image.
arXiv Detail & Related papers (2024-06-05T16:09:01Z)
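A minimal sketch of the plug-and-play loop described above: an LLM questioner asks non-redundant attribute questions, and the dialogue is then reformulated into a single caption-style query so an off-the-shelf retriever needs no dialogue fine-tuning. `ask_llm` is a hypothetical stand-in for any chat LLM, not the paper's API.

```python
# Sketch of the plug-and-play interactive retrieval loop: question,
# answer, then reformulate the dialogue into one caption-like query.
# ask_llm is a hypothetical stand-in for any chat LLM.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat LLM")

def interactive_retrieve(first_query, answer_fn, retriever, rounds=3):
    dialogue = [("user", first_query)]
    for _ in range(rounds):
        # Ask about an attribute not covered by the dialogue so far.
        question = ask_llm(
            "Ask one new question about an attribute of the target "
            f"image that is not already covered by: {dialogue}"
        )
        dialogue.append(("questioner", question))
        dialogue.append(("user", answer_fn(question)))
    # Reformulate the whole dialogue into a single caption-like query.
    query = ask_llm(f"Rewrite this dialogue as an image caption: {dialogue}")
    return retriever(query)
```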
- T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation [62.71574695256264]
T2I-CompBench is a comprehensive benchmark for open-world compositional text-to-image generation.
We propose several evaluation metrics specifically designed to evaluate compositional text-to-image generation.
We introduce a new approach, Generative mOdel fine-tuning with Reward-driven Sample selection (GORS) to boost the compositional text-to-image generation abilities.
arXiv Detail & Related papers (2023-07-12T17:59:42Z)
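Reward-driven sample selection in the spirit of GORS can be pictured as: generate images, keep the well-aligned (high-reward) ones, and fine-tune on them with the reward as a sample weight. A minimal sketch, where `reward_fn` and `fine_tune` are hypothetical stand-ins, not the paper's code:

```python
# Sketch of reward-driven sample selection in the spirit of GORS.
# reward_fn and fine_tune are hypothetical stand-ins.

def gors_step(model, prompts, reward_fn, fine_tune, threshold=0.8):
    selected = []
    for prompt in prompts:
        image = model(prompt)
        reward = reward_fn(prompt, image)   # compositional alignment score
        if reward >= threshold:             # keep only well-aligned samples
            selected.append((prompt, image, reward))
    fine_tune(model, selected)              # reward-weighted fine-tuning
    return model
```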
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
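The fusion described above can be pictured as two small trained projections bridging a frozen LLM and frozen image encoder/decoder. A minimal PyTorch sketch follows; the module names and dimensions are illustrative assumptions, not the paper's architecture.

```python
# Sketch of bridging a frozen text-only LLM with frozen image
# encoder/decoder models: only two linear projections are trained.
# Module names and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class FusionBridge(nn.Module):
    def __init__(self, llm_dim=4096, vision_dim=768):
        super().__init__()
        # image embedding -> LLM input space (image understanding)
        self.in_proj = nn.Linear(vision_dim, llm_dim)
        # LLM hidden state -> image decoder space (image generation)
        self.out_proj = nn.Linear(llm_dim, vision_dim)

    def encode_image(self, image_emb):    # feed images into the frozen LLM
        return self.in_proj(image_emb)

    def decode_hidden(self, llm_hidden):  # condition the frozen image decoder
        return self.out_proj(llm_hidden)

bridge = FusionBridge()
print(bridge.encode_image(torch.randn(1, 768)).shape)  # torch.Size([1, 4096])
```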