OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation
- URL: http://arxiv.org/abs/2310.07749v2
- Date: Fri, 3 Nov 2023 17:30:41 GMT
- Title: OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation
- Authors: Jie An, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng
Liu, Lijuan Wang, Jiebo Luo
- Abstract summary: We propose a new interleaved generation framework based on prompting large-language models (LLMs) and pre-trained text-to-image (T2I) models, namely OpenLEAF.
For model assessment, we first propose to use large multi-modal models (LMMs) to evaluate the entity and style consistencies of open-domain interleaved image-text sequences.
- Score: 151.57313182844936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work investigates a challenging task named open-domain interleaved
image-text generation, which generates interleaved texts and images following
an input query. We propose a new interleaved generation framework based on
prompting large-language models (LLMs) and pre-trained text-to-image (T2I)
models, namely OpenLEAF. In OpenLEAF, the LLM generates textual descriptions,
coordinates T2I models, creates visual prompts for generating images, and
incorporates global contexts into the T2I models. This global context improves
the entity and style consistencies of images in the interleaved generation. For
model assessment, we first propose to use large multi-modal models (LMMs) to
evaluate the entity and style consistencies of open-domain interleaved
image-text sequences. According to the LMM evaluation on our constructed
evaluation set, the proposed interleaved generation framework can generate
high-quality image-text content for various domains and applications, such as
how-to question answering, storytelling, graphical story rewriting, and
webpage/poster generation tasks. Moreover, we validate the effectiveness of the
proposed LMM evaluation technique with human assessment. We hope our proposed
framework, benchmark, and LMM evaluation could help establish the intriguing
interleaved image-text generation task.
Related papers
- Unified Text-to-Image Generation and Retrieval [96.72318842152148]
We propose a unified framework in the context of Multimodal Large Language Models (MLLMs)
We first explore the intrinsic discrimi abilities of MLLMs and introduce a generative retrieval method to perform retrieval in a training-free manner.
We then unify generation and retrieval in an autoregressive generation way and propose an autonomous decision module to choose the best-matched one between generated and retrieved images.
arXiv Detail & Related papers (2024-06-09T15:00:28Z) - Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach [33.231639257323536]
In this paper, we address the issue of dialogue-form context query within the interactive text-to-image retrieval task.
By reformulating the dialogue-form context, we eliminate the necessity of fine-tuning a retrieval model on existing visual dialogue data.
We construct the LLM questioner to generate non-redundant questions about the attributes of the target image.
arXiv Detail & Related papers (2024-06-05T16:09:01Z) - Refining Text-to-Image Generation: Towards Accurate Training-Free Glyph-Enhanced Image Generation [5.55027585813848]
The capability to generate visual text is crucial, offering both academic interest and a wide range of practical applications.
We introduce a benchmark, LenCom-Eval, specifically designed for testing models' capability in generating images with Lengthy and Complex visual text.
We demonstrate notable improvements across a range of evaluation metrics, including CLIPScore, OCR precision, recall, F1 score, accuracy, and edit distance scores.
arXiv Detail & Related papers (2024-03-25T04:54:49Z) - T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional
Text-to-image Generation [62.71574695256264]
T2I-CompBench is a comprehensive benchmark for open-world compositional text-to-image generation.
We propose several evaluation metrics specifically designed to evaluate compositional text-to-image generation.
We introduce a new approach, Generative mOdel fine-tuning with Reward-driven Sample selection (GORS) to boost the compositional text-to-image generation abilities.
arXiv Detail & Related papers (2023-07-12T17:59:42Z) - Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z) - On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - Improving Generation and Evaluation of Visual Stories via Semantic
Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform synthesis text-to-image models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.