StoryGPT-V: Large Language Models as Consistent Story Visualizers
- URL: http://arxiv.org/abs/2312.02252v2
- Date: Wed, 13 Dec 2023 11:21:32 GMT
- Title: StoryGPT-V: Large Language Models as Consistent Story Visualizers
- Authors: Xiaoqian Shen and Mohamed Elhoseiny
- Abstract summary: Recent generative models have demonstrated impressive capabilities in generating realistic and visually pleasing images grounded on textual prompts.
Yet, the emerging Large Language Model (LLM) showcases robust reasoning abilities to navigate through ambiguous references.
We introduce StoryGPT-V, which leverages the merits of latent diffusion models (LDM) and LLMs to produce images with consistent and high-quality characters.
- Score: 39.790319429455856
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent generative models have demonstrated impressive capabilities in
generating realistic and visually pleasing images grounded on textual prompts.
Nevertheless, a significant challenge remains in applying these models to the
more intricate task of story visualization, since it requires resolving
pronouns (he, she, they) in the frame descriptions, i.e., anaphora resolution,
and ensuring consistent character and background synthesis across frames. Yet,
the emerging Large Language Model (LLM) showcases robust reasoning abilities to
navigate through ambiguous references and process extensive sequences.
Therefore, we introduce StoryGPT-V, which leverages the merits of
latent diffusion models (LDM) and LLMs to produce images with consistent and
high-quality characters grounded on given story descriptions. First, we train a
character-aware LDM, which takes character-augmented semantic embedding as
input and includes the supervision of the cross-attention map using character
segmentation masks, aiming to enhance character generation accuracy and
faithfulness. In the second stage, we enable an alignment between the output of
LLM and the character-augmented embedding residing in the input space of the
first-stage model. This harnesses the reasoning ability of LLM to address
ambiguous references and the comprehension capability to memorize the context.
We conduct comprehensive experiments on two story visualization
benchmarks. Our model reports superior quantitative results and consistently
generates accurate characters of remarkable quality with low memory
consumption. Our code will be made publicly available.
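
The two training stages described in the abstract can be illustrated with a small sketch. The abstract does not give the loss formulas, so both functions below are assumptions made for illustration only: `attention_mask_loss` is one plausible way to supervise a cross-attention map with a character segmentation mask (penalizing attention mass that falls outside the mask), and `alignment_loss` is a plain mean squared error between an LLM output vector and a target character-augmented embedding. The function names, the list-based data layout, and the choice of losses are hypothetical and are not taken from the paper or its code.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def attention_mask_loss(attn: Matrix, mask: Matrix, eps: float = 1e-8) -> float:
    """Share of a token's attention mass that lies outside its character mask.

    attn: non-negative cross-attention weights over a spatial grid (assumed).
    mask: binary segmentation mask of the same shape (1 = character pixel).
    Returns a value near 0 when attention sits inside the mask and near 1
    when it sits outside. This is an illustrative loss, not the paper's.
    """
    total = sum(sum(row) for row in attn)
    inside = sum(
        a * m
        for attn_row, mask_row in zip(attn, mask)
        for a, m in zip(attn_row, mask_row)
    )
    return 1.0 - inside / (total + eps)


def alignment_loss(llm_out: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between an LLM output vector and a target embedding.

    Stands in for the second-stage alignment between the LLM output and the
    character-augmented embedding; the actual objective is an assumption.
    """
    if len(llm_out) != len(target):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(llm_out, target)) / len(target)


if __name__ == "__main__":
    # Toy 2x2 attention map with most of its mass in the masked right column.
    attn = [[0.1, 0.4], [0.1, 0.4]]
    mask = [[0.0, 1.0], [0.0, 1.0]]
    print(attention_mask_loss(attn, mask))
    print(alignment_loss([1.0, 2.0], [1.0, 4.0]))
```

In a real model these quantities would be computed on tensors inside the training loop; plain Python lists are used here only to keep the sketch self-contained.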
Related papers
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- Improving Visual Storytelling with Multimodal Large Language Models [1.325953054381901]
This paper presents a novel approach leveraging large language models (LLMs) and large vision-language models (LVLMs).
We introduce a new dataset comprising diverse visual stories, annotated with detailed captions and multimodal elements.
Our method employs a combination of supervised and reinforcement learning to fine-tune the model, enhancing its narrative generation capabilities.
arXiv Detail & Related papers (2024-07-02T18:13:55Z)
- Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support a dynamic sequence length that varies with the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
- Make-A-Story: Visual Memory Conditioned Consistent Story Generation [57.691064030235985]
We propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context.
Our experiments for story generation on the MUGEN, PororoSV, and FlintstonesSV datasets show that our method not only outperforms prior state-of-the-art in generating frames with high visual quality, but also models appropriate correspondences between the characters and the background.
arXiv Detail & Related papers (2022-11-23T21:38:51Z)
- StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation [76.44802273236081]
We develop a model, StoryDALL-E, for story continuation, where the generated visual story is conditioned on a source image.
We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image.
Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
arXiv Detail & Related papers (2022-09-13T17:47:39Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)