Masked Generative Story Transformer with Character Guidance and Caption
Augmentation
- URL: http://arxiv.org/abs/2403.08502v1
- Date: Wed, 13 Mar 2024 13:10:20 GMT
- Title: Masked Generative Story Transformer with Character Guidance and Caption
Augmentation
- Authors: Christos Papadimitriou, Giorgos Filandrianos, Maria Lymperaiou,
Giorgos Stamou
- Abstract summary: Story visualization is a challenging generative vision task that requires both visual quality and consistency between different frames in generated image sequences.
Previous approaches either employ some kind of memory mechanism to maintain context throughout an auto-regressive generation of the image sequence, or model the generation of the characters and their background separately.
We propose a completely parallel transformer-based approach, relying on Cross-Attention with past and future captions to achieve consistency.
- Score: 2.1392064955842023
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Story Visualization (SV) is a challenging generative vision task that
requires both visual quality and consistency between different frames in
generated image sequences. Previous approaches either employ some kind of
memory mechanism to maintain context throughout an auto-regressive generation
of the image sequence, or model the generation of the characters and their
background separately, to improve the rendering of characters. In contrast,
we embrace a completely parallel transformer-based approach, exclusively
relying on Cross-Attention with past and future captions to achieve
consistency. Additionally, we propose a Character Guidance technique to focus
on the generation of characters in an implicit manner, by forming a combination
of text-conditional and character-conditional logits in the logit space. We
also employ a caption-augmentation technique, carried out by a Large Language
Model (LLM), to enhance the robustness of our approach. The combination of
these methods culminates in state-of-the-art (SOTA) results over various
metrics in the most prominent SV benchmark (Pororo-SV), attained with
constrained resources while achieving lower computational complexity than
prior art. The validity of our quantitative results is supported by a
human survey.
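
As a rough illustration of what "forming a combination of text-conditional and character-conditional logits in the logit space" could look like, here is a minimal sketch. The abstract does not give the actual formula, so the linear interpolation below, the function names `combine_logits` and `softmax`, the `scale` parameter, and the toy shapes are all assumptions made for illustration, not the authors' Character Guidance method.

```python
import numpy as np


def combine_logits(text_logits, char_logits, scale=0.5):
    """Hypothetical logit-space combination (an assumption, not the paper's formula).

    scale=0 returns the text-conditional logits unchanged,
    scale=1 returns the character-conditional logits,
    and values in between (or above 1) push the prediction toward the characters.
    """
    text_logits = np.asarray(text_logits, dtype=float)
    char_logits = np.asarray(char_logits, dtype=float)
    return text_logits + scale * (char_logits - text_logits)


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Toy example: 4 masked image-token positions, codebook of 8 visual tokens.
rng = np.random.default_rng(0)
text_logits = rng.normal(size=(4, 8))
char_logits = rng.normal(size=(4, 8))

probs = softmax(combine_logits(text_logits, char_logits, scale=0.5))
print(probs.shape)
```

Because the combination happens before the softmax, each row of `probs` remains a valid distribution over the token codebook from which masked positions could be sampled in parallel.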
Related papers
- Fusion is all you need: Face Fusion for Customized Identity-Preserving Image Synthesis [7.099258248662009]
Text-to-image (T2I) models have significantly advanced the development of artificial intelligence.
However, existing T2I-based methods often struggle to accurately reproduce the appearance of individuals from a reference image.
We leverage the pre-trained UNet from Stable Diffusion to incorporate the target face image directly into the generation process.
arXiv Detail & Related papers (2024-09-27T19:31:04Z)
- STAR: Scale-wise Text-to-image generation via Auto-Regressive representations [40.66170627483643]
We present STAR, a text-to-image model that employs a scale-wise auto-regressive paradigm.
We show that STAR surpasses existing benchmarks in terms of fidelity, image-text consistency, and aesthetic quality.
arXiv Detail & Related papers (2024-06-16T03:45:45Z) - Story Visualization by Online Text Augmentation with Context Memory [64.86944645907771]
We propose a novel memory architecture for the Bi-directional Transformer framework with an online text augmentation.
The proposed method significantly outperforms the state of the art in various metrics including FID, character F1, frame accuracy, BLEU-2/3, and R-precision.
arXiv Detail & Related papers (2023-08-15T05:08:12Z) - Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion
Models [70.86603627188519]
We focus on a novel, yet challenging task of generating a coherent image sequence based on a given storyline, denoted as open-ended visual storytelling.
We propose a learning-based auto-regressive image generation model, termed StoryGen, with a novel vision-language context module.
We show StoryGen can generalize to unseen characters without any optimization, and generate image sequences with coherent content and consistent characters.
arXiv Detail & Related papers (2023-06-01T17:58:50Z) - Masked and Adaptive Transformer for Exemplar Based Image Translation [16.93344592811513]
Cross-domain semantic matching is challenging.
We propose a masked and adaptive transformer (MAT) for learning accurate cross-domain correspondence.
We devise a novel contrastive style learning method to acquire quality-discriminative style representations.
arXiv Detail & Related papers (2023-03-30T03:21:14Z) - Make-A-Story: Visual Memory Conditioned Consistent Story Generation [57.691064030235985]
We propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context.
Our method outperforms prior state-of-the-art in generating frames with high visual quality.
Our experiments for story generation on the MUGEN, PororoSV and FlintstonesSV datasets show that our method not only outperforms prior state-of-the-art in generating frames with high visual quality, but also models appropriate correspondences between the characters and the background.
arXiv Detail & Related papers (2022-11-23T21:38:51Z) - StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story
Continuation [76.44802273236081]
We develop a model StoryDALL-E for story continuation, where the generated visual story is conditioned on a source image.
We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image.
Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
arXiv Detail & Related papers (2022-09-13T17:47:39Z) - Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z) - Draft-and-Revise: Effective Image Generation with Contextual
RQ-Transformer [40.04085054791994]
We propose an effective image generation framework of Draft-and-Revise with Contextual RQ-transformer to consider global contexts during the generation process.
In experiments, our method achieves state-of-the-art results on conditional image generation.
arXiv Detail & Related papers (2022-06-09T12:25:24Z) - Improving Generation and Evaluation of Visual Stories via Semantic
Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)