Character-Centric Story Visualization via Visual Planning and Token
Alignment
- URL: http://arxiv.org/abs/2210.08465v3
- Date: Thu, 20 Oct 2022 15:53:40 GMT
- Title: Character-Centric Story Visualization via Visual Planning and Token
Alignment
- Authors: Hong Chen, Rujun Han, Te-Lin Wu, Hideki Nakayama and Nanyun Peng
- Abstract summary: Story visualization advances traditional text-to-image generation by enabling the generation of multiple images based on a complete story.
A key challenge of consistent story visualization is to preserve the characters that are essential to the story.
We propose to adapt a recent work that augments Vector-Quantized Variational Autoencoders with a text-to-visual-token architecture.
- Score: 53.44760407148918
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Story visualization advances traditional text-to-image generation by
enabling the generation of multiple images based on a complete story. This task
requires machines to 1) understand long text inputs and 2) produce a globally
consistent image sequence that illustrates the contents of the story. A key
challenge of consistent story visualization is to preserve characters that are
essential in stories. To tackle the challenge, we propose to adapt a recent
work that augments Vector-Quantized Variational Autoencoders (VQ-VAE) with a
text-to-visual-token (transformer) architecture. Specifically, we modify the
text-to-visual-token module with a two-stage framework: 1) character token
planning model that predicts the visual tokens for characters only; 2) visual
token completion model that generates the remaining visual token sequence,
which is sent to VQ-VAE for finalizing image generations. To encourage
characters to appear in the images, we further train the two-stage framework
with a character-token alignment objective. Extensive experiments and
evaluations demonstrate that the proposed method excels at preserving
characters and can produce higher quality image sequences compared with the
strong baselines. Code can be found at https://github.com/sairin1202/VP-CSV.
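The two-stage pipeline described above is concrete enough to sketch. The following PyTorch-style pseudocode is a minimal illustration of the character-token planning / visual-token completion split and the character-token alignment objective; the module names, token shapes, character-position mask, and greedy one-shot decoding are assumptions made for brevity, not the authors' implementation (the official code is in the repository linked above).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 1024      # assumed size of the VQ-VAE visual-token codebook
SEQ_LEN = 256     # assumed number of visual tokens per frame (e.g. a 16x16 grid)
D_MODEL = 512
MASK_ID = VOCAB   # extra id used as a "not yet generated" placeholder

class TokenTransformer(nn.Module):
    """Text-conditioned visual-token predictor; stands in for both stages."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB + 1, D_MODEL)            # +1 for MASK_ID
        self.pos_emb = nn.Parameter(torch.zeros(SEQ_LEN, D_MODEL))
        layer = nn.TransformerDecoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens, text_memory):
        x = self.tok_emb(tokens) + self.pos_emb
        h = self.decoder(x, text_memory)        # cross-attend to the encoded story text
        return self.head(h)                     # (B, SEQ_LEN, VOCAB) logits

def generate_frame(text_memory, char_mask, planner, completer, vqvae_decode):
    """char_mask (B, SEQ_LEN) marks grid positions assumed to belong to characters."""
    B = text_memory.size(0)
    tokens = torch.full((B, SEQ_LEN), MASK_ID,
                        dtype=torch.long, device=text_memory.device)
    # Stage 1: character token planning, predicting visual tokens for characters only.
    char_tokens = planner(tokens, text_memory).argmax(-1)
    tokens = torch.where(char_mask, char_tokens, tokens)
    # Stage 2: visual token completion, filling in the remaining positions.
    full_tokens = completer(tokens, text_memory).argmax(-1)
    tokens = torch.where(char_mask, tokens, full_tokens)
    # The finished visual-token sequence is decoded into an image by the VQ-VAE.
    return vqvae_decode(tokens)

def char_alignment_loss(char_logits, char_targets, char_mask):
    """Illustrative stand-in for the character-token alignment objective: a
    cross-entropy restricted to character positions, encouraging the predicted
    tokens at those positions to match the characters' visual tokens."""
    ce = F.cross_entropy(char_logits.transpose(1, 2), char_targets, reduction="none")
    return (ce * char_mask.float()).sum() / char_mask.float().sum().clamp(min=1)
```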
Related papers
- SA$^2$VP: Spatially Aligned-and-Adapted Visual Prompt [59.280491260635266]
Methods for visual prompt tuning follow the sequential modeling paradigm stemming from NLP.
The SA$^2$VP model learns a two-dimensional prompt token map of equal (or scaled) size to the image token map.
Our model can conduct individual prompting for different image tokens in a fine-grained manner.
arXiv Detail & Related papers (2023-12-16T08:23:43Z)
- Story Visualization by Online Text Augmentation with Context Memory [64.86944645907771]
We propose a novel memory architecture for the Bi-directional Transformer framework with an online text augmentation.
The proposed method significantly outperforms the state of the art in various metrics, including FID, character F1, frame accuracy, BLEU-2/3, and R-precision.
arXiv Detail & Related papers (2023-08-15T05:08:12Z)
- Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models [70.86603627188519]
We focus on a novel, yet challenging task of generating a coherent image sequence based on a given storyline, denoted as open-ended visual storytelling.
We propose a learning-based auto-regressive image generation model, termed as StoryGen, with a novel vision-language context module.
We show that StoryGen can generalize to unseen characters without any optimization and generate image sequences with coherent content and consistent characters.
arXiv Detail & Related papers (2023-06-01T17:58:50Z)
- TaleCrafter: Interactive Story Visualization with Multiple Characters [49.14122401339003]
This paper proposes a system for generic interactive story visualization.
It is capable of handling multiple novel characters and supporting the editing of layout and local structure.
The system comprises four interconnected components: story-to-prompt generation (S2P), text-to-layout generation (T2L), controllable text-to-image generation (C-T2I), and image-to-video animation (I2V).
arXiv Detail & Related papers (2023-05-29T17:11:39Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- Vision Transformer Based Model for Describing a Set of Images as a Story [26.717033245063092]
We propose a novel Vision Transformer Based Model for describing a set of images as a story.
The proposed method extracts the distinct features of the input images using a Vision Transformer (ViT).
The performance of the proposed model is evaluated using the Visual Story-Telling dataset (VIST).
arXiv Detail & Related papers (2022-10-06T09:01:50Z)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
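Several of the entries above hinge on representing an image as a sequence of discrete tokens (the VQ-VAE in the main paper, ViT-VQGAN in the Parti entry, and LQAE's text-token codebook). The snippet below is a generic nearest-neighbour vector-quantization sketch of that step, not the actual ViT-VQGAN or LQAE code; the codebook size and patch-feature shapes are assumptions.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal codebook lookup: continuous patch features -> discrete token ids."""
    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, feats):                                    # feats: (B, N, dim)
        dists = torch.cdist(feats, self.codebook.weight[None])   # (B, N, num_codes)
        ids = dists.argmin(dim=-1)                                # token ids, (B, N)
        return ids, self.codebook(ids)                            # ids + quantized feats

# Usage: `ids` is the "image as a sequence of discrete tokens" that an autoregressive
# text-to-image model (as in the Parti entry) learns to predict from text.
vq = VectorQuantizer()
patch_feats = torch.randn(2, 196, 256)   # e.g. a 14x14 grid of ViT patch features
ids, _ = vq(patch_feats)
```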
This list is automatically generated from the titles and abstracts of the papers on this site.