TaleCrafter: Interactive Story Visualization with Multiple Characters
- URL: http://arxiv.org/abs/2305.18247v2
- Date: Tue, 30 May 2023 08:54:42 GMT
- Title: TaleCrafter: Interactive Story Visualization with Multiple Characters
- Authors: Yuan Gong, Youxin Pang, Xiaodong Cun, Menghan Xia, Yingqing He, Haoxin
Chen, Longyue Wang, Yong Zhang, Xintao Wang, Ying Shan, Yujiu Yang
- Abstract summary: This paper proposes a system for generic interactive story visualization.
It is capable of handling multiple novel characters and supporting the editing of layout and local structure.
The system comprises four interconnected components: story-to-prompt generation (S2P), text-to-generation (T2L), controllable text-to-image generation (C-T2I) and image-to-video animation (I2V)
- Score: 49.14122401339003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate Story visualization requires several necessary elements, such as
identity consistency across frames, the alignment between plain text and visual
content, and a reasonable layout of objects in images. Most previous works
endeavor to meet these requirements by fitting a text-to-image (T2I) model on a
set of videos in the same style and with the same characters, e.g., the
FlintstonesSV dataset. However, the learned T2I models typically struggle to
adapt to new characters, scenes, and styles, and often lack the flexibility to
revise the layout of the synthesized images. This paper proposes a system for
generic interactive story visualization, capable of handling multiple novel
characters and supporting the editing of layout and local structure. It is
developed by leveraging the prior knowledge of large language and T2I models,
trained on massive corpora. The system comprises four interconnected
components: story-to-prompt generation (S2P), text-to-layout generation (T2L),
controllable text-to-image generation (C-T2I), and image-to-video animation
(I2V). First, the S2P module converts concise story information into detailed
prompts required for subsequent stages. Next, T2L generates diverse and
reasonable layouts based on the prompts, offering users the ability to adjust
and refine the layout to their preference. The core component, C-T2I, enables
the creation of images guided by layouts, sketches, and actor-specific
identifiers to maintain consistency and detail across visualizations. Finally,
I2V enriches the visualization process by animating the generated images.
Extensive experiments and a user study are conducted to validate the
effectiveness and flexibility of interactive editing of the proposed system.
Related papers
- Layout-and-Retouch: A Dual-stage Framework for Improving Diversity in Personalized Image Generation [40.969861849933444]
We propose a novel P-T2I method called Layout-and-Retouch, consisting of two stages: 1) layout generation and 2) retouch.
In the first stage, our step-blended inference utilizes the inherent sample diversity of vanilla T2I models to produce diversified layout images.
In the second stage, multi-source attention swaps the context image from the first stage with the reference image, leveraging the structure from the context image and extracting visual features from the reference image.
arXiv Detail & Related papers (2024-07-13T05:28:45Z) - RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives [58.15403987979496]
This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework.
Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content.
The proposed framework demonstrates impressive versatile capabilities in video-to-paragraph generation, video content editing, and can be incorporated into other SoTA video generative models for further enhancement.
arXiv Detail & Related papers (2024-05-28T17:46:36Z) - Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing [47.421888361871254]
Scene text images contain not only style information (font, background) but also content information (character, texture)
Previous representation learning methods use tightly coupled features for all tasks, resulting in sub-optimal performance.
We propose a Disentangled Representation Learning framework (DARLING) aimed at disentangling these two types of features for improved adaptability.
arXiv Detail & Related papers (2024-05-07T15:00:11Z) - Story Visualization by Online Text Augmentation with Context Memory [64.86944645907771]
We propose a novel memory architecture for the Bi-directional Transformer framework with an online text augmentation.
The proposed method significantly outperforms the state of the arts in various metrics including FID, character F1, frame accuracy, BLEU-2/3, and R-precision.
arXiv Detail & Related papers (2023-08-15T05:08:12Z) - Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [69.20173154096]
We develop a framework comprised of two functional modules, Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis.
For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.
For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters.
arXiv Detail & Related papers (2023-07-13T17:57:13Z) - Character-Centric Story Visualization via Visual Planning and Token
Alignment [53.44760407148918]
Story visualization advances the traditional text-to-image generation by enabling multiple image generation based on a complete story.
Key challenge of consistent story visualization is to preserve characters that are essential in stories.
We propose to adapt a recent work that augments Vector-Quantized Variational Autoencoders with a text-tovisual-token architecture.
arXiv Detail & Related papers (2022-10-16T06:50:39Z) - Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V)
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.