StoryWeaver: A Unified World Model for Knowledge-Enhanced Story Character Customization
- URL: http://arxiv.org/abs/2412.07375v2
- Date: Tue, 17 Dec 2024 01:06:09 GMT
- Title: StoryWeaver: A Unified World Model for Knowledge-Enhanced Story Character Customization
- Authors: Jinlu Zhang, Jiji Tang, Rongsheng Zhang, Tangjie Lv, Xiaoshuai Sun,
- Abstract summary: We propose a novel knowledge graph, namely Character Graph (CG), which comprehensively represents various story-related knowledge.
We then introduce StoryWeaver, an image generator that achieves Customization via Character Graph (C-CG), capable of consistent story visualization with rich text semantics.
- Score: 36.14275850149665
- Abstract: Story visualization has gained increasing attention in artificial intelligence. However, existing methods still struggle to balance character identity preservation with text-semantics alignment, largely due to a lack of detailed semantic modeling of the story scene. To tackle this challenge, we propose a novel knowledge graph, namely Character Graph (CG), which comprehensively represents various story-related knowledge, including the characters, the attributes related to characters, and the relationships between characters. We then introduce StoryWeaver, an image generator that achieves Customization via Character Graph (C-CG), capable of consistent story visualization with rich text semantics. To further improve multi-character generation performance, we incorporate knowledge-enhanced spatial guidance (KE-SG) into StoryWeaver to precisely inject character semantics into generation. To validate the effectiveness of our proposed method, extensive experiments are conducted on a new benchmark called TBC-Bench. The experiments confirm that our StoryWeaver excels not only at creating vivid visual story plots but also at accurately conveying character identities across various scenarios with considerable storage efficiency, e.g., achieving an average increase of +9.03% in DINO-I and +13.44% in CLIP-T. Furthermore, ablation experiments verify the superiority of the proposed modules. Codes and datasets are released at https://github.com/Aria-Zhangjl/StoryWeaver.
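To make the Character Graph idea concrete, here is a minimal Python sketch of such a structure, with character nodes, attribute dictionaries, and typed relation edges. The class names, fields, and the `prompt_context` linearization are illustrative assumptions, not the authors' actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class CharacterNode:
    name: str
    attributes: dict = field(default_factory=dict)   # e.g. {"hair": "red"}

@dataclass
class CharacterGraph:
    characters: dict = field(default_factory=dict)   # name -> CharacterNode
    relations: list = field(default_factory=list)    # (head, relation, tail) triples

    def add_character(self, name, **attributes):
        self.characters[name] = CharacterNode(name, dict(attributes))

    def add_relation(self, head, relation, tail):
        self.relations.append((head, relation, tail))

    def prompt_context(self):
        """Linearize the graph into text that can augment a scene prompt."""
        parts = [
            f"{c.name} ({', '.join(f'{k}: {v}' for k, v in c.attributes.items())})"
            for c in self.characters.values()
        ]
        parts += [f"{h} {r} {t}" for h, r, t in self.relations]
        return "; ".join(parts)

# Hypothetical characters, purely for illustration.
cg = CharacterGraph()
cg.add_character("Nancy", hair="blonde", outfit="red coat")
cg.add_character("Rex", species="dog", fur="brown")
cg.add_relation("Nancy", "is the owner of", "Rex")
print(cg.prompt_context())
```

One plausible use of such a linearization is to enrich each scene's text prompt with the relevant character attributes and relations, so the generator sees consistent character descriptions across frames.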
Related papers
- Storynizor: Consistent Story Generation via Inter-Frame Synchronized and Shuffled ID Injection [27.412361280397057]
We introduce Storynizor, a model capable of generating coherent stories with strong inter-frame character consistency.
The key innovation of Storynizor lies in its two core modules: the ID-Synchronizer and the ID-Injector.
To facilitate the training of Storynizor, we have curated a novel dataset called StoryDB comprising 100,000 images.
arXiv Detail & Related papers (2024-09-29T09:15:51Z)
- Generating Visual Stories with Grounded and Coreferent Characters [63.07511918366848]
We present the first model capable of predicting visual stories with consistently grounded and coreferent character mentions.
Our model is finetuned on a new dataset which we build on top of the widely used VIST benchmark.
We also propose new evaluation metrics to measure the richness of characters and coreference in stories.
arXiv Detail & Related papers (2024-09-20T14:56:33Z)
- Evolving Storytelling: Benchmarks and Methods for New Character Customization with Diffusion Models [79.21968152209193]
We introduce the NewEpisode benchmark to evaluate generative models' adaptability in generating new stories with fresh characters.
We propose EpicEvo, a method that customizes a diffusion-based visual story generation model with a single story featuring the new characters, seamlessly integrating them into established character dynamics.
arXiv Detail & Related papers (2024-05-20T07:54:03Z)
- Text-Only Training for Visual Storytelling [107.19873669536523]
We formulate visual storytelling as a visual-conditioned story generation problem.
We propose a text-only training method that separates the learning of cross-modality alignment and story generation.
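To illustrate the separation, here is a minimal sketch assuming a CLIP-style shared embedding space: the story decoder is trained on text embeddings alone and, because the space is shared, can consume image embeddings at inference. The `StoryDecoder` module, its dimensions, and the random stand-in features are hypothetical, not the paper's architecture.

```python
import torch
import torch.nn as nn

class StoryDecoder(nn.Module):
    """Maps a sequence of shared-space embeddings to story-token logits."""

    def __init__(self, embed_dim=512, vocab_size=32000):
        super().__init__()
        self.proj = nn.Linear(embed_dim, embed_dim)
        self.decoder = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.head = nn.Linear(embed_dim, vocab_size)

    def forward(self, clip_embeds):          # (B, T, D): one embedding per caption/image
        h, _ = self.decoder(self.proj(clip_embeds))
        return self.head(h)                  # (B, T, vocab)

decoder = StoryDecoder()

# Training: condition on *text* embeddings only (no paired images needed).
text_embeds = torch.randn(2, 5, 512)         # stand-in for CLIP text features
logits = decoder(text_embeds)

# Inference: swap in *image* embeddings from the same shared space.
image_embeds = torch.randn(2, 5, 512)        # stand-in for CLIP image features
story_logits = decoder(image_embeds)
```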
arXiv Detail & Related papers (2023-08-17T09:32:17Z)
- Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models [70.86603627188519]
We focus on the novel yet challenging task of generating a coherent image sequence from a given storyline, denoted as open-ended visual storytelling.
We propose a learning-based auto-regressive image generation model, termed StoryGen, with a novel vision-language context module.
We show StoryGen can generalize to unseen characters without any optimization, and generate image sequences with coherent content and consistent characters.
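As a rough illustration of the autoregressive setup, the sketch below generates each frame from the current sentence plus the features of previously generated frames; `encode_text` and `generate_frame` are toy stand-ins, not StoryGen's actual components.

```python
import torch

def encode_text(sentence):
    # Stand-in text encoder: a fixed pseudo-random feature per sentence.
    torch.manual_seed(abs(hash(sentence)) % (2**31))
    return torch.randn(1, 512)

def generate_frame(text_feat, context_feats):
    # Stand-in generator: a real model would be a diffusion UNet cross-attending
    # to both the sentence feature and the visual context features.
    context = torch.stack(context_feats).mean(0) if context_feats else torch.zeros(1, 512)
    return text_feat + 0.5 * context         # "frame" represented as a feature vector

storyline = ["A fox finds a key.", "The fox opens an old door.", "Inside, a library glows."]
frames, context = [], []
for sentence in storyline:
    t = encode_text(sentence)
    frame = generate_frame(t, context)       # condition on prior frames for consistency
    frames.append(frame)
    context.append(frame)                    # autoregressive: new frame joins the context
```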
arXiv Detail & Related papers (2023-06-01T17:58:50Z)
- Detecting and Grounding Important Characters in Visual Stories [18.870236356616907]
We introduce the VIST-Character dataset, which provides rich character-centric annotations.
Based on this dataset, we propose two new tasks: important character detection and character grounding in visual stories.
We develop simple, unsupervised models based on distributional similarity and pre-trained vision-and-language models.
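A minimal sketch of the grounding step, assuming character-mention and image-region features live in one CLIP-like vision-language space; the random features below are stand-ins for real encoder outputs, not the paper's pipeline.

```python
import torch
import torch.nn.functional as F

mention_feat = F.normalize(torch.randn(1, 512), dim=-1)   # text feature of "the old sailor"
region_feats = F.normalize(torch.randn(8, 512), dim=-1)   # features of 8 detected regions

scores = region_feats @ mention_feat.T                    # cosine similarity per region
best_region = scores.argmax().item()                      # unsupervised grounding choice
print(f"mention grounded to region {best_region} (score {scores.max().item():.3f})")
```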
arXiv Detail & Related papers (2023-03-30T18:24:06Z)
- Make-A-Story: Visual Memory Conditioned Consistent Story Generation [57.691064030235985]
We propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context.
Our experiments on the MUGEN, PororoSV, and FlintstonesSV datasets show that our method not only outperforms the prior state of the art in generating frames with high visual quality, but also models appropriate correspondences between the characters and the background.
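As a rough sketch of such a memory read, the snippet below lets the current frame's latent attend over stored latents of earlier frames; the module and shapes are illustrative assumptions, not Make-A-Story's actual design.

```python
import torch
import torch.nn as nn

memory = torch.randn(1, 4, 512)              # latents of 4 previously generated frames
current = torch.randn(1, 64, 512)            # latent tokens of the frame being generated

attn = nn.MultiheadAttention(512, 8, batch_first=True)
context, _ = attn(current, memory, memory)   # read actor/background context from memory
current = current + context                  # condition the current frame on that context
```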
arXiv Detail & Related papers (2022-11-23T21:38:51Z)
- StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation [76.44802273236081]
We develop StoryDALL-E, a model for story continuation, where the generated visual story is conditioned on a source image.
We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image.
Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
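To illustrate the retro-fitting pattern, here is a minimal sketch that freezes a pretrained attention pathway and adds a new trainable cross-attention over source-image features; `RetroBlock` and its wiring are assumptions for illustration, not StoryDALL-E's code.

```python
import torch
import torch.nn as nn

class RetroBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.pretrained_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.source_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # new
        for p in self.pretrained_attn.parameters():
            p.requires_grad = False          # keep the pretrained pathway frozen

    def forward(self, x, text_feats, source_feats):
        x = x + self.pretrained_attn(x, text_feats, text_feats)[0]   # original pathway
        x = x + self.source_attn(x, source_feats, source_feats)[0]   # copy from source image
        return x

block = RetroBlock()
x = torch.randn(1, 64, 512)                  # image tokens being generated
out = block(x, torch.randn(1, 16, 512), torch.randn(1, 64, 512))
```

Training only the new cross-attention parameters keeps the adaptation cheap while preserving the pretrained model's generation ability, which is the gist of the retro-fitting idea described above.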
arXiv Detail & Related papers (2022-09-13T17:47:39Z)