Related papers: Storynizor: Consistent Story Generation via Inter-Frame Synchronized and Shuffled ID Injection

URL: http://arxiv.org/abs/2409.19624v1
Date: Sun, 29 Sep 2024 09:15:51 GMT
Title: Storynizor: Consistent Story Generation via Inter-Frame Synchronized and Shuffled ID Injection
Authors: Yuhang Ma, Wenting Xu, Chaoyi Zhao, Keqiang Sun, Qinfeng Jin, Zeng Zhao, Changjie Fan, Zhipeng Hu
Abstract summary: We introduce Storynizor, a model capable of generating coherent stories with strong inter-frame character consistency. The core innovation of Storynizor lies in its key modules: ID-Synchronizer and ID-Injector. To facilitate the training of Storynizor, we have curated a novel dataset called StoryDB comprising 100,000 images.
Score: 27.412361280397057
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in text-to-image diffusion models have spurred significant interest in continuous story image generation. In this paper, we introduce Storynizor, a model capable of generating coherent stories with strong inter-frame character consistency, effective foreground-background separation, and diverse pose variation. The core innovation of Storynizor lies in its key modules: ID-Synchronizer and ID-Injector. The ID-Synchronizer employs an auto-mask self-attention module and a mask perceptual loss across inter-frame images to improve the consistency of character generation, vividly representing their postures and backgrounds. The ID-Injector utilizes a Shuffling Reference Strategy (SRS) to integrate ID features into specific locations, enhancing ID-based consistent character generation. Additionally, to facilitate the training of Storynizor, we have curated a novel dataset called StoryDB comprising 100,000 images. This dataset contains single and multiple-character sets in diverse environments, layouts, and gestures with detailed descriptions. Experimental results indicate that Storynizor demonstrates superior coherent story generation with high-fidelity character consistency, flexible postures, and vivid backgrounds compared to other character-specific methods.

Related papers

Optimizing ID Consistency in Multimodal Large Models: Facial Restoration via Alignment, Entanglement, and Disentanglement [54.199726425201895]
Multimodal editing large models have demonstrated powerful editing capabilities across diverse tasks. Current facial ID preservation methods struggle to achieve consistent restoration of both facial identity and edited element IP. We propose EditedID, an Alignment-Disentanglement-Entanglement framework for robust identity-specific facial restoration.
arXiv Detail & Related papers (2026-02-21T08:24:42Z)
DreamID-V: Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer [21.788582116033684]
Video Face Swapping (VFS) requires seamlessly injecting a source identity into a target video. Existing methods struggle to maintain identity similarity and attribute preservation while preserving temporal consistency. We propose a comprehensive framework to seamlessly transfer the superiority of Image Face Swapping to the video domain.
arXiv Detail & Related papers (2026-01-04T08:07:11Z)
IdentityStory: Taming Your Identity-Preserving Generator for Human-Centric Story Generation [75.09818147405898]
IdentityStory is a framework for human-centric story generation that ensures consistent character identity across sequential images. By taming identity-preserving generators, the framework features two key components: Iterative Identity Discovery and Re-denoising Identity Injection.
arXiv Detail & Related papers (2025-12-29T14:54:44Z)
Audit & Repair: An Agentic Framework for Consistent Story Visualization in Text-to-Image Diffusion Models [3.3454373538792552]
We propose a collaborative multi-agent framework that autonomously identifies, corrects, and refines inconsistencies across multi-panel story visualizations. The agents operate in an iterative loop, enabling fine-grained, panel-level updates without re-generating entire sequences.
arXiv Detail & Related papers (2025-06-23T17:59:29Z)
Identity-Preserving Text-to-Image Generation via Dual-Level Feature Decoupling and Expert-Guided Fusion [35.67333978414322]
We propose a novel framework that improves the separation of identity-related and identity-unrelated features. Our framework consists of two key components: an Implicit-Explicit foreground-background Decoupling Module and a Feature Fusion Module.
arXiv Detail & Related papers (2025-05-28T13:40:46Z)
PIDiff: Image Customization for Personalized Identities with Diffusion Models [13.726194815227464]
We propose a novel fine-tuning-based diffusion model for personalized identities text-to-image generation, named PIDiff. PIDiff avoids semantic entanglement and achieves accurate feature extraction and localization.
arXiv Detail & Related papers (2025-05-08T09:26:28Z)
Bringing Characters to New Stories: Training-Free Theme-Specific Image Generation via Dynamic Visual Prompting [71.29100512700064]
We present T-Prompter, a training-free method for theme-specific image generation. T-Prompter integrates reference images into generative models, allowing users to seamlessly specify the target theme. Our approach enables consistent story generation, character design, realistic character generation, and style-guided image generation.
arXiv Detail & Related papers (2025-01-26T19:01:19Z)
StoryWeaver: A Unified World Model for Knowledge-Enhanced Story Character Customization [36.14275850149665]
We propose a novel knowledge graph, namely Character Graph (CG), which comprehensively represents various story-related knowledge. We then introduce StoryWeaver, an image generator that achieves Customization via Character Graph (C-CG), capable of consistent story visualization with rich text semantics.
arXiv Detail & Related papers (2024-12-10T10:16:50Z)
Evolving Storytelling: Benchmarks and Methods for New Character Customization with Diffusion Models [79.21968152209193]
We introduce the NewEpisode benchmark to evaluate generative models' adaptability in generating new stories with fresh characters. We propose EpicEvo, a method that customizes a diffusion-based visual story generation model with a single story featuring the new characters, seamlessly integrating them into established character dynamics.
arXiv Detail & Related papers (2024-05-20T07:54:03Z)
InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation [0.0]
"InstantFamily" is an approach that employs a novel cross-attention mechanism and a multimodal embedding stack to achieve zero-shot multi-ID image generation. Our method effectively preserves ID as it utilizes global and local features from a pre-trained face recognition model integrated with text conditions.
arXiv Detail & Related papers (2024-04-30T10:16:21Z)
Infinite-ID: Identity-preserved Personalization via ID-semantics Decoupling Paradigm [31.06269858216316]
We propose Infinite-ID, an ID-semantics decoupling paradigm for identity-preserved personalization. We introduce an identity-enhanced training, incorporating an additional image cross-attention module to capture sufficient ID information. We also introduce a feature interaction mechanism that combines a mixed attention module with an AdaIN-mean operation to seamlessly merge the two streams.
arXiv Detail & Related papers (2024-03-18T13:39:53Z)
Masked Generative Story Transformer with Character Guidance and Caption Augmentation [2.1392064955842023]
Story visualization is a challenging generative vision task that requires both visual quality and consistency between different frames in generated image sequences. Previous approaches either employ some kind of memory mechanism to maintain context throughout an auto-regressive generation of the image sequence, or model the generation of the characters and their background separately. We propose a completely parallel transformer-based approach, relying on Cross-Attention with past and future captions to achieve consistency.
arXiv Detail & Related papers (2024-03-13T13:10:20Z)
Make-A-Story: Visual Memory Conditioned Consistent Story Generation [57.691064030235985]
We propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context. Our experiments for story generation on the MUGEN, PororoSV, and FlintstonesSV datasets show that our method not only outperforms prior state-of-the-art in generating frames with high visual quality, but also models appropriate correspondences between the characters and the background.
arXiv Detail & Related papers (2022-11-23T21:38:51Z)
Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story. Current works face the problem of semantic misalignment because of their fixed architecture and diversity of input modalities. We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z)
StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation [76.44802273236081]
We develop a model StoryDALL-E for story continuation, where the generated visual story is conditioned on a source image. We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image. Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
arXiv Detail & Related papers (2022-09-13T17:47:39Z)
Word-Level Fine-Grained Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story with a global consistency across dynamic scenes and characters. Current works still struggle with output images' quality and consistency, and rely on additional semantic information or auxiliary captioning networks. We first introduce a new sentence representation, which incorporates word information from all story sentences to mitigate the inconsistency problem. Then, we propose a new discriminator with fusion features to improve image quality and story consistency.
arXiv Detail & Related papers (2022-08-03T21:01:47Z)
Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions. Prior work has introduced recurrent generative models which outperform synthesis text-to-image models on this task. We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.