Consistent Story Generation with Asymmetry Zigzag Sampling
- URL: http://arxiv.org/abs/2506.09612v3
- Date: Wed, 02 Jul 2025 20:03:42 GMT
- Title: Consistent Story Generation with Asymmetry Zigzag Sampling
- Authors: Mingxiao Li, Mang Ning, Marie-Francine Moens
- Abstract summary: We introduce a novel training-free sampling strategy called Zigzag Sampling with Asymmetric Prompts and Visual Sharing. Our approach combines a zigzag sampling mechanism with asymmetric prompting to retain subject characteristics and a visual sharing module that transfers visual cues across generated images. Our method significantly outperforms previous approaches in generating coherent and consistent visual stories.
- Score: 24.504304503689866
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image generation models have made significant progress in producing high-quality images from textual descriptions, yet they continue to struggle with maintaining subject consistency across multiple images, a fundamental requirement for visual storytelling. Existing methods attempt to address this by either fine-tuning models on large-scale story visualization datasets, which is resource-intensive, or by using training-free techniques that share information across generations, which still yield limited success. In this paper, we introduce a novel training-free sampling strategy called Zigzag Sampling with Asymmetric Prompts and Visual Sharing to enhance subject consistency in visual story generation. Our approach employs a zigzag sampling mechanism that alternates denoising steps under asymmetric prompts to retain subject characteristics, while a visual sharing module transfers visual cues across generated images to further enforce consistency. Experimental results, based on both quantitative metrics and qualitative evaluations, demonstrate that our method significantly outperforms previous approaches in generating coherent and consistent visual stories. The code is available at https://github.com/Mingxiao-Li/Asymmetry-Zigzag-StoryDiffusion.
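To make the sampling idea concrete, below is a minimal Python sketch of how a zigzag loop with asymmetric prompts and a shared visual bank could be organized. It is an illustration only: `model.denoise_step`, `model.invert_step`, and the `visual_bank` argument are hypothetical names standing in for a diffusion backbone's update functions and a visual-sharing hook, not the API released in the authors' repository.

```python
import torch

def zigzag_sample(model, scheduler, identity_prompt_emb, plain_prompt_emb,
                  visual_bank, num_steps=50, zigzag_every=5):
    """Sketch of zigzag sampling with asymmetric prompts and visual sharing.

    Assumptions (not the authors' released API): `model.denoise_step` performs one
    reverse-diffusion update, `model.invert_step` re-noises the latent by one step,
    and `visual_bank` is a dict that a visual-sharing hook reads/writes attention
    features into so they can be reused across frames of the same story.
    """
    latents = torch.randn(1, 4, 64, 64)  # typical latent-diffusion shape (assumed)
    for i, t in enumerate(scheduler.timesteps[:num_steps]):
        # Forward (denoising) step conditioned on the identity-rich prompt.
        latents = model.denoise_step(latents, t, identity_prompt_emb,
                                     visual_bank=visual_bank)
        if i > 0 and i % zigzag_every == 0:
            # Zigzag: re-noise one step under the plain, frame-only prompt, then
            # denoise again under the identity-rich prompt. The asymmetry between
            # the two prompts is what nudges the latent toward the shared subject.
            latents = model.invert_step(latents, t, plain_prompt_emb)
            latents = model.denoise_step(latents, t, identity_prompt_emb,
                                         visual_bank=visual_bank)
    return latents
```

Under these assumptions, the zigzag pattern inserts periodic denoise/re-noise round trips, so the contrast between the identity-rich and plain prompts repeatedly pulls the latent toward the shared subject while the visual bank carries features across frames.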
Related papers
- StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization [31.250596607318364]
Existing approaches, which typically rely on fine-tuning or retraining models, are computationally expensive, time-consuming, and often interfere with the model's pre-existing capabilities. This paper proposes an efficient consistent-subject-generation method. Experimental results demonstrate that our approach successfully generates visually consistent subjects across a variety of scenarios.
arXiv Detail & Related papers (2025-07-31T11:24:40Z) - WAVE: Warp-Based View Guidance for Consistent Novel View Synthesis Using a Single Image [3.4248731707266264]
This paper proposes a novel view-consistent image generation method which utilizes diffusion models without additional modules. Our key idea is to enhance diffusion models with a training-free method that enables adaptive attention manipulation and noise reinitialization. Our method improves view consistency across various diffusion models, demonstrating its broader applicability.
arXiv Detail & Related papers (2025-06-30T05:00:47Z) - Unified Autoregressive Visual Generation and Understanding with Continuous Tokens [52.21981295470491]
We present UniFluid, a unified autoregressive framework for joint visual generation and understanding. Our unified autoregressive architecture processes multimodal image and text inputs, generating discrete tokens for text and continuous tokens for images. We find that, although there is an inherent trade-off between the image generation and understanding tasks, a carefully tuned training recipe enables them to improve each other.
arXiv Detail & Related papers (2025-03-17T17:58:30Z) - ZePo: Zero-Shot Portrait Stylization with Faster Sampling [61.14140480095604]
This paper presents an inversion-free portrait stylization framework based on diffusion models that accomplishes content and style feature fusion in merely four sampling steps.
We propose a feature merging strategy to amalgamate redundant features in Consistency Features, thereby reducing the computational load of attention control.
arXiv Detail & Related papers (2024-08-10T08:53:41Z) - Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling [49.41822427811098]
We present Kaleido, a novel approach that enhances the diversity of samples by incorporating autoregressive latent priors.
Kaleido integrates an autoregressive language model that encodes the original caption and generates latent variables.
We show that Kaleido adheres closely to the guidance provided by the generated latent variables, demonstrating its capability to effectively control and direct the image generation process.
arXiv Detail & Related papers (2024-05-31T17:41:11Z) - OneActor: Consistent Character Generation via Cluster-Conditioned Guidance [29.426558840522734]
We propose a novel one-shot tuning paradigm, termed OneActor.
It efficiently performs consistent subject generation solely driven by prompts.
Our method is capable of multi-subject generation and compatible with popular diffusion extensions.
arXiv Detail & Related papers (2024-04-16T03:45:45Z) - Masked Generative Story Transformer with Character Guidance and Caption Augmentation [2.1392064955842023]
Story visualization is a challenging generative vision task, that requires both visual quality and consistency between different frames in generated image sequences.
Previous approaches either employ some kind of memory mechanism to maintain context throughout an auto-regressive generation of the image sequence, or model the generation of the characters and their background separately.
We propose a completely parallel transformer-based approach, relying on Cross-Attention with past and future captions to achieve consistency.
arXiv Detail & Related papers (2024-03-13T13:10:20Z) - StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation [76.44802273236081]
We develop a model StoryDALL-E for story continuation, where the generated visual story is conditioned on a source image.
We show that our retro-fitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image.
Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
arXiv Detail & Related papers (2022-09-13T17:47:39Z) - Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.