Plot'n Polish: Zero-shot Story Visualization and Disentangled Editing with Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2509.04446v1
- Date: Thu, 04 Sep 2025 17:59:34 GMT
- Title: Plot'n Polish: Zero-shot Story Visualization and Disentangled Editing with Text-to-Image Diffusion Models
- Authors: Kiymet Akdemir, Jing Shi, Kushal Kafle, Brian Price, Pinar Yanardag
- Abstract summary: Plot'n Polish is a zero-shot framework that enables consistent story generation and provides fine-grained control over story visualizations at various levels of detail.
- Score: 17.541927454984318
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image diffusion models have demonstrated significant capabilities in generating diverse and detailed visuals across various domains, and story visualization is emerging as a particularly promising application. However, as their use in real-world creative domains increases, providing enhanced control, refinement, and the ability to modify images post-generation in a consistent manner becomes an important challenge. Existing methods often lack the flexibility to apply fine or coarse edits while maintaining visual and narrative consistency across multiple frames, preventing creators from seamlessly crafting and refining their visual stories. To address these challenges, we introduce Plot'n Polish, a zero-shot framework that enables consistent story generation and provides fine-grained control over story visualizations at various levels of detail.
Related papers
- StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization [31.250596607318364]
Existing approaches, which typically rely on fine-tuning or retraining models, are computationally expensive, time-consuming, and often interfere with the model's pre-existing capabilities.
This paper proposes an efficient consistent-subject-generation method.
Experimental results demonstrate that our approach successfully generates visually consistent subjects across a variety of scenarios.
arXiv Detail & Related papers (2025-07-31T11:24:40Z) - Audit & Repair: An Agentic Framework for Consistent Story Visualization in Text-to-Image Diffusion Models [3.3454373538792552]
We propose a collaborative multi-agent framework that autonomously identifies, corrects, and refines inconsistencies across multi-panel story visualizations.
The agents operate in an iterative loop, enabling fine-grained, panel-level updates without re-generating entire sequences.
arXiv Detail & Related papers (2025-06-23T17:59:29Z) - Conditional Text-to-Image Generation with Reference Guidance [81.99538302576302]
This paper explores conditioning diffusion models on additional reference images that provide visual guidance for the particular subjects to be generated.
We develop several small-scale expert plugins that efficiently endow a Stable Diffusion model with the capability to take different references.
Our expert plugins achieve superior results to existing methods on all tasks, with each plugin containing only 28.55M trainable parameters.
arXiv Detail & Related papers (2024-11-22T21:38:51Z) - ORACLE: Leveraging Mutual Information for Consistent Character Generation with LoRAs in Diffusion Models [3.7599363231894185]
We introduce a novel framework designed to produce consistent character representations from a single text prompt.
Our framework outperforms existing methods in generating characters with consistent visual identities.
arXiv Detail & Related papers (2024-06-04T23:39:08Z) - MagicScroll: Nontypical Aspect-Ratio Image Generation for Visual Storytelling via Multi-Layered Semantic-Aware Denoising [42.20750912837316]
MagicScroll is a progressive diffusion-based image generation framework with a novel semantic-aware denoising process.
It enables fine-grained control over the generated image on object, scene, and background levels with text, image, and layout conditions.
It showcases promising results in aligning with the narrative text, improving visual coherence, and engaging the audience.
arXiv Detail & Related papers (2023-12-18T03:09:05Z) - DiffuVST: Narrating Fictional Scenes with Global-History-Guided Denoising Models [6.668241588219693]
Visual storytelling is increasingly desired beyond real-world imagery.
Current techniques, which typically use autoregressive decoders, suffer from low inference speed and are not well-suited for synthetic scenes.
We propose a novel diffusion-based system DiffuVST, which models a series of visual descriptions as a single conditional denoising process.
arXiv Detail & Related papers (2023-12-12T08:40:38Z) - AutoStory: Generating Diverse Storytelling Images with Minimal Human Effort [55.83007338095763]
We propose an automated story visualization system that can effectively generate diverse, high-quality, and consistent sets of story images.
We utilize the comprehension and planning capabilities of large language models for layout planning, and then leverage large-scale text-to-image models to generate sophisticated story images.
arXiv Detail & Related papers (2023-11-19T06:07:37Z) - Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models [70.86603627188519]
We focus on a novel, yet challenging task of generating a coherent image sequence based on a given storyline, denoted as open-ended visual storytelling.
We propose a learning-based auto-regressive image generation model, termed StoryGen, with a novel vision-language context module.
We show StoryGen can generalize to unseen characters without any optimization, and generate image sequences with coherent content and consistent characters.
arXiv Detail & Related papers (2023-06-01T17:58:50Z) - Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts [116.05656635044357]
We propose a generic video editing framework called Make-A-Protagonist.
Specifically, we leverage multiple experts to parse the source video and the target visual and textual clues, and propose a visual-textual-based video generation model.
Results demonstrate the versatile and remarkable editing capabilities of Make-A-Protagonist.
arXiv Detail & Related papers (2023-05-15T17:59:03Z) - Make-A-Story: Visual Memory Conditioned Consistent Story Generation [57.691064030235985]
We propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context.
Our experiments for story generation on the MUGEN, PororoSV, and FlintstonesSV datasets show that our method not only outperforms prior state-of-the-art in generating frames with high visual quality, but also models appropriate correspondences between the characters and the background.
arXiv Detail & Related papers (2022-11-23T21:38:51Z) - Word-Level Fine-Grained Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story with a global consistency across dynamic scenes and characters.
Current works still struggle with the quality and consistency of output images, and rely on additional semantic information or auxiliary captioning networks.
We first introduce a new sentence representation, which incorporates word information from all story sentences to mitigate the inconsistency problem.
Then, we propose a new discriminator with fusion features to improve image quality and story consistency.
arXiv Detail & Related papers (2022-08-03T21:01:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.