TaleDiffusion: Multi-Character Story Generation with Dialogue Rendering
- URL: http://arxiv.org/abs/2509.04123v1
- Date: Thu, 04 Sep 2025 11:37:06 GMT
- Title: TaleDiffusion: Multi-Character Story Generation with Dialogue Rendering
- Authors: Ayan Banerjee, Josep Lladós, Umapada Pal, Anjan Dutta
- Abstract summary: We introduce TaleDiffusion, a novel framework for generating multi-character stories with an iterative process. Given a story, we use a pre-trained LLM to generate per-frame descriptions, character details, and dialogues. We then apply an identity-consistent self-attention mechanism to ensure character consistency across frames and region-aware cross-attention for precise object placement.
- Score: 13.076013597876623
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Text-to-story visualization is challenging due to the need for consistent interaction among multiple characters across frames. Existing methods struggle with character consistency, leading to artifact generation and inaccurate dialogue rendering, which results in disjointed storytelling. In response, we introduce TaleDiffusion, a novel framework that generates multi-character stories through an iterative process, maintains character consistency, and assigns dialogue accurately via postprocessing. Given a story, we use a pre-trained LLM to generate per-frame descriptions, character details, and dialogues via in-context learning, followed by a bounded attention-based per-box mask technique to control character interactions and minimize artifacts. We then apply an identity-consistent self-attention mechanism to ensure character consistency across frames and region-aware cross-attention for precise object placement. Dialogues are also rendered as bubbles and assigned to characters via CLIPSeg. Experimental results demonstrate that TaleDiffusion outperforms existing methods in consistency, noise reduction, and dialogue rendering.
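Two of the mechanisms above are concrete enough to sketch. First, the bounded per-box mask: a minimal, hypothetical implementation, assuming a flattened h x w latent grid, normalized character boxes, and known token spans for each character in the prompt. The function names, shapes, and single-head simplification are illustrative, not the paper's code.

```python
import torch

def per_box_attention_mask(boxes, token_spans, h, w, n_tokens):
    """Binary mask restricting each character's prompt tokens to that
    character's bounding box on an h x w latent grid. `boxes` are
    normalized (x0, y0, x1, y1); `token_spans` are (start, end) indices
    of each character's tokens in the prompt (hypothetical format).
    Returns (h*w, n_tokens); 1 = attention allowed."""
    mask = torch.ones(h * w, n_tokens)  # background tokens stay global
    for (x0, y0, x1, y1), (t0, t1) in zip(boxes, token_spans):
        region = torch.zeros(h, w)
        region[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
        # Bounding attention this way keeps character A's tokens from
        # bleeding into character B's region, a common artifact source.
        mask[:, t0:t1] = region.reshape(-1, 1)
    return mask

def region_aware_cross_attention(q, k, v, mask):
    """Plain single-head cross-attention (no batch dim, for clarity)
    with the box mask applied as a -inf bias before softmax."""
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(mask == 0, float("-inf"))
    return scores.softmax(dim=-1) @ v
```

Second, dialogue assignment via CLIPSeg. The `transformers` CLIPSeg classes below are the library's real API; the `assign_bubble` helper and its mean-activation heuristic are assumptions about how bubble-to-character assignment might work, not the paper's procedure.

```python
from PIL import Image
import torch
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

def assign_bubble(frame: Image.Image, character_prompts: list[str]) -> int:
    """Return the index of the character whose CLIPSeg mask responds
    most strongly to `frame` (assumed heuristic for speaker placement)."""
    inputs = processor(text=character_prompts,
                       images=[frame] * len(character_prompts),
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # (n_prompts, 352, 352)
    if logits.dim() == 2:                # single-prompt edge case
        logits = logits.unsqueeze(0)
    # Highest mean activation wins; the winning mask's centroid could
    # then anchor the speech-bubble tail near that character.
    return int(logits.sigmoid().mean(dim=(1, 2)).argmax())
```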
Related papers
- Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts [20.281732318265483]
We present a modular pipeline that transforms action-level prompts into visually and auditorily grounded narrative dialogue. Our method takes as input a pair of prompts per scene, where the first defines the setting and the second specifies a character's behavior. We render each utterance as expressive, character-conditioned speech, resulting in fully-voiced, multimodal video narratives.
arXiv Detail & Related papers (2025-05-22T15:54:42Z)
- VisAgent: Narrative-Preserving Story Visualization Framework [5.86192577938549]
VisAgent is a training-free framework designed to comprehend and visualize pivotal scenes within a given story. By considering story distillation, semantic consistency, and contextual coherence, VisAgent employs an agentic workflow. Empirical validation confirms the framework's suitability for practical story visualization applications.
arXiv Detail & Related papers (2025-03-04T08:41:45Z)
- ContextualStory: Consistent Visual Storytelling with Spatially-Enhanced and Storyline Context [50.572907418430155]
ContextualStory is a framework designed to generate coherent story frames and extend frames for visual storytelling. We introduce a Storyline Contextualizer to enrich context in storyline embedding, and a StoryFlow Adapter to measure scene changes between frames. Experiments on the PororoSV and FlintstonesSV datasets demonstrate that ContextualStory significantly outperforms existing SOTA methods in both story visualization and continuation.
arXiv Detail & Related papers (2024-07-13T05:02:42Z)
- LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts [60.54912319612113]
Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts.
We present a novel approach leveraging Large Language Models (LLMs) to extract critical components from text prompts.
Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models; a toy sketch of the extraction step follows this entry.
arXiv Detail & Related papers (2023-10-16T17:57:37Z)
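To make the LLM Blueprint idea concrete, a toy sketch of the extraction step under stated assumptions: the prompt wording, JSON schema, and the caller-supplied `llm` callable are all illustrative, not the paper's actual blueprint format.

```python
import json

LAYOUT_PROMPT = """Decompose the scene description into a JSON list of
objects, each with a short "description" and a normalized bounding
"box" [x0, y0, x1, y1]. Return only JSON.
Scene: {scene}"""

def extract_layout(scene: str, llm) -> list[dict]:
    """`llm` is any caller-supplied prompt -> str function (e.g. a chat
    client). Each returned box could then condition a region-restricted
    generation pass, one object at a time."""
    return json.loads(llm(LAYOUT_PROMPT.format(scene=scene)))
```

Splitting a long prompt into per-object components this way is presumably where the reported recall gain on multi-object prompts comes from.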
- Text-Only Training for Visual Storytelling [107.19873669536523]
We formulate visual storytelling as a visual-conditioned story generation problem.
We propose a text-only training method that separates the learning of cross-modality alignment and story generation.
arXiv Detail & Related papers (2023-08-17T09:32:17Z)
- Make-A-Story: Visual Memory Conditioned Consistent Story Generation [57.691064030235985]
We propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context.
Our experiments on the MUGEN, PororoSV, and FlintstonesSV datasets show that our method not only outperforms prior state-of-the-art in generating frames with high visual quality, but also models appropriate correspondences between the characters and the background.
arXiv Detail & Related papers (2022-11-23T21:38:51Z)
- Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works face the problem of semantic misalignment because of their fixed architectures and the diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z)
- A Benchmark for Understanding and Generating Dialogue between Characters in Stories [75.29466820496913]
We present the first study to explore whether machines can understand and generate dialogue in stories.
We propose two new tasks: Masked Dialogue Generation and Dialogue Speaker Recognition.
We show the difficulty of the proposed tasks by testing existing models with automatic and manual evaluation on DialStory.
arXiv Detail & Related papers (2022-09-18T10:19:04Z)
- Local Contextual Attention with Hierarchical Structure for Dialogue Act Recognition [14.81680798372891]
We design a hierarchical model based on self-attention to capture intra-sentence and inter-sentence information.
Based on the finding that dialogue length affects performance, we introduce a new dialogue segmentation mechanism; a minimal sketch of the hierarchical encoder follows this entry.
arXiv Detail & Related papers (2020-03-12T22:26:11Z)
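As a rough illustration of the hierarchical design described in the last entry (an assumed architecture, not the authors' exact model): one self-attention encoder attends over tokens within each utterance, a second attends over the resulting utterance vectors across the dialogue, and a linear head maps each contextualized utterance to dialogue-act logits.

```python
import torch
import torch.nn as nn

class HierarchicalDialogueEncoder(nn.Module):
    """Two-level self-attention sketch: intra-sentence over tokens,
    then inter-sentence over per-utterance vectors."""
    def __init__(self, vocab_size, d_model=128, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        make = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.intra = nn.TransformerEncoder(make(), num_layers=2)  # within utterances
        self.inter = nn.TransformerEncoder(make(), num_layers=2)  # across utterances

    def forward(self, dialog_tokens):                # (n_utts, seq_len) token ids
        tok = self.intra(self.embed(dialog_tokens))  # attend within each utterance
        utt = tok.mean(dim=1).unsqueeze(0)           # pool to (1, n_utts, d_model)
        return self.inter(utt).squeeze(0)            # (n_utts, d_model) contextual states

# Usage: 5 utterances of 12 tokens each -> one act prediction per utterance.
enc = HierarchicalDialogueEncoder(vocab_size=30000)
states = enc(torch.randint(0, 30000, (5, 12)))       # (5, 128)
act_logits = nn.Linear(128, 10)(states)              # 10 hypothetical act classes
```

The segmentation mechanism the entry mentions would then bound how many utterances the inter-sentence encoder sees at once, presumably keeping dialogues in the length range where the model performs well.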
This list is automatically generated from the titles and abstracts of the papers on this site.