DeCorStory: Gram-Schmidt Prompt Embedding Decorrelation for Consistent Storytelling
- URL: http://arxiv.org/abs/2602.01306v1
- Date: Sun, 01 Feb 2026 16:07:30 GMT
- Title: DeCorStory: Gram-Schmidt Prompt Embedding Decorrelation for Consistent Storytelling
- Authors: Ayushman Sarkar, Zhenyu Yu, Mohd Yamani Idna Idris
- Abstract summary: DeCorStory is a training-free inference-time framework that reduces inter-frame semantic interference. It applies Gram-Schmidt prompt embedding decorrelation to orthogonalize frame-level semantics, followed by singular value reweighting to strengthen prompt-specific information. Experiments demonstrate consistent improvements in prompt-image alignment, identity consistency, and visual diversity.
- Score: 1.7683026013361776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Maintaining visual and semantic consistency across frames is a key challenge in text-to-image storytelling. Existing training-free methods, such as One-Prompt-One-Story, concatenate all prompts into a single sequence, which often induces strong embedding correlation and leads to color leakage, background blending, and identity drift. We propose DeCorStory, a training-free inference-time framework that explicitly reduces inter-frame semantic interference. DeCorStory applies Gram-Schmidt prompt embedding decorrelation to orthogonalize frame-level semantics, followed by singular value reweighting to strengthen prompt-specific information and identity-preserving cross-attention to stabilize character identity during diffusion. The method requires no model modification or fine-tuning and can be seamlessly integrated into existing diffusion pipelines. Experiments demonstrate consistent improvements in prompt-image alignment, identity consistency, and visual diversity, achieving state-of-the-art performance among training-free baselines. Code is available at: https://github.com/YuZhenyuLindy/DeCorStory
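The abstract describes a three-step pipeline: Gram-Schmidt decorrelation of frame-level prompt embeddings, singular value reweighting, and identity-preserving cross-attention. Below is a minimal sketch of what the first two steps could look like on pooled per-frame embeddings; the function names, the pooling assumption, and the reweighting exponent `gamma` are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch (assumed, not the authors' code): decorrelate per-frame prompt
# embeddings with Gram-Schmidt, then rescale singular values to emphasize
# prompt-specific directions over shared ones.
import numpy as np

def gram_schmidt_decorrelate(frame_embeds: np.ndarray) -> np.ndarray:
    """frame_embeds: (num_frames, dim) pooled prompt embeddings, one row per frame."""
    out = np.zeros_like(frame_embeds)
    for i, v in enumerate(frame_embeds):
        u = v.copy()
        # Subtract components along earlier frames so each frame keeps only
        # the semantics not already explained by preceding prompts.
        for j in range(i):
            basis = out[j]
            u -= (u @ basis) / (basis @ basis + 1e-8) * basis
        out[i] = u
    return out

def singular_value_reweight(embeds: np.ndarray, gamma: float = 0.8) -> np.ndarray:
    """Flatten the singular spectrum (gamma is an assumed knob) so dominant
    shared directions are de-emphasized relative to prompt-specific ones."""
    U, S, Vt = np.linalg.svd(embeds, full_matrices=False)
    return (U * (S ** gamma)) @ Vt

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prompts = rng.normal(size=(5, 768))       # e.g. 5 frames, 768-dim embeddings
    decorrelated = gram_schmidt_decorrelate(prompts)
    refined = singular_value_reweight(decorrelated)
    print(refined.shape)                      # (5, 768)
```

In practice the refined embeddings would be fed back into the diffusion model's text conditioning; the identity-preserving cross-attention step operates inside the denoising network and is not sketched here.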
Related papers
- StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives [7.243114047801061]
We propose a zero-shot pipeline that produces temporally coherent, identity-preserving image sequences. StoryTailor delivers expressive interactions and evolving yet stable scenes.
arXiv Detail & Related papers (2026-02-24T16:07:02Z) - ReDiStory: Region-Disentangled Diffusion for Consistent Visual Story Generation [6.4611000755192585]
ReDiStory is a training-free framework that improves multi-frame story generation via inference-time prompt embedding reorganization. It reduces cross-frame interference without modifying diffusion parameters or requiring additional supervision. Experiments on the ConsiStory+ benchmark show consistent gains over 1Prompt1Story on multiple identity consistency metrics.
arXiv Detail & Related papers (2026-02-01T16:04:40Z) - Sissi: Zero-shot Style-guided Image Synthesis via Semantic-style Integration [57.02757226679549]
We introduce a training-free framework that reformulates style-guided synthesis as an in-context learning task. We propose a Dynamic Semantic-Style Integration (DSSI) mechanism that reweights attention between semantic and style visual tokens. Experiments show that our approach achieves high-fidelity stylization with superior semantic-style balance and visual quality.
arXiv Detail & Related papers (2026-01-10T16:01:14Z) - ASemConsist: Adaptive Semantic Feature Control for Training-Free Identity-Consistent Generation [14.341691123354195]
ASemConsist enables explicit semantic control over character identity without sacrificing prompt alignment. Our framework achieves state-of-the-art performance, effectively overcoming prior trade-offs.
arXiv Detail & Related papers (2025-12-29T07:06:57Z) - SGDiff: Scene Graph Guided Diffusion Model for Image Collaborative SegCaptioning [53.638998508418545]
This paper introduces a new task, "Image Collaborative Segmentation and Captioning" (SegCaptioning). SegCaptioning aims to translate a straightforward prompt, like a bounding box around an object, into diverse semantic interpretations represented by (caption, masks) pairs. This task poses significant challenges, including accurately capturing a user's intention from a minimal prompt while simultaneously predicting multiple semantically aligned caption words and masks.
arXiv Detail & Related papers (2025-12-01T18:33:04Z) - Infinite-Story: A Training-Free Consistent Text-to-Image Generation [21.872330710303036]
We present Infinite-Story, a training-free framework for consistent text-to-image (T2I) generation. Our method addresses two key challenges in consistent T2I generation: identity inconsistency and style inconsistency. Our method achieves state-of-the-art generation performance, while offering over 6X faster inference (1.72 seconds per image) than the existing fastest consistent T2I models.
arXiv Detail & Related papers (2025-11-17T05:46:16Z) - ConText: Driving In-context Learning for Text Removal and Segmentation [59.6299939669307]
This paper presents the first study on adapting the visual in-context learning paradigm to optical character recognition tasks. We propose a task-chaining compositor in the form of image-removal-segmentation. We also introduce context-aware aggregation, integrating the chained prompt pattern into the latent query representation.
arXiv Detail & Related papers (2025-06-04T10:06:32Z) - One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt [101.17660804110409]
Text-to-image generation models can create high-quality images from input prompts. However, they struggle to support consistent, identity-preserving generation for storytelling. We propose a novel training-free method for consistent text-to-image generation.
arXiv Detail & Related papers (2025-01-23T10:57:22Z) - Text-Only Training for Visual Storytelling [107.19873669536523]
We formulate visual storytelling as a visual-conditioned story generation problem.
We propose a text-only training method that separates the learning of cross-modality alignment and story generation.
arXiv Detail & Related papers (2023-08-17T09:32:17Z) - FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism.
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-11-09T17:15:38Z)