DeCorStory: Gram-Schmidt Prompt Embedding Decorrelation for Consistent Storytelling
- URL: http://arxiv.org/abs/2602.01306v1
- Date: Sun, 01 Feb 2026 16:07:30 GMT
- Title: DeCorStory: Gram-Schmidt Prompt Embedding Decorrelation for Consistent Storytelling
- Authors: Ayushman Sarkar, Zhenyu Yu, Mohd Yamani Idna Idris
- Abstract summary: DeCorStory is a training-free inference-time framework that reduces inter-frame semantic interference. It applies Gram-Schmidt prompt embedding decorrelation to orthogonalize frame-level semantics, followed by singular value reweighting to strengthen prompt-specific information. Experiments demonstrate consistent improvements in prompt-image alignment, identity consistency, and visual diversity.
- Score: 1.7683026013361776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Maintaining visual and semantic consistency across frames is a key challenge in text-to-image storytelling. Existing training-free methods, such as One-Prompt-One-Story, concatenate all prompts into a single sequence, which often induces strong embedding correlation and leads to color leakage, background blending, and identity drift. We propose DeCorStory, a training-free inference-time framework that explicitly reduces inter-frame semantic interference. DeCorStory applies Gram-Schmidt prompt embedding decorrelation to orthogonalize frame-level semantics, followed by singular value reweighting to strengthen prompt-specific information and identity-preserving cross-attention to stabilize character identity during diffusion. The method requires no model modification or fine-tuning and can be seamlessly integrated into existing diffusion pipelines. Experiments demonstrate consistent improvements in prompt-image alignment, identity consistency, and visual diversity, achieving state-of-the-art performance among training-free baselines. Code is available at: https://github.com/YuZhenyuLindy/DeCorStory
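The abstract describes a three-step pipeline: Gram-Schmidt decorrelation of frame-level prompt embeddings, singular value reweighting, and identity-preserving cross-attention. Below is a minimal sketch of what the first two steps could look like on pooled per-frame embeddings; the function names, the pooling assumption, and the reweighting exponent `gamma` are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch (assumed, not the authors' code): decorrelate per-frame prompt
# embeddings with Gram-Schmidt, then rescale singular values to emphasize
# prompt-specific directions over shared ones.
import numpy as np

def gram_schmidt_decorrelate(frame_embeds: np.ndarray) -> np.ndarray:
    """frame_embeds: (num_frames, dim) pooled prompt embeddings, one row per frame."""
    out = np.zeros_like(frame_embeds)
    for i, v in enumerate(frame_embeds):
        u = v.copy()
        # Subtract components along earlier frames so each frame keeps only
        # the semantics not already explained by preceding prompts.
        for j in range(i):
            basis = out[j]
            u -= (u @ basis) / (basis @ basis + 1e-8) * basis
        out[i] = u
    return out

def singular_value_reweight(embeds: np.ndarray, gamma: float = 0.8) -> np.ndarray:
    """Flatten the singular spectrum (gamma is an assumed knob) so dominant
    shared directions are de-emphasized relative to prompt-specific ones."""
    U, S, Vt = np.linalg.svd(embeds, full_matrices=False)
    return (U * (S ** gamma)) @ Vt

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prompts = rng.normal(size=(5, 768))       # e.g. 5 frames, 768-dim embeddings
    decorrelated = gram_schmidt_decorrelate(prompts)
    refined = singular_value_reweight(decorrelated)
    print(refined.shape)                      # (5, 768)
```

In practice the refined embeddings would be fed back into the diffusion model's text conditioning; the identity-preserving cross-attention step operates inside the denoising network and is not sketched here.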
Related papers
- StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives [7.243114047801061]
We propose a zero-shot pipeline that produces temporally coherent, identity-preserving image sequences. StoryTailor delivers expressive interactions and evolving yet stable scenes.
arXiv Detail & Related papers (2026-02-24T16:07:02Z) - ReDiStory: Region-Disentangled Diffusion for Consistent Visual Story Generation [6.4611000755192585]
ReDiStory is a training-free framework that improves multi-frame story generation via inference-time prompt embedding reorganization. It reduces cross-frame interference without modifying diffusion parameters or requiring additional supervision. Experiments on the ConsiStory+ benchmark show consistent gains over 1Prompt1Story on multiple identity consistency metrics.
arXiv Detail & Related papers (2026-02-01T16:04:40Z) - Sissi: Zero-shot Style-guided Image Synthesis via Semantic-style Integration [57.02757226679549]
We introduce a training-free framework that reformulates style-guided synthesis as an in-context learning task. We propose a Dynamic Semantic-Style Integration (DSSI) mechanism that reweights attention between semantic and style visual tokens. Experiments show that our approach achieves high-fidelity stylization with superior semantic-style balance and visual quality.
arXiv Detail & Related papers (2026-01-10T16:01:14Z) - ASemConsist: Adaptive Semantic Feature Control for Training-Free Identity-Consistent Generation [14.341691123354195]
ASemConsist enables explicit semantic control over character identity without sacrificing prompt alignment. Our framework achieves state-of-the-art performance, effectively overcoming prior trade-offs.
arXiv Detail & Related papers (2025-12-29T07:06:57Z) - SGDiff: Scene Graph Guided Diffusion Model for Image Collaborative SegCaptioning [53.638998508418545]
This paper introduces a new task, "Image Collaborative Segmentation and Captioning" (SegCaptioning). SegCaptioning aims to translate a straightforward prompt, like a bounding box around an object, into diverse semantic interpretations represented by (caption, masks) pairs. This task poses significant challenges, including accurately capturing a user's intention from a minimal prompt while simultaneously predicting multiple semantically aligned caption words and masks.
arXiv Detail & Related papers (2025-12-01T18:33:04Z) - Infinite-Story: A Training-Free Consistent Text-to-Image Generation [21.872330710303036]
We present Infinite-Story, a training-free framework for consistent text-to-image (T2I) generation. Our method addresses two key challenges in consistent T2I generation: identity inconsistency and style inconsistency. Our method achieves state-of-the-art generation performance, while offering over 6X faster inference (1.72 seconds per image) than the existing fastest consistent T2I models.
arXiv Detail & Related papers (2025-11-17T05:46:16Z) - ConText: Driving In-context Learning for Text Removal and Segmentation [59.6299939669307]
This paper presents the first study on adapting the visual in-context learning paradigm to optical character recognition tasks. We propose a task-chaining compositor in the form of image-removal-segmentation. We also introduce context-aware aggregation, integrating the chained prompt pattern into the latent query representation.
arXiv Detail & Related papers (2025-06-04T10:06:32Z) - One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt [101.17660804110409]
Text-to-image generation models can create high-quality images from input prompts. However, they struggle to support consistent, identity-preserving generation for storytelling. We propose a novel training-free method for consistent text-to-image generation.
arXiv Detail & Related papers (2025-01-23T10:57:22Z) - Text-Only Training for Visual Storytelling [107.19873669536523]
We formulate visual storytelling as a visual-conditioned story generation problem.
We propose a text-only training method that separates the learning of cross-modality alignment and story generation.
arXiv Detail & Related papers (2023-08-17T09:32:17Z) - FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism.
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-11-09T17:15:38Z)