StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives
- URL: http://arxiv.org/abs/2602.21273v1
- Date: Tue, 24 Feb 2026 16:07:02 GMT
- Title: StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives
- Authors: Jinghao Hu, Yuhe Zhang, GuoHua Geng, Kang Li, Han Zhang
- Abstract summary: We propose a zero-shot pipeline that produces temporally coherent, identity-preserving image sequences. StoryTailor delivers expressive interactions and evolving yet stable scenes.
- Score: 7.243114047801061
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating multi-frame, action-rich visual narratives without fine-tuning faces a threefold tension: action text faithfulness, subject identity fidelity, and cross-frame background continuity. We propose StoryTailor, a zero-shot pipeline that runs on a single RTX 4090 (24 GB) and produces temporally coherent, identity-preserving image sequences from a long narrative prompt, per-subject references, and grounding boxes. Three synergistic modules drive the system: Gaussian-Centered Attention (GCA) to dynamically focus on each subject's core and ease grounding-box overlaps; Action-Boost Singular Value Reweighting (AB-SVR) to amplify action-related directions in the text embedding space; and a Selective Forgetting Cache (SFC) that retains transferable background cues, forgets nonessential history, and selectively surfaces retained cues to build cross-scene semantic ties. Compared with baseline methods, experiments show that CLIP-T improves by up to 10-15%, with DreamSim lower than strong baselines, while CLIP-I stays in a visually acceptable, competitive range. With matched resolution and steps on a 24 GB GPU, inference is faster than FluxKontext. Qualitatively, StoryTailor delivers expressive interactions and evolving yet stable scenes.
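The AB-SVR module described above amplifies action-related directions in the text embedding space via singular value reweighting. The paper does not specify the exact procedure; the following is a minimal sketch of the general idea, assuming the embeddings form a token-by-dimension matrix and that the leading singular directions are the ones to boost. The function name, `boost` factor, and number of boosted directions `k` are all illustrative, not from the paper.

```python
import numpy as np

def svd_reweight(embeddings, boost=1.5, k=2):
    """Amplify the top-k singular directions of a text-embedding matrix.

    A loose sketch of the singular-value-reweighting idea: decompose the
    embedding matrix, scale the leading singular values, and reconstruct.
    """
    U, S, Vt = np.linalg.svd(embeddings, full_matrices=False)
    S = S.copy()
    S[:k] *= boost            # amplify the leading directions
    return (U * S) @ Vt       # reconstruct the reweighted embeddings

# Toy example: 8 tokens, 16-dimensional embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
out = svd_reweight(emb)
print(out.shape)  # (8, 16)
```

In an actual pipeline, the boosted directions would presumably be selected by their correlation with action-related tokens rather than by magnitude alone; this sketch only shows the reweight-and-reconstruct mechanics.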
Related papers
- InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot Transitions [137.1784538723039]
We present a novel framework, dataset, and model that address three critical limitations in video synthesis: background consistency across shots, seamless multi-subject shot-to-shot transitions, and scalability to hour-long narratives. We propose a transition-aware video synthesis module that generates smooth shot transitions for complex scenarios involving multiple subjects entering or exiting frames.
arXiv Detail & Related papers (2026-03-04T02:10:32Z) - DeCorStory: Gram-Schmidt Prompt Embedding Decorrelation for Consistent Storytelling [1.7683026013361776]
DeCorStory is a training-free inference-time framework that reduces inter-frame semantic interference. It applies prompt embedding decorrelation to frame-level semantics, followed by singular value reweighting to strengthen prompt-specific information. Experiments demonstrate consistent improvements in prompt-image alignment, identity consistency, and visual diversity.
arXiv Detail & Related papers (2026-02-01T16:07:30Z) - STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative [55.05324155854762]
We introduce a SToryboard-Anchored GEneration workflow to reformulate the storyboard-based video generation task. Instead of using sparse inputs, we propose STEP2 to predict a structural storyboard composed of start-end frame pairs for each shot. We also contribute the large-scale ConStoryBoard dataset, including high-quality movie clips with fine-grained narratives for story progression, cinematic attributes, and human preferences.
arXiv Detail & Related papers (2025-12-13T15:57:29Z) - TripleFDS: Triple Feature Disentanglement and Synthesis for Scene Text Editing [56.73004765030206]
Scene Text Editing (STE) aims to naturally modify text in images while preserving visual consistency. We propose TripleFDS, a novel framework for STE with disentangled modular attributes. TripleFDS achieves state-of-the-art image fidelity (SSIM of 44.54) and text accuracy (ACC of 93.58%) on mainstream STE benchmarks.
arXiv Detail & Related papers (2025-11-17T14:15:03Z) - Narrative-to-Scene Generation: An LLM-Driven Pipeline for 2D Game Environments [0.09821874476902966]
We present a lightweight pipeline that transforms short narrative prompts into a sequence of 2D tile-based game scenes. Given an LLM-generated narrative, our system identifies three key time frames, extracts spatial predicates, and retrieves visual assets. A layered terrain is generated using Cellular Automata, and objects are placed using spatial rules grounded in the predicate structure.
arXiv Detail & Related papers (2025-08-31T01:45:56Z) - VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [70.61101071902596]
Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence. VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
arXiv Detail & Related papers (2025-03-19T11:59:14Z) - VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [76.3175166538482]
VideoGen-of-Thought (VGoT) is a step-by-step framework that automates multi-shot video synthesis from a single sentence. VGoT addresses three core challenges: narrative fragmentation, visual inconsistency, and transition artifacts. Combined in a training-free pipeline, VGoT surpasses strong baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
arXiv Detail & Related papers (2024-12-03T08:33:50Z) - ContextualStory: Consistent Visual Storytelling with Spatially-Enhanced and Storyline Context [50.572907418430155]
ContextualStory is a framework designed to generate coherent story frames and extend frames for visual storytelling. We introduce a Storyline Contextualizer to enrich context in storyline embedding, and a StoryFlow Adapter to measure scene changes between frames. Experiments on PororoSV and FlintstonesSV datasets demonstrate that ContextualStory significantly outperforms existing SOTA methods in both story visualization and continuation.
arXiv Detail & Related papers (2024-07-13T05:02:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.