StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives
- URL: http://arxiv.org/abs/2602.21273v1
- Date: Tue, 24 Feb 2026 16:07:02 GMT
- Title: StoryTailor: A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives
- Authors: Jinghao Hu, Yuhe Zhang, GuoHua Geng, Kang Li, Han Zhang
- Abstract summary: We propose a zero-shot pipeline that produces temporally coherent, identity-preserving image sequences. StoryTailor delivers expressive interactions and evolving yet stable scenes.
- Score: 7.243114047801061
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating multi-frame, action-rich visual narratives without fine-tuning faces a threefold tension: action text faithfulness, subject identity fidelity, and cross-frame background continuity. We propose StoryTailor, a zero-shot pipeline that runs on a single RTX 4090 (24 GB) and produces temporally coherent, identity-preserving image sequences from a long narrative prompt, per-subject references, and grounding boxes. Three synergistic modules drive the system: Gaussian-Centered Attention (GCA) to dynamically focus on each subject's core and ease grounding-box overlaps; Action-Boost Singular Value Reweighting (AB-SVR) to amplify action-related directions in the text embedding space; and a Selective Forgetting Cache (SFC) that retains transferable background cues, forgets nonessential history, and selectively surfaces retained cues to build cross-scene semantic ties. Compared with baseline methods, experiments show that CLIP-T improves by up to 10-15%, with DreamSim lower than strong baselines, while CLIP-I stays in a visually acceptable, competitive range. With matched resolution and steps on a 24 GB GPU, inference is faster than FluxKontext. Qualitatively, StoryTailor delivers expressive interactions and evolving yet stable scenes.
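The AB-SVR module described above amplifies action-related directions in the text embedding space via singular value reweighting. The paper does not specify the exact procedure; the following is a minimal sketch of the general idea, assuming the embeddings form a token-by-dimension matrix and that the leading singular directions are the ones to boost. The function name, `boost` factor, and number of boosted directions `k` are all illustrative, not from the paper.

```python
import numpy as np

def svd_reweight(embeddings, boost=1.5, k=2):
    """Amplify the top-k singular directions of a text-embedding matrix.

    A loose sketch of the singular-value-reweighting idea: decompose the
    embedding matrix, scale the leading singular values, and reconstruct.
    """
    U, S, Vt = np.linalg.svd(embeddings, full_matrices=False)
    S = S.copy()
    S[:k] *= boost            # amplify the leading directions
    return (U * S) @ Vt       # reconstruct the reweighted embeddings

# Toy example: 8 tokens, 16-dimensional embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
out = svd_reweight(emb)
print(out.shape)  # (8, 16)
```

In an actual pipeline, the boosted directions would presumably be selected by their correlation with action-related tokens rather than by magnitude alone; this sketch only shows the reweight-and-reconstruct mechanics.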
Related papers
- InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot Transitions [137.1784538723039]
We present a novel framework, dataset, and model that address three critical limitations in video synthesis: background consistency across shots, seamless multi-subject shot-to-shot transitions, and scalability to hour-long narratives. We propose a transition-aware video synthesis module that generates smooth shot transitions for complex scenarios involving multiple subjects entering or exiting frames.
arXiv Detail & Related papers (2026-03-04T02:10:32Z) - DeCorStory: Gram-Schmidt Prompt Embedding Decorrelation for Consistent Storytelling [1.7683026013361776]
DeCorStory is a training-free inference-time framework that reduces inter-frame semantic interference. It applies prompt embedding decorrelation to frame-level semantics, followed by singular value reweighting to strengthen prompt-specific information. Experiments demonstrate consistent improvements in prompt-image alignment, identity consistency, and visual diversity.
arXiv Detail & Related papers (2026-02-01T16:07:30Z) - STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative [55.05324155854762]
We introduce a SToryboard-Anchored GEneration workflow to reformulate the storyboard-based video generation task. Instead of using sparse inputs, we propose STEP2 to predict a structural storyboard composed of start-end frame pairs for each shot. We also contribute the large-scale ConStoryBoard dataset, including high-quality movie clips with fine-grained narratives for story progression, cinematic attributes, and human preferences.
arXiv Detail & Related papers (2025-12-13T15:57:29Z) - TripleFDS: Triple Feature Disentanglement and Synthesis for Scene Text Editing [56.73004765030206]
Scene Text Editing (STE) aims to naturally modify text in images while preserving visual consistency. We propose TripleFDS, a novel framework for STE with disentangled modular attributes. TripleFDS achieves state-of-the-art image fidelity (SSIM of 44.54) and text accuracy (ACC of 93.58%) on mainstream STE benchmarks.
arXiv Detail & Related papers (2025-11-17T14:15:03Z) - Narrative-to-Scene Generation: An LLM-Driven Pipeline for 2D Game Environments [0.09821874476902966]
We present a lightweight pipeline that transforms short narrative prompts into a sequence of 2D tile-based game scenes. Given an LLM-generated narrative, our system identifies three key time frames, extracts spatial predicates, and retrieves visual assets. A layered terrain is generated using Cellular Automata, and objects are placed using spatial rules grounded in the predicate structure.
arXiv Detail & Related papers (2025-08-31T01:45:56Z) - VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [70.61101071902596]
Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence. VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
arXiv Detail & Related papers (2025-03-19T11:59:14Z) - VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [76.3175166538482]
VideoGen-of-Thought (VGoT) is a step-by-step framework that automates multi-shot video synthesis from a single sentence. VGoT addresses three core challenges: narrative fragmentation, visual inconsistency, and transition artifacts. Combined in a training-free pipeline, VGoT surpasses strong baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
arXiv Detail & Related papers (2024-12-03T08:33:50Z) - ContextualStory: Consistent Visual Storytelling with Spatially-Enhanced and Storyline Context [50.572907418430155]
ContextualStory is a framework designed to generate coherent story frames and extend frames for visual storytelling. We introduce a Storyline Contextualizer to enrich context in storyline embedding, and a StoryFlow Adapter to measure scene changes between frames. Experiments on PororoSV and FlintstonesSV datasets demonstrate that ContextualStory significantly outperforms existing SOTA methods in both story visualization and continuation.
arXiv Detail & Related papers (2024-07-13T05:02:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.