NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control
- URL: http://arxiv.org/abs/2602.09070v2
- Date: Thu, 12 Feb 2026 02:33:29 GMT
- Title: NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control
- Authors: Yufan Wen, Zhaocheng Liu, YeGuo Hua, Ziyi Guo, Lihua Zhang, Chun Yuan, Jian Wu
- Abstract summary: NarraScore is a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. It employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism, and achieves state-of-the-art consistency and narrative alignment with negligible computational overhead.
- Score: 59.6128550986024
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, currently stalled by three critical impediments: computational scalability, temporal coherence, and, most critically, a pervasive semantic blindness to evolving narrative logic. To bridge these gaps, we propose NarraScore, a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. Uniquely, we repurpose frozen Vision-Language Models (VLMs) as continuous affective sensors, distilling high-dimensional visual streams into dense, narrative-aware Valence-Arousal trajectories. Mechanistically, NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism: a Global Semantic Anchor ensures stylistic stability, while a surgical Token-Level Affective Adapter modulates local tension via direct element-wise residual injection. This minimalist design bypasses the bottlenecks of dense attention and architectural cloning, effectively mitigating the overfitting risks associated with data scarcity. Experiments demonstrate that NarraScore achieves state-of-the-art consistency and narrative alignment with negligible computational overhead, establishing a fully autonomous paradigm for long-video soundtrack generation.
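The abstract pins the mechanism down enough to sketch. Below is a minimal, hypothetical PyTorch rendering of the Dual-Branch Injection as described: a pooled VLM feature acts as the Global Semantic Anchor broadcast over all music tokens, while a small adapter projects the per-token Valence-Arousal trajectory and adds it as an element-wise residual. Module names, shapes, and the pooling choice are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class AffectiveInjection(nn.Module):
    """Hypothetical sketch of NarraScore's Dual-Branch Injection.

    Branch 1 (Global Semantic Anchor): one style vector, pooled from VLM
    features, conditions every music token for stylistic stability.
    Branch 2 (Token-Level Affective Adapter): a per-frame Valence-Arousal
    (V-A) trajectory is projected into the hidden space and added
    element-wise as a residual, steering local tension.
    All module names and shapes are illustrative assumptions.
    """

    def __init__(self, d_model: int, d_vlm: int):
        super().__init__()
        self.anchor_proj = nn.Linear(d_vlm, d_model)   # global branch
        self.va_adapter = nn.Sequential(               # local branch
            nn.Linear(2, d_model // 4),                # (valence, arousal)
            nn.GELU(),
            nn.Linear(d_model // 4, d_model),
        )

    def forward(self, h, vlm_global, va_traj):
        # h:          (B, T, d_model) hidden states of the frozen music generator
        # vlm_global: (B, d_vlm)      pooled VLM feature for the whole video
        # va_traj:    (B, T, 2)       dense Valence-Arousal trajectory
        anchor = self.anchor_proj(vlm_global).unsqueeze(1)  # broadcast over T
        local = self.va_adapter(va_traj)                    # token-aligned
        return h + anchor + local                           # element-wise residual

# Toy usage: inject affect into a batch of 2 sequences, 128 tokens each.
inj = AffectiveInjection(d_model=512, d_vlm=768)
h = torch.randn(2, 128, 512)
vlm_global = torch.randn(2, 768)
va_traj = torch.rand(2, 128, 2) * 2 - 1  # V-A values in [-1, 1]
print(inj(h, vlm_global, va_traj).shape)  # torch.Size([2, 128, 512])
```

Because both branches reduce to additions on the hidden states, the frozen generator's attention stack is untouched, which is consistent with the abstract's claim of negligible overhead and reduced overfitting risk.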
Related papers
- InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot Transitions [137.1784538723039]
We present a novel framework, dataset, and model that address three critical limitations in video synthesis: background consistency across shots, seamless multi-subject shot-to-shot transitions, and scalability to hour-long narratives. We propose a transition-aware video synthesis module that generates smooth shot transitions for complex scenarios involving multiple subjects entering or exiting frames.
arXiv Detail & Related papers (2026-03-04T02:10:32Z)
- NarrativeTrack: Evaluating Video Language Models Beyond the Frame [10.244330591706744]
We introduce NarrativeTrack, the first benchmark to evaluate narrative understanding in MLLMs. We decompose videos into constituent entities and examine their continuity via a Compositional Reasoning (CRP) framework. CRP challenges models to advance from temporal persistence to contextual evolution and fine-grained perceptual reasoning.
arXiv Detail & Related papers (2026-01-03T07:12:55Z)
- SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent [28.12183839499528]
SceneWeaver is a framework that unifies diverse scene synthesis paradigms through tool-based iterative refinement. It can identify semantic inconsistencies, invoke targeted tools, and update the environment over successive iterations. It generalizes effectively to complex scenes with diverse instructions, marking a step toward general-purpose 3D environment generation.
arXiv Detail & Related papers (2025-09-24T09:06:41Z)
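SceneWeaver's critique-invoke-update cycle is easy to picture as a loop. The following is an illustrative Python sketch of such a self-reflective refinement loop; the `Scene`, critic, and tool registry are invented stand-ins, since the summary does not specify SceneWeaver's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Scene:
    objects: list[str] = field(default_factory=list)

def find_inconsistency(scene: Scene) -> str | None:
    """Stand-in critic: flag a placed laptop with no supporting surface."""
    if "laptop" in scene.objects and "table" not in scene.objects:
        return "laptop has no supporting surface"
    return None

def add_table(scene: Scene) -> Scene:
    scene.objects.append("table")
    return scene

# Registry mapping detected issues to targeted repair tools (hypothetical).
TOOLS: dict[str, Callable[[Scene], Scene]] = {
    "laptop has no supporting surface": add_table,
}

def refine(scene: Scene, max_iters: int = 5) -> Scene:
    """Iteratively critique the scene and invoke one targeted tool per issue."""
    for _ in range(max_iters):
        issue = find_inconsistency(scene)
        if issue is None:
            break                      # scene is self-consistent; stop early
        scene = TOOLS[issue](scene)    # targeted repair, then re-check
    return scene

print(refine(Scene(objects=["laptop"])).objects)  # ['laptop', 'table']
```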
- InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing [66.48064661467781]
We introduce sparse-frame video dubbing, a novel paradigm that strategically preserves references to maintain identity, iconic gestures, and camera trajectories. We propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length long sequence dubbing. Comprehensive evaluations on HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2025-08-19T17:55:23Z)
- Aether Weaver: Multimodal Affective Narrative Co-Generation with Dynamic Scene Graphs [0.8702432681310401]
Aether Weaver is a novel framework for narrative co-generation that overcomes limitations of multimodal text-to-visual pipelines. Our system concurrently synthesizes textual narratives, dynamic scene graph representations, visual scenes, and affective soundscapes.
arXiv Detail & Related papers (2025-07-29T15:01:31Z)
- VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [70.61101071902596]
Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence. VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
arXiv Detail & Related papers (2025-03-19T11:59:14Z)
- Text2Story: Advancing Video Storytelling with Text Guidance [19.901781116843942]
We introduce a novel storytelling framework that achieves text-guided video storytelling by integrating scene and action prompts through dynamics-inspired prompt mixing. We propose a dynamics-informed prompt weighting mechanism that adaptively balances the influence of scene and action prompts at each diffusion timestep. To further enhance motion continuity, we incorporate a semantic action representation to encode high-level action semantics into the blending process.
arXiv Detail & Related papers (2025-03-08T19:04:36Z)
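As a concrete reading of that weighting mechanism, here is a hedged sketch: two prompt embeddings are blended with a timestep-dependent weight, shifting from scene-dominant early in denoising to action-dominant late. The cosine schedule and tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
import math
import torch

def mix_prompts(scene_emb: torch.Tensor, action_emb: torch.Tensor,
                t: int, T: int) -> torch.Tensor:
    """Hypothetical dynamics-inspired prompt mixing.

    Early (noisy) timesteps favor the scene prompt to lay out global
    layout and style; later timesteps shift weight toward the action
    prompt to refine motion detail.
    """
    progress = 1.0 - t / T                            # 0 at start, 1 at end
    w_action = 0.5 * (1.0 - math.cos(math.pi * progress))  # assumed schedule
    return (1.0 - w_action) * scene_emb + w_action * action_emb

scene_emb = torch.randn(1, 77, 768)   # e.g. a CLIP-style scene-prompt embedding
action_emb = torch.randn(1, 77, 768)  # action-prompt embedding, same shape
T = 1000
for t in (999, 500, 0):               # t runs from pure noise down to clean
    print(t, mix_prompts(scene_emb, action_emb, t, T).shape)
```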
- VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [76.3175166538482]
VideoGen-of-Thought (VGoT) is a step-by-step framework that automates multi-shot video synthesis from a single sentence. VGoT addresses three core challenges: narrative fragmentation, visual inconsistency, and transition artifacts. Combined in a training-free pipeline, VGoT surpasses strong baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
arXiv Detail & Related papers (2024-12-03T08:33:50Z)
- DiffuVST: Narrating Fictional Scenes with Global-History-Guided Denoising Models [6.668241588219693]
Visual storytelling is increasingly desired beyond real-world imagery.
Current techniques, which typically use autoregressive decoders, suffer from low inference speed and are not well-suited for synthetic scenes.
We propose a novel diffusion-based system, DiffuVST, which models a series of visual descriptions as a single conditional denoising process.
arXiv Detail & Related papers (2023-12-12T08:40:38Z)
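To make "a single conditional denoising process" concrete, the sketch below refines the token embeddings of all panel descriptions jointly, conditioned on per-panel image features, so inference cost does not grow panel-by-panel as it would with an autoregressive decoder. The toy denoiser and refinement schedule are placeholder assumptions, not DiffuVST's architecture.

```python
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    """Tiny placeholder denoiser for joint, image-conditioned text diffusion."""

    def __init__(self, d: int = 256, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.net = nn.TransformerEncoder(layer, num_layers=2)
        self.cond_proj = nn.Linear(d, d)

    def forward(self, x_t, img_feats):
        # Condition by prepending projected image features to the sequence.
        cond = self.cond_proj(img_feats)
        h = self.net(torch.cat([cond, x_t], dim=1))
        return h[:, cond.size(1):]          # predicted clean text embeddings

n_panels, tokens_per_panel, d = 4, 16, 256
img_feats = torch.randn(1, n_panels, d)               # one feature per panel
x = torch.randn(1, n_panels * tokens_per_panel, d)    # start from pure noise
model = Denoiser(d)

# Naive iterative refinement: every panel's text is updated in parallel at
# each step, illustrating why joint denoising avoids autoregressive latency.
for step in range(10):
    with torch.no_grad():
        x0_pred = model(x, img_feats)
    alpha = (step + 1) / 10
    x = alpha * x0_pred + (1 - alpha) * x             # move toward prediction

print(x.shape)  # torch.Size([1, 64, 256])
```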
- Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos [128.70585652795637]
Temporal emotion localization (TEL) presents three unique challenges compared to temporal action localization: the emotions have extremely varied temporal dynamics, and the fine-grained temporal annotations are complicated and labor-intensive.
arXiv Detail & Related papers (2022-08-03T10:00:49Z)