NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control
- URL: http://arxiv.org/abs/2602.09070v2
- Date: Thu, 12 Feb 2026 02:33:29 GMT
- Title: NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control
- Authors: Yufan Wen, Zhaocheng Liu, YeGuo Hua, Ziyi Guo, Lihua Zhang, Chun Yuan, Jian Wu
- Abstract summary: NarraScore is a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. It employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism, and achieves state-of-the-art consistency and narrative alignment with negligible computational overhead.
- Score: 59.6128550986024
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, currently stalled by three critical impediments: computational scalability, temporal coherence, and, most critically, a pervasive semantic blindness to evolving narrative logic. To bridge these gaps, we propose NarraScore, a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. Uniquely, we repurpose frozen Vision-Language Models (VLMs) as continuous affective sensors, distilling high-dimensional visual streams into dense, narrative-aware Valence-Arousal trajectories. Mechanistically, NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism: a Global Semantic Anchor ensures stylistic stability, while a surgical Token-Level Affective Adapter modulates local tension via direct element-wise residual injection. This minimalist design bypasses the bottlenecks of dense attention and architectural cloning, effectively mitigating the overfitting risks associated with data scarcity. Experiments demonstrate that NarraScore achieves state-of-the-art consistency and narrative alignment with negligible computational overhead, establishing a fully autonomous paradigm for long-video soundtrack generation.
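The abstract pins the mechanism down enough to sketch. Below is a minimal, hypothetical PyTorch rendering of the Dual-Branch Injection as described: a pooled VLM feature acts as the Global Semantic Anchor broadcast over all music tokens, while a small adapter projects the per-token Valence-Arousal trajectory and adds it as an element-wise residual. Module names, shapes, and the pooling choice are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class AffectiveInjection(nn.Module):
    """Hypothetical sketch of NarraScore's Dual-Branch Injection.

    Branch 1 (Global Semantic Anchor): one style vector, pooled from VLM
    features, conditions every music token for stylistic stability.
    Branch 2 (Token-Level Affective Adapter): a per-frame Valence-Arousal
    (V-A) trajectory is projected into the hidden space and added
    element-wise as a residual, steering local tension.
    All module names and shapes are illustrative assumptions.
    """

    def __init__(self, d_model: int, d_vlm: int):
        super().__init__()
        self.anchor_proj = nn.Linear(d_vlm, d_model)   # global branch
        self.va_adapter = nn.Sequential(               # local branch
            nn.Linear(2, d_model // 4),                # (valence, arousal)
            nn.GELU(),
            nn.Linear(d_model // 4, d_model),
        )

    def forward(self, h, vlm_global, va_traj):
        # h:          (B, T, d_model) hidden states of the frozen music generator
        # vlm_global: (B, d_vlm)      pooled VLM feature for the whole video
        # va_traj:    (B, T, 2)       dense Valence-Arousal trajectory
        anchor = self.anchor_proj(vlm_global).unsqueeze(1)  # broadcast over T
        local = self.va_adapter(va_traj)                    # token-aligned
        return h + anchor + local                           # element-wise residual

# Toy usage: inject affect into a batch of 2 sequences, 128 tokens each.
inj = AffectiveInjection(d_model=512, d_vlm=768)
h = torch.randn(2, 128, 512)
vlm_global = torch.randn(2, 768)
va_traj = torch.rand(2, 128, 2) * 2 - 1  # V-A values in [-1, 1]
print(inj(h, vlm_global, va_traj).shape)  # torch.Size([2, 128, 512])
```

Because both branches reduce to additions on the hidden states, the frozen generator's attention stack is untouched, which is consistent with the abstract's claim of negligible overhead and reduced overfitting risk.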
Related papers
- InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot Transitions [137.1784538723039]
We present a novel framework, dataset, and model that address three critical limitations in video synthesis: background consistency across shots, seamless multi-subject shot-to-shot transitions, and scalability to hour-long narratives. We propose a transition-aware video synthesis module that generates smooth shot transitions for complex scenarios involving multiple subjects entering or exiting frames.
arXiv Detail & Related papers (2026-03-04T02:10:32Z)
- NarrativeTrack: Evaluating Video Language Models Beyond the Frame [10.244330591706744]
We introduce NarrativeTrack, the first benchmark to evaluate narrative understanding in MLLMs. We decompose videos into constituent entities and examine their continuity via a Compositional Reasoning (CRP) framework. CRP challenges models to advance from temporal persistence to contextual evolution and fine-grained perceptual reasoning.
arXiv Detail & Related papers (2026-01-03T07:12:55Z)
- SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent [28.12183839499528]
SceneWeaver is a framework that unifies diverse scene synthesis paradigms through tool-based iterative refinement. It can identify semantic inconsistencies, invoke targeted tools, and update the environment over successive iterations. It generalizes effectively to complex scenes with diverse instructions, marking a step toward general-purpose 3D environment generation.
arXiv Detail & Related papers (2025-09-24T09:06:41Z)
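SceneWeaver's critique-invoke-update cycle is easy to picture as a loop. The following is an illustrative Python sketch of such a self-reflective refinement loop; the `Scene`, critic, and tool registry are invented stand-ins, since the summary does not specify SceneWeaver's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Scene:
    objects: list[str] = field(default_factory=list)

def find_inconsistency(scene: Scene) -> str | None:
    """Stand-in critic: flag a placed laptop with no supporting surface."""
    if "laptop" in scene.objects and "table" not in scene.objects:
        return "laptop has no supporting surface"
    return None

def add_table(scene: Scene) -> Scene:
    scene.objects.append("table")
    return scene

# Registry mapping detected issues to targeted repair tools (hypothetical).
TOOLS: dict[str, Callable[[Scene], Scene]] = {
    "laptop has no supporting surface": add_table,
}

def refine(scene: Scene, max_iters: int = 5) -> Scene:
    """Iteratively critique the scene and invoke one targeted tool per issue."""
    for _ in range(max_iters):
        issue = find_inconsistency(scene)
        if issue is None:
            break                      # scene is self-consistent; stop early
        scene = TOOLS[issue](scene)    # targeted repair, then re-check
    return scene

print(refine(Scene(objects=["laptop"])).objects)  # ['laptop', 'table']
```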
- InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing [66.48064661467781]
We introduce sparse-frame video dubbing, a novel paradigm that strategically preserves references to maintain identity, iconic gestures, and camera trajectories. We propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length long sequence dubbing. Comprehensive evaluations on HDTF, CelebV-HQ, and EMTD datasets demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2025-08-19T17:55:23Z)
- Aether Weaver: Multimodal Affective Narrative Co-Generation with Dynamic Scene Graphs [0.8702432681310401]
Aether Weaver is a novel framework for narrative co-generation that overcomes limitations of multimodal text-to-visual pipelines. Our system concurrently synthesizes textual narratives, dynamic scene graph representations, visual scenes, and affective soundscapes.
arXiv Detail & Related papers (2025-07-29T15:01:31Z)
- VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [70.61101071902596]
Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence. VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
arXiv Detail & Related papers (2025-03-19T11:59:14Z)
- Text2Story: Advancing Video Storytelling with Text Guidance [19.901781116843942]
We introduce a novel storytelling framework that achieves text-guided video storytelling by integrating scene and action prompts through dynamics-inspired prompt mixing. We propose a dynamics-informed prompt weighting mechanism that adaptively balances the influence of scene and action prompts at each diffusion timestep. To further enhance motion continuity, we incorporate a semantic action representation to encode high-level action semantics into the blending process.
arXiv Detail & Related papers (2025-03-08T19:04:36Z)
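As a concrete reading of that weighting mechanism, here is a hedged sketch: two prompt embeddings are blended with a timestep-dependent weight, shifting from scene-dominant early in denoising to action-dominant late. The cosine schedule and tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
import math
import torch

def mix_prompts(scene_emb: torch.Tensor, action_emb: torch.Tensor,
                t: int, T: int) -> torch.Tensor:
    """Hypothetical dynamics-inspired prompt mixing.

    Early (noisy) timesteps favor the scene prompt to lay out global
    layout and style; later timesteps shift weight toward the action
    prompt to refine motion detail.
    """
    progress = 1.0 - t / T                            # 0 at start, 1 at end
    w_action = 0.5 * (1.0 - math.cos(math.pi * progress))  # assumed schedule
    return (1.0 - w_action) * scene_emb + w_action * action_emb

scene_emb = torch.randn(1, 77, 768)   # e.g. a CLIP-style scene-prompt embedding
action_emb = torch.randn(1, 77, 768)  # action-prompt embedding, same shape
T = 1000
for t in (999, 500, 0):               # t runs from pure noise down to clean
    print(t, mix_prompts(scene_emb, action_emb, t, T).shape)
```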
- VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [76.3175166538482]
VideoGen-of-Thought (VGoT) is a step-by-step framework that automates multi-shot video synthesis from a single sentence. VGoT addresses three core challenges: narrative fragmentation, visual inconsistency, and transition artifacts. Combined in a training-free pipeline, VGoT surpasses strong baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
arXiv Detail & Related papers (2024-12-03T08:33:50Z)
- DiffuVST: Narrating Fictional Scenes with Global-History-Guided Denoising Models [6.668241588219693]
Visual storytelling is increasingly desired beyond real-world imagery.
Current techniques, which typically use autoregressive decoders, suffer from low inference speed and are not well-suited for synthetic scenes.
We propose a novel diffusion-based system, DiffuVST, which models a series of visual descriptions as a single conditional denoising process.
arXiv Detail & Related papers (2023-12-12T08:40:38Z)
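To make "a single conditional denoising process" concrete, the sketch below refines the token embeddings of all panel descriptions jointly, conditioned on per-panel image features, so inference cost does not grow panel-by-panel as it would with an autoregressive decoder. The toy denoiser and refinement schedule are placeholder assumptions, not DiffuVST's architecture.

```python
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    """Tiny placeholder denoiser for joint, image-conditioned text diffusion."""

    def __init__(self, d: int = 256, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.net = nn.TransformerEncoder(layer, num_layers=2)
        self.cond_proj = nn.Linear(d, d)

    def forward(self, x_t, img_feats):
        # Condition by prepending projected image features to the sequence.
        cond = self.cond_proj(img_feats)
        h = self.net(torch.cat([cond, x_t], dim=1))
        return h[:, cond.size(1):]          # predicted clean text embeddings

n_panels, tokens_per_panel, d = 4, 16, 256
img_feats = torch.randn(1, n_panels, d)               # one feature per panel
x = torch.randn(1, n_panels * tokens_per_panel, d)    # start from pure noise
model = Denoiser(d)

# Naive iterative refinement: every panel's text is updated in parallel at
# each step, illustrating why joint denoising avoids autoregressive latency.
for step in range(10):
    with torch.no_grad():
        x0_pred = model(x, img_feats)
    alpha = (step + 1) / 10
    x = alpha * x0_pred + (1 - alpha) * x             # move toward prediction

print(x.shape)  # torch.Size([1, 64, 256])
```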
- Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos [128.70585652795637]
Temporal emotion localization (TEL) presents three unique challenges compared to temporal action localization: the emotions have extremely varied temporal dynamics, and the fine-grained temporal annotations are complicated and labor-intensive.
arXiv Detail & Related papers (2022-08-03T10:00:49Z)