MUSE: A Multi-agent Framework for Unconstrained Story Envisioning via Closed-Loop Cognitive Orchestration
- URL: http://arxiv.org/abs/2602.03028v1
- Date: Tue, 03 Feb 2026 02:55:00 GMT
- Title: MUSE: A Multi-agent Framework for Unconstrained Story Envisioning via Closed-Loop Cognitive Orchestration
- Authors: Wenzhang Sun, Zhenyu Wang, Zhangchi Hu, Chunfeng Wang, Hao Li, Wei Chen,
- Abstract summary: We develop a framework to generate long-form audio-visual stories from a short user prompt. MUSE translates narrative intent into explicit, machine-executable controls over identity, spatial composition, and temporal continuity. MUSE substantially improves long-horizon narrative coherence, cross-modal identity consistency, and cinematic quality compared with representative baselines.
- Score: 16.61208703961799
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating long-form audio-visual stories from a short user prompt remains challenging due to an intent-execution gap, where high-level narrative intent must be preserved across coherent, shot-level multimodal generation over long horizons. Existing approaches typically rely on feed-forward pipelines or prompt-only refinement, which often leads to semantic drift and identity inconsistency as sequences grow longer. We address this challenge by formulating storytelling as a closed-loop constraint enforcement problem and propose MUSE, a multi-agent framework that coordinates generation through an iterative plan-execute-verify-revise loop. MUSE translates narrative intent into explicit, machine-executable controls over identity, spatial composition, and temporal continuity, and applies targeted multimodal feedback to correct violations during generation. To evaluate open-ended storytelling without ground-truth references, we introduce MUSEBench, a reference-free evaluation protocol validated by human judgments. Experiments demonstrate that MUSE substantially improves long-horizon narrative coherence, cross-modal identity consistency, and cinematic quality compared with representative baselines.
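The abstract frames storytelling as closed-loop constraint enforcement via an iterative plan-execute-verify-revise loop. A minimal sketch of that control flow is below; the agent interfaces (`plan`, `execute_shot`, `verify`, `revise`) are hypothetical stand-ins, not the paper's actual API.

```python
# Minimal sketch of a plan-execute-verify-revise loop in the spirit of MUSE.
# All callbacks are hypothetical interfaces, not the paper's implementation.

def run_story_loop(prompt, plan, execute_shot, verify, revise, max_rounds=3):
    """Generate shots from a plan, re-executing any shot whose constraint
    checks fail, with up to max_rounds revision passes per shot."""
    shots = []
    for shot_spec in plan(prompt):          # narrative intent -> shot-level specs
        result = execute_shot(shot_spec)
        for _ in range(max_rounds):
            # e.g. identity, spatial-composition, and continuity checks
            violations = verify(result, shot_spec)
            if not violations:
                break
            shot_spec = revise(shot_spec, violations)  # targeted feedback
            result = execute_shot(shot_spec)
        shots.append(result)
    return shots
```

The loop bounds revision cost per shot while still catching violations before they propagate to later shots.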
Related papers
- NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control [59.6128550986024]
NarraScore is a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism, and achieves state-of-the-art consistency and narrative alignment with negligible computational overhead.
arXiv Detail & Related papers (2026-02-09T09:39:42Z) - Codified Foreshadowing-Payoff Text Generation [67.01182739162142]
Foreshadowing and payoff are ubiquitous narrative devices through which authors introduce commitments early in a story and resolve them through concrete, observable outcomes. Existing evaluations largely overlook this structural failure, focusing on surface-level coherence rather than the logical fulfillment of narrative setups. We introduce Codified Foreshadowing-Payoff Generation, a novel framework that reframes narrative quality through the lens of payoff realization.
arXiv Detail & Related papers (2026-01-11T19:05:37Z) - CoAgent: Collaborative Planning and Consistency Agent for Coherent Video Generation [9.91271343855315]
CoAgent is a framework for coherent video generation that formulates the process as a plan-synthesize-verify pipeline. A Storyboard Planner decomposes the input into structured shot-level plans with explicit entities, spatial relations, and temporal cues. A Global Context Manager maintains entity-level memory to preserve appearance and identity consistency across shots. A pacing-aware editor refines temporal rhythm and transitions to match the desired narrative flow.
arXiv Detail & Related papers (2025-12-27T09:38:34Z) - Living the Novel: A System for Generating Self-Training Timeline-Aware Conversational Agents from Novels [50.43968216132018]
We present an end-to-end system that transforms any literary work into an immersive, multi-character conversational experience. This system is designed to solve two fundamental challenges for LLM-driven characters.
arXiv Detail & Related papers (2025-12-08T11:57:46Z) - Taming a Retrieval Framework to Read Images in Humanlike Manner for Augmenting Generation of MLLMs [23.638717678491986]
Multimodal large language models (MLLMs) often fail in fine-grained visual question answering. We present Human-Like Retrieval-Augmented Generation (HuLiRAG), a framework that stages multimodal reasoning as a "what-where-reweight" cascade.
arXiv Detail & Related papers (2025-10-12T03:22:33Z) - Chronological Passage Assembling in RAG framework for Temporal Question Answering [12.583700669377803]
We propose ChronoRAG, a novel RAG framework specialized for narrative texts. This approach focuses on two essential aspects: refining dispersed document information into coherent and structured passages. We empirically demonstrate the effectiveness of ChronoRAG through experiments on the NarrativeQA and GutenQA datasets.
arXiv Detail & Related papers (2025-08-26T07:23:23Z) - Re:Verse -- Can Your VLM Read a Manga? [14.057881684215047]
Current Vision Language Models (VLMs) demonstrate a critical gap between surface-level recognition and deep narrative reasoning. We introduce a novel evaluation framework that combines fine-grained multimodal annotation, cross-modal embedding analysis, and retrieval-augmented assessment. We conduct the first systematic study of long-form narrative understanding in VLMs through three core evaluation axes: generative storytelling, contextual dialogue grounding, and temporal reasoning.
arXiv Detail & Related papers (2025-08-11T22:40:05Z) - METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark [48.78602579128459]
We introduce METER, a unified benchmark for interpretable forgery detection spanning images, videos, audio, and audio-visual content. Our dataset comprises four tracks, each requiring not only real-vs-fake classification but also evidence-chain-based explanations.
arXiv Detail & Related papers (2025-07-22T03:42:51Z) - Generating Long-form Story Using Dynamic Hierarchical Outlining with Memory-Enhancement [29.435378306293583]
We propose a Dynamic Hierarchical Outlining with Memory-Enhancement long-form story generation method, named DOME, to generate long-form stories with coherent content and plot. A Memory-Enhancement Module (MEM) based on temporal knowledge graphs is introduced to store and access the generated content. Experiments demonstrate that DOME significantly improves the fluency, coherence, and overall quality of generated long stories compared to state-of-the-art methods.
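The memory module described here keeps generated facts tagged by when they were established, so later chapters can query an entity's current state. A rough illustration of such a temporal-knowledge-graph memory is below; the class and method names are hypothetical, not DOME's actual interface.

```python
# Illustrative sketch of a temporal-knowledge-graph memory for long-form
# story generation, loosely following the DOME abstract. Names are
# hypothetical, not the paper's implementation.
from collections import defaultdict

class TemporalMemory:
    """Stores (subject, relation, object) facts tagged with the chapter in
    which they were established, and retrieves an entity's state as of a
    given chapter."""

    def __init__(self):
        # subject -> list of (chapter, relation, object)
        self.facts = defaultdict(list)

    def add(self, chapter, subject, relation, obj):
        self.facts[subject].append((chapter, relation, obj))

    def state_at(self, subject, chapter):
        # Latest value per relation among facts written up to `chapter`.
        state = {}
        for ch, rel, obj in sorted(self.facts[subject]):
            if ch <= chapter:
                state[rel] = obj
        return state
```

Keeping per-chapter timestamps lets the generator avoid contradictions such as referencing a location a character only reaches in a later chapter.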
arXiv Detail & Related papers (2024-12-18T07:50:54Z) - Improving Pacing in Long-Form Story Planning [55.39443681232538]
We propose a CONCrete Outline ConTrol system to improve pacing when automatically generating story outlines.
We first train a concreteness evaluator to judge which of two events is more concrete, then use it in a vaguest-first expansion procedure that aims for uniform pacing.
arXiv Detail & Related papers (2023-11-08T04:58:29Z) - Walking Down the Memory Maze: Beyond Context Limit through Interactive Reading [63.93888816206071]
We introduce MemWalker, a method that processes the long context into a tree of summary nodes. Upon receiving a query, the model navigates this tree in search of relevant information, and responds once it gathers sufficient information.
We show that, beyond effective reading, MemWalker enhances explainability by highlighting its reasoning steps as it interactively reads the text and pinpointing the text segments relevant to the query.
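The navigation MemWalker describes, descending a tree of summary nodes toward the relevant leaf segment, can be sketched as follows. Here `choose` stands in for an LLM call that picks the most relevant child; both it and the node layout are assumptions for illustration, not the paper's implementation.

```python
# Rough sketch of MemWalker-style navigation over a tree of summary nodes.
# `choose` is a hypothetical stand-in for an LLM relevance judgment.

def navigate(node, query, choose):
    """Walk from the root toward a leaf segment by repeatedly asking
    `choose` which child summary is most relevant to the query.
    Returns the leaf text plus the summary path taken (the reasoning trail)."""
    path = [node["summary"]]
    while node.get("children"):
        node = choose(query, node["children"])  # pick most relevant child
        path.append(node["summary"])
    return node["text"], path
```

Returning the path alongside the answer is what gives the method its explainability: each hop records why that branch was followed.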
arXiv Detail & Related papers (2023-10-08T06:18:14Z) - Long Text Generation by Modeling Sentence-Level and Discourse-Level Coherence [59.51720326054546]
We propose a long text generation model, which can represent the prefix sentences at sentence level and discourse level in the decoding process.
Our model can generate more coherent texts than state-of-the-art baselines.
arXiv Detail & Related papers (2021-05-19T07:29:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.