CoAgent: Collaborative Planning and Consistency Agent for Coherent Video Generation
- URL: http://arxiv.org/abs/2512.22536v1
- Date: Sat, 27 Dec 2025 09:38:34 GMT
- Title: CoAgent: Collaborative Planning and Consistency Agent for Coherent Video Generation
- Authors: Qinglin Zeng, Kaitong Cai, Ruiqi Chen, Qinhan Lv, Keze Wang
- Abstract summary: CoAgent is a framework for coherent video generation that formulates the process as a plan-synthesize-verify pipeline. A Storyboard Planner decomposes the input into structured shot-level plans with explicit entities, spatial relations, and temporal cues. A Global Context Manager maintains entity-level memory to preserve appearance and identity consistency across shots. A pacing-aware editor refines temporal rhythm and transitions to match the desired narrative flow.
- Score: 9.91271343855315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Maintaining narrative coherence and visual consistency remains a central challenge in open-domain video generation. Existing text-to-video models often treat each shot independently, resulting in identity drift, scene inconsistency, and unstable temporal structure. We propose CoAgent, a collaborative and closed-loop framework for coherent video generation that formulates the process as a plan-synthesize-verify pipeline. Given a user prompt, style reference, and pacing constraints, a Storyboard Planner decomposes the input into structured shot-level plans with explicit entities, spatial relations, and temporal cues. A Global Context Manager maintains entity-level memory to preserve appearance and identity consistency across shots. Each shot is then generated by a Synthesis Module under the guidance of a Visual Consistency Controller, while a Verifier Agent evaluates intermediate results using vision-language reasoning and triggers selective regeneration when inconsistencies are detected. Finally, a pacing-aware editor refines temporal rhythm and transitions to match the desired narrative flow. Extensive experiments demonstrate that CoAgent significantly improves coherence, visual consistency, and narrative quality in long-form video generation.
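To make this pipeline concrete, the following is a minimal Python sketch of the plan-synthesize-verify loop the abstract describes. All class names, method signatures, and the retry policy are illustrative assumptions; the paper does not specify an implementation.

```python
# Hypothetical sketch of CoAgent's plan-synthesize-verify loop.
# Component names and signatures are assumptions for illustration,
# not the authors' actual API.
from dataclasses import dataclass, field

@dataclass
class ShotPlan:
    entities: list[str]           # e.g. ["knight", "dragon"]
    spatial_relations: list[str]  # e.g. ["knight left of dragon"]
    temporal_cues: str            # e.g. "slow pan, 3 seconds"

@dataclass
class EntityMemory:
    """Global Context Manager: entity-level appearance memory."""
    appearance: dict[str, str] = field(default_factory=dict)

    def update(self, entity: str, description: str) -> None:
        # Keep the first recorded description to anchor identity.
        self.appearance.setdefault(entity, description)

def generate_video(prompt, planner, synthesizer, verifier, editor,
                   max_retries: int = 2):
    shots = planner.plan(prompt)   # Storyboard Planner: shot-level plans
    memory = EntityMemory()        # Global Context Manager
    clips = []
    for shot in shots:
        for _ in range(max_retries + 1):
            clip = synthesizer.render(shot, memory)  # Synthesis Module
            ok, feedback = verifier.check(clip, shot, memory)  # Verifier Agent
            if ok:
                break
            shot = planner.revise(shot, feedback)  # selective regeneration
        for entity in shot.entities:
            memory.update(entity, synthesizer.describe(clip, entity))
        clips.append(clip)
    return editor.assemble(clips)  # pacing-aware editor: rhythm, transitions
```

The closed-loop structure is the point: the Verifier Agent's feedback flows back into the plan, so only inconsistent shots are regenerated rather than the whole video.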
Related papers
- InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot Transitions [137.1784538723039]
We present a novel framework, dataset, and model that address three critical limitations in video synthesis: background consistency across shots, seamless multi-subject shot-to-shot transitions, and scalability to hour-long narratives. We propose a transition-aware video synthesis module that generates smooth shot transitions for complex scenarios involving multiple subjects entering or exiting frames.
arXiv Detail & Related papers (2026-03-04T02:10:32Z) - VideoMemory: Toward Consistent Video Generation via Memory Integration [28.605816634949814]
VideoMemory integrates narrative planning with visual generation through a Dynamic Memory Bank, which stores explicit visual and semantic descriptors for characters, props, and backgrounds. This retrieval-update mechanism enables consistent portrayal of entities across distant shots and supports coherent long-form generation (a minimal memory-bank sketch follows this list).
arXiv Detail & Related papers (2026-01-07T07:10:32Z) - STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative [55.05324155854762]
We introduce a SToryboard-Anchored GEneration (STAGE) workflow to reformulate the multi-shot narrative video generation task. Instead of using sparse inputs, we propose STEP2 to predict a structural storyboard composed of start-end frame pairs for each shot. We also contribute the large-scale ConStoryBoard dataset, including high-quality movie clips with fine-grained annotations covering story progression, cinematic attributes, and human preferences.
arXiv Detail & Related papers (2025-12-13T15:57:29Z) - AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation [58.844504598618094]
We propose AlcheMinT, a unified framework that introduces explicit timestamp conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that encodes temporal intervals, which in our case are associated with subject identities. We incorporate subject-descriptive text tokens to strengthen the binding between visual identity and video captions, mitigating ambiguity during generation.
arXiv Detail & Related papers (2025-12-11T18:59:34Z) - Prompt-Driven Agentic Video Editing System: Autonomous Comprehension of Long-Form, Story-Driven Media [35.60423976124236]
We present a prompt-driven, modular editing system that helps creators restructure multi-hour content through free-form prompts rather than timelines. At its core is a semantic indexing pipeline that builds a global narrative via temporal segmentation, guided memory compression, and cross-granularity fusion. Our system scales prompt-driven editing, preserves narrative coherence, and balances automation with creator control.
arXiv Detail & Related papers (2025-09-20T21:22:56Z) - TemCoCo: Temporally Consistent Multi-modal Video Fusion with Visual-Semantic Collaboration [36.255570023185506]
Existing multi-modal fusion methods apply static frame-based image fusion techniques directly to video fusion tasks. We propose the first video fusion framework that explicitly incorporates temporal modeling with visual-semantic collaboration.
arXiv Detail & Related papers (2025-08-25T09:12:55Z) - Cut2Next: Generating Next Shot via In-Context Tuning [93.14744132897428]
Multi-shot generation demands purposeful, film-like transitions and strict cinematic continuity. Current methods often prioritize basic visual consistency, neglecting crucial editing patterns. We introduce Next Shot Generation (NSG): synthesizing a subsequent, high-quality shot that adheres to professional editing patterns.
arXiv Detail & Related papers (2025-08-11T17:56:59Z) - Text2Story: Advancing Video Storytelling with Text Guidance [19.901781116843942]
We introduce a novel storytelling framework that integrates scene and action prompts through dynamics-inspired prompt mixing. We propose a dynamics-informed prompt weighting mechanism that adaptively balances the influence of scene and action prompts at each diffusion timestep (a toy weighting sketch follows this list). To further enhance motion continuity, we incorporate a semantic action representation that encodes high-level action semantics into the blending process.
arXiv Detail & Related papers (2025-03-08T19:04:36Z) - StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration [88.94832383850533]
We propose a multi-agent framework designed for Customized Storytelling Video Generation (CSVG).
StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process.
Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency.
Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.
arXiv Detail & Related papers (2024-11-07T18:00:33Z) - Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator [59.589919015669274]
This study focuses on data- and cost-efficient zero-shot text-to-video generation.
We propose a novel Free-Bloom pipeline that harnesses large language models (LLMs) as the director to generate a semantic-coherence prompt sequence.
We also propose a series of annotative modifications to adapt LDMs in the reverse process, including joint noise sampling, step-aware attention shift, and dual-path interpolation.
arXiv Detail & Related papers (2023-09-25T19:42:16Z)
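As noted in the VideoMemory entry above, a retrieval-update memory bank can be sketched in a few lines. The class, its fields, and the usage pattern below are assumptions for illustration, not the paper's actual design.

```python
# Hypothetical sketch of a Dynamic Memory Bank for entity consistency.
# Field names and the retrieval rule are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class EntityRecord:
    name: str                 # e.g. "protagonist"
    visual_descriptor: str    # e.g. "red coat, short black hair"
    semantic_descriptor: str  # e.g. "middle-aged detective"

@dataclass
class DynamicMemoryBank:
    records: dict[str, EntityRecord] = field(default_factory=dict)

    def retrieve(self, shot_entities: list[str]) -> list[EntityRecord]:
        """Fetch stored descriptors for entities in the upcoming shot so
        the generator can condition on a consistent appearance."""
        return [self.records[e] for e in shot_entities if e in self.records]

    def update(self, record: EntityRecord) -> None:
        """Insert a new entity or refresh its descriptors after a shot
        has been generated."""
        self.records[record.name] = record

# Usage: retrieve before generating a shot, update afterwards.
bank = DynamicMemoryBank()
bank.update(EntityRecord("protagonist", "red coat, short black hair",
                         "middle-aged detective"))
context = bank.retrieve(["protagonist", "sidekick"])  # only known entities
```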
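The timestep-adaptive blending mentioned in the Text2Story entry can likewise be pictured with a toy example. The cosine schedule and the convention that early denoising steps favor the scene prompt are assumptions for illustration, not the paper's actual mechanism.

```python
# Hypothetical sketch of dynamics-informed prompt weighting: the scene
# and action prompt embeddings are blended with a weight that varies
# over diffusion timesteps. The schedule w(t) is an assumption.
import math

def blended_embedding(scene_emb: list[float], action_emb: list[float],
                      t: int, T: int) -> list[float]:
    """Blend scene/action embeddings at denoising step t of T (t: 0..T)."""
    w = 0.5 * (1.0 + math.cos(math.pi * t / T))  # decays from 1 to 0
    return [w * s + (1.0 - w) * a for s, a in zip(scene_emb, action_emb)]

# Early steps (t small) emphasize the scene prompt for global layout;
# later steps shift weight toward the action prompt for motion detail.
print(blended_embedding([1.0, 0.0], [0.0, 1.0], t=0, T=10))   # [1.0, 0.0]
print(blended_embedding([1.0, 0.0], [0.0, 1.0], t=10, T=10))  # [0.0, 1.0]
```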
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.