CoAgent: Collaborative Planning and Consistency Agent for Coherent Video Generation
- URL: http://arxiv.org/abs/2512.22536v1
- Date: Sat, 27 Dec 2025 09:38:34 GMT
- Title: CoAgent: Collaborative Planning and Consistency Agent for Coherent Video Generation
- Authors: Qinglin Zeng, Kaitong Cai, Ruiqi Chen, Qinhan Lv, Keze Wang
- Abstract summary: CoAgent is a framework for coherent video generation that formulates the process as a plan-synthesize-verify pipeline. A Storyboard Planner decomposes the input into structured shot-level plans with explicit entities, spatial relations, and temporal cues. A Global Context Manager maintains entity-level memory to preserve appearance and identity consistency across shots. A pacing-aware editor refines temporal rhythm and transitions to match the desired narrative flow.
- Score: 9.91271343855315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Maintaining narrative coherence and visual consistency remains a central challenge in open-domain video generation. Existing text-to-video models often treat each shot independently, resulting in identity drift, scene inconsistency, and unstable temporal structure. We propose CoAgent, a collaborative and closed-loop framework for coherent video generation that formulates the process as a plan-synthesize-verify pipeline. Given a user prompt, style reference, and pacing constraints, a Storyboard Planner decomposes the input into structured shot-level plans with explicit entities, spatial relations, and temporal cues. A Global Context Manager maintains entity-level memory to preserve appearance and identity consistency across shots. Each shot is then generated by a Synthesis Module under the guidance of a Visual Consistency Controller, while a Verifier Agent evaluates intermediate results using vision-language reasoning and triggers selective regeneration when inconsistencies are detected. Finally, a pacing-aware editor refines temporal rhythm and transitions to match the desired narrative flow. Extensive experiments demonstrate that CoAgent significantly improves coherence, visual consistency, and narrative quality in long-form video generation.
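To make this pipeline concrete, the following is a minimal Python sketch of the plan-synthesize-verify loop the abstract describes. All class names, method signatures, and the retry policy are illustrative assumptions; the paper does not specify an implementation.

```python
# Hypothetical sketch of CoAgent's plan-synthesize-verify loop.
# Component names and signatures are assumptions for illustration,
# not the authors' actual API.
from dataclasses import dataclass, field

@dataclass
class ShotPlan:
    entities: list[str]           # e.g. ["knight", "dragon"]
    spatial_relations: list[str]  # e.g. ["knight left of dragon"]
    temporal_cues: str            # e.g. "slow pan, 3 seconds"

@dataclass
class EntityMemory:
    """Global Context Manager: entity-level appearance memory."""
    appearance: dict[str, str] = field(default_factory=dict)

    def update(self, entity: str, description: str) -> None:
        # Keep the first recorded description to anchor identity.
        self.appearance.setdefault(entity, description)

def generate_video(prompt, planner, synthesizer, verifier, editor,
                   max_retries: int = 2):
    shots = planner.plan(prompt)   # Storyboard Planner: shot-level plans
    memory = EntityMemory()        # Global Context Manager
    clips = []
    for shot in shots:
        for _ in range(max_retries + 1):
            clip = synthesizer.render(shot, memory)  # Synthesis Module
            ok, feedback = verifier.check(clip, shot, memory)  # Verifier Agent
            if ok:
                break
            shot = planner.revise(shot, feedback)  # selective regeneration
        for entity in shot.entities:
            memory.update(entity, synthesizer.describe(clip, entity))
        clips.append(clip)
    return editor.assemble(clips)  # pacing-aware editor: rhythm, transitions
```

The closed-loop structure is the point: the Verifier Agent's feedback flows back into the plan, so only inconsistent shots are regenerated rather than the whole video.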
Related papers
- InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot Transitions [137.1784538723039]
We present a novel framework, dataset, and model that address three critical limitations in video synthesis: background consistency across shots, seamless multi-subject shot-to-shot transitions, and scalability to hour-long narratives. We propose a transition-aware video synthesis module that generates smooth shot transitions for complex scenarios involving multiple subjects entering or exiting frames.
arXiv Detail & Related papers (2026-03-04T02:10:32Z) - VideoMemory: Toward Consistent Video Generation via Memory Integration [28.605816634949814]
VideoMemory integrates narrative planning with visual generation through a Dynamic Memory Bank, which stores explicit visual and semantic descriptors for characters, props, and backgrounds. This retrieval-update mechanism enables consistent portrayal of entities across distant shots and supports coherent long-form generation (a minimal memory-bank sketch follows this list).
arXiv Detail & Related papers (2026-01-07T07:10:32Z) - STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative [55.05324155854762]
We introduce a SToryboard-Anchored GEneration (STAGE) workflow to reformulate the multi-shot narrative video generation task. Instead of using sparse inputs, we propose STEP2 to predict a structural storyboard composed of start-end frame pairs for each shot. We also contribute the large-scale ConStoryBoard dataset, including high-quality movie clips with fine-grained annotations covering story progression, cinematic attributes, and human preferences.
arXiv Detail & Related papers (2025-12-13T15:57:29Z) - AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation [58.844504598618094]
We propose AlcheMinT, a unified framework that introduces explicit timestamp conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that encodes temporal intervals, which in our case are associated with subject identities. We incorporate subject-descriptive text tokens to strengthen the binding between visual identity and video captions, mitigating ambiguity during generation.
arXiv Detail & Related papers (2025-12-11T18:59:34Z) - Prompt-Driven Agentic Video Editing System: Autonomous Comprehension of Long-Form, Story-Driven Media [35.60423976124236]
We present a prompt-driven, modular editing system that helps creators restructure multi-hour content through free-form prompts rather than timelines. At its core is a semantic indexing pipeline that builds a global narrative via temporal segmentation, guided memory compression, and cross-granularity fusion. Our system scales prompt-driven editing, preserves narrative coherence, and balances automation with creator control.
arXiv Detail & Related papers (2025-09-20T21:22:56Z) - TemCoCo: Temporally Consistent Multi-modal Video Fusion with Visual-Semantic Collaboration [36.255570023185506]
Existing multi-modal fusion methods apply static frame-based image fusion techniques directly to video fusion tasks. We propose the first video fusion framework that explicitly incorporates temporal modeling with visual-semantic collaboration.
arXiv Detail & Related papers (2025-08-25T09:12:55Z) - Cut2Next: Generating Next Shot via In-Context Tuning [93.14744132897428]
Multi-shot generation demands purposeful, film-like transitions and strict cinematic continuity. Current methods often prioritize basic visual consistency, neglecting crucial editing patterns. We introduce Next Shot Generation (NSG): synthesizing a subsequent, high-quality shot that adheres to professional editing patterns.
arXiv Detail & Related papers (2025-08-11T17:56:59Z) - Text2Story: Advancing Video Storytelling with Text Guidance [19.901781116843942]
We introduce a novel storytelling framework that integrates scene and action prompts through dynamics-inspired prompt mixing. We propose a dynamics-informed prompt weighting mechanism that adaptively balances the influence of scene and action prompts at each diffusion timestep (a toy weighting sketch follows this list). To further enhance motion continuity, we incorporate a semantic action representation that encodes high-level action semantics into the blending process.
arXiv Detail & Related papers (2025-03-08T19:04:36Z) - StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration [88.94832383850533]
We propose a multi-agent framework designed for Customized Storytelling Video Generation (CSVG).
StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process.
Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency.
Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.
arXiv Detail & Related papers (2024-11-07T18:00:33Z) - Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator [59.589919015669274]
This study focuses on data- and cost-efficient zero-shot text-to-video generation.
We propose a novel Free-Bloom pipeline that harnesses large language models (LLMs) as the director to generate a semantic-coherence prompt sequence.
We also propose a series of annotative modifications to adapt LDMs in the reverse process, including joint noise sampling, step-aware attention shift, and dual-path interpolation.
arXiv Detail & Related papers (2023-09-25T19:42:16Z)
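As noted in the VideoMemory entry above, a retrieval-update memory bank can be sketched in a few lines. The class, its fields, and the usage pattern below are assumptions for illustration, not the paper's actual design.

```python
# Hypothetical sketch of a Dynamic Memory Bank for entity consistency.
# Field names and the retrieval rule are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class EntityRecord:
    name: str                 # e.g. "protagonist"
    visual_descriptor: str    # e.g. "red coat, short black hair"
    semantic_descriptor: str  # e.g. "middle-aged detective"

@dataclass
class DynamicMemoryBank:
    records: dict[str, EntityRecord] = field(default_factory=dict)

    def retrieve(self, shot_entities: list[str]) -> list[EntityRecord]:
        """Fetch stored descriptors for entities in the upcoming shot so
        the generator can condition on a consistent appearance."""
        return [self.records[e] for e in shot_entities if e in self.records]

    def update(self, record: EntityRecord) -> None:
        """Insert a new entity or refresh its descriptors after a shot
        has been generated."""
        self.records[record.name] = record

# Usage: retrieve before generating a shot, update afterwards.
bank = DynamicMemoryBank()
bank.update(EntityRecord("protagonist", "red coat, short black hair",
                         "middle-aged detective"))
context = bank.retrieve(["protagonist", "sidekick"])  # only known entities
```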
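The timestep-adaptive blending mentioned in the Text2Story entry can likewise be pictured with a toy example. The cosine schedule and the convention that early denoising steps favor the scene prompt are assumptions for illustration, not the paper's actual mechanism.

```python
# Hypothetical sketch of dynamics-informed prompt weighting: the scene
# and action prompt embeddings are blended with a weight that varies
# over diffusion timesteps. The schedule w(t) is an assumption.
import math

def blended_embedding(scene_emb: list[float], action_emb: list[float],
                      t: int, T: int) -> list[float]:
    """Blend scene/action embeddings at denoising step t of T (t: 0..T)."""
    w = 0.5 * (1.0 + math.cos(math.pi * t / T))  # decays from 1 to 0
    return [w * s + (1.0 - w) * a for s, a in zip(scene_emb, action_emb)]

# Early steps (t small) emphasize the scene prompt for global layout;
# later steps shift weight toward the action prompt for motion detail.
print(blended_embedding([1.0, 0.0], [0.0, 1.0], t=0, T=10))   # [1.0, 0.0]
print(blended_embedding([1.0, 0.0], [0.0, 1.0], t=10, T=10))  # [0.0, 1.0]
```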
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.