UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
- URL: http://arxiv.org/abs/2511.08521v1
- Date: Wed, 12 Nov 2025 02:02:50 GMT
- Title: UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
- Authors: Zhengyang Liang, Daoan Zhang, Huichi Zhou, Rui Huang, Bobo Li, Yuechen Zhang, Shengqiong Wu, Xiaohan Wang, Jiebo Luo, Lizi Liao, Hao Fei
- Abstract summary: We introduce UniVA, an omni-capable multi-agent framework for next-generation video generalists. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation.
- Score: 107.04196084992907
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: While specialized AI models excel at isolated video tasks like generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intentions and decomposes them into structured video-processing steps, while executor agents execute these through modular, MCP-based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi-level memory (global knowledge, task context, and user-specific preferences), UniVA sustains long-horizon reasoning, contextual continuity, and inter-agent communication, enabling interactive and self-reflective video creation with full traceability. This design enables iterative and any-conditioned video workflows (e.g., text/image/video-conditioned generation $\rightarrow$ multi-round editing $\rightarrow$ object segmentation $\rightarrow$ compositional synthesis) that were previously cumbersome to achieve with single-purpose models or monolithic video-language models. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA-Bench are fully open-sourced, aiming to catalyze research on interactive, agentic, and general-purpose video intelligence for the next generation of multimodal AI systems. (https://univa.online/)
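The Plan-and-Act workflow described in the abstract can be sketched in miniature. Everything below is illustrative only: the class and function names, the step schema, and the lambda tool stubs are assumptions standing in for UniVA's actual MCP-based tool servers, not its real API. A planner decomposes the request into structured steps, executors dispatch each step to a tool server, and a layered memory records the trace for full traceability.

```python
# Hypothetical sketch of a Plan-and-Act dual-agent loop (illustrative names,
# not the actual UniVA API).
from dataclasses import dataclass, field

@dataclass
class Memory:
    """Hierarchical memory: global knowledge, task context, user preferences."""
    global_knowledge: dict = field(default_factory=dict)
    task_context: list = field(default_factory=list)
    user_prefs: dict = field(default_factory=dict)

def plan(request: str) -> list[dict]:
    """Planner agent: decompose user intent into structured steps (stubbed)."""
    return [
        {"tool": "generation", "args": {"prompt": request}},
        {"tool": "editing", "args": {"op": "refine"}},
        {"tool": "segmentation", "args": {"target": "main_object"}},
    ]

# Stand-ins for modular MCP-based tool servers (analysis, generation, editing, ...).
TOOL_SERVERS = {
    "generation": lambda args, mem: f"video<{args['prompt']}>",
    "editing": lambda args, mem: f"edited<{mem.task_context[-1]['result']}>",
    "segmentation": lambda args, mem: f"masks<{args['target']}>",
}

def execute(steps: list[dict], mem: Memory) -> str:
    """Executor agents: run each step and log results for traceability."""
    for step in steps:
        result = TOOL_SERVERS[step["tool"]](step["args"], mem)
        mem.task_context.append({"step": step, "result": result})
    return mem.task_context[-1]["result"]

mem = Memory()
out = execute(plan("a cat surfing at sunset"), mem)
print(out)                    # final artifact of the workflow
print(len(mem.task_context))  # 3 -- full trace of executed steps
```

Each executed step lands in `task_context`, which is what would let a real system support multi-round editing and self-reflection over earlier outputs.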
Related papers
- Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation [15.004606775581356]
LAVES is a hierarchical multi-agent system for generating high-quality instructional videos from educational problems. In large-scale deployments, LAVES achieves a throughput exceeding one million videos per day, delivering over a 95% reduction in cost.
arXiv Detail & Related papers (2026-02-12T10:14:36Z)
- A Versatile Multimodal Agent for Multimedia Content Generation [66.86040734610073]
We propose a MultiMedia-Agent designed to automate complex content creation tasks. Our agent system includes a data generation pipeline, a tool library for content creation, and a set of metrics for evaluating preference alignment.
arXiv Detail & Related papers (2026-01-06T18:49:47Z)
- Hollywood Town: Long-Video Generation via Cross-Modal Multi-Agent Orchestration [73.65102758687289]
This study introduces three innovations to improve multi-agent collaboration. First, we propose OmniAgent, a hierarchical, graph-based multi-agent framework for long video generation. Second, inspired by context engineering, we propose hypergraph nodes that enable temporary group discussions.
arXiv Detail & Related papers (2025-10-25T20:34:18Z)
- Communicative Agents for Slideshow Storytelling Video Generation based on LLMs [4.389263274945811]
Video-Generation-Team (VGTeam) is a novel slideshow video generation system designed to redefine the video creation pipeline. By emulating the sequential stages of traditional video production, VGTeam achieves remarkable improvements in both efficiency and scalability. On average, the system generates videos at a cost of only $0.103, with a successful generation rate of 98.4%.
arXiv Detail & Related papers (2025-09-01T09:04:07Z)
- Yan: Foundational Interactive Video Generation [25.398980906541524]
Yan is a foundational framework for interactive video generation, covering the entire pipeline from simulation and generation to editing. We design a highly-compressed, low-latency 3D-VAE coupled with a KV-cache-based shift-window denoising inference process. We propose a hybrid model that explicitly disentangles interactive mechanics simulation from visual rendering, enabling multi-granularity video content editing during interaction through text.
arXiv Detail & Related papers (2025-08-12T03:34:21Z)
- AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding [73.60257070465377]
AdaVideoRAG is a novel framework that adapts retrieval based on query complexity using a lightweight intent classifier. Our framework employs an Omni-Knowledge Indexing module to build hierarchical databases from text (captions, ASR, OCR), visual features, and semantic graphs. Experiments demonstrate improved efficiency and accuracy for long-video understanding, with seamless integration into existing MLLMs.
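The query-adaptive retrieval idea in the AdaVideoRAG summary can be illustrated with a toy sketch. None of this is the paper's actual code: the keyword heuristic stands in for a trained intent classifier, and the three toy dictionaries stand in for the text, visual-feature, and semantic-graph indexes.

```python
# Illustrative sketch (not AdaVideoRAG's implementation): route simple queries
# to a cheap caption index and complex queries to the full index hierarchy.

def classify_intent(query: str) -> str:
    """Toy intent classifier: keyword heuristic standing in for a trained model."""
    complex_cues = ("why", "how", "relationship", "compare")
    return "complex" if any(c in query.lower() for c in complex_cues) else "simple"

INDEXES = {
    "captions": {"q1": "a dog runs"},      # text index (captions/ASR/OCR)
    "visual": {"q1": "frame-embeddings"},  # visual-feature index
    "graph": {"q1": "dog->runs->park"},    # semantic-graph index
}

def retrieve(query: str, key: str) -> dict:
    """Consult only the indexes the query's complexity warrants."""
    if classify_intent(query) == "simple":
        sources = ["captions"]                      # cheap path
    else:
        sources = ["captions", "visual", "graph"]   # full hierarchy
    return {s: INDEXES[s].get(key) for s in sources}

print(retrieve("what animal appears?", "q1"))         # captions only
print(retrieve("why does the dog run there?", "q1"))  # all three indexes
```

The design point is that retrieval cost scales with query difficulty: a factual lookup never touches the expensive visual or graph indexes.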
arXiv Detail & Related papers (2025-06-16T15:18:15Z)
- InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction [35.285466934451904]
This paper introduces InfantAgent-Next, a generalist agent capable of interacting with computers in a multimodal manner. Unlike existing approaches that either build intricate workflows around a single large model or only provide modularity, our agent integrates tool-based and pure vision agents.
arXiv Detail & Related papers (2025-05-16T05:43:27Z)
- StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration [88.94832383850533]
We propose a multi-agent framework designed for Customized Storytelling Video Generation (CSVG).
StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process.
Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency.
Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.
arXiv Detail & Related papers (2024-11-07T18:00:33Z)
- How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES)
CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions.
Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z)
- AesopAgent: Agent-driven Evolutionary System on Story-to-Video Production [34.665965986359645]
AesopAgent is an Agent-driven Evolutionary System on Story-to-Video Production.
The system integrates multiple generative capabilities within a unified framework, so that individual users can leverage these modules easily.
Our AesopAgent achieves state-of-the-art performance compared with many previous works in visual storytelling.
arXiv Detail & Related papers (2024-03-12T02:30:50Z)
- LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling [102.42424022921243]
Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks.
Experiments show that this unified framework achieves competitive performance on 14 VidL benchmarks.
arXiv Detail & Related papers (2022-06-14T20:43:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.