UniVideo: Unified Understanding, Generation, and Editing for Videos
- URL: http://arxiv.org/abs/2510.08377v2
- Date: Tue, 21 Oct 2025 17:40:13 GMT
- Title: UniVideo: Unified Understanding, Generation, and Editing for Videos
- Authors: Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen,
- Abstract summary: We present UniVideo, a versatile framework that extends unified modeling to the video domain.<n>UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm.<n>We show that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing.
- Score: 60.90505182401494
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.
Related papers
- SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model [50.329905849190176]
SkyReels V4 is a unified multi modal video foundation model for joint video audio generation, inpainting, and editing.<n>Supports up to 1080p resolution, 32 FPS, and 15 second duration, enabling high fidelity, multi shot, cinema level video generation with synchronized audio.
arXiv Detail & Related papers (2026-02-25T11:47:00Z) - Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing [93.8111348452324]
Tele- Omni is a unified framework for video generation and editing that follows multimodal instructions.<n>It supports text-to-video generation, image-to-video generation, first-last-frame video generation, in-context video generation, and in-context video editing.
arXiv Detail & Related papers (2026-02-10T10:01:16Z) - Omni-Video 2: Scaling MLLM-Conditioned Diffusion for Unified Video Generation and Editing [21.525921468472685]
We present a scalable and computationally efficient model that connects pretrained multimodal large-language models (MLLMs) with video diffusion models for unified video generation and editing.<n>Our key idea is to exploit the understanding and reasoning capabilities of MLLMs to produce explicit target captions to interpret user instructions.<n>We evaluate the performance of Omni-Video 2 on the FiVE benchmark for fine-grained video editing and the VBench benchmark for text-to-video generation.
arXiv Detail & Related papers (2026-02-09T15:56:05Z) - VINO: A Unified Visual Generator with Interleaved OmniModal Context [36.71641694179164]
VINO is a unified visual generator that performs image and video generation and editing within a single framework.<n>Instead of relying on task-specific models or independent modules for each modality, VINO uses a shared diffusion backbone.
arXiv Detail & Related papers (2026-01-05T18:56:34Z) - OmniV2V: Versatile Video Generation and Editing via Dynamic Content Manipulation [22.970558073760433]
We propose OmniV2V, a video model capable of generating and editing videos across different scenarios based on various operations.<n>In addition, we design a visual-text instruction module based on LLaVA, enabling the model to effectively understand the correspondence between visual content and instructions.<n>Experiments show that OmniV2V works as well as, and sometimes better than, the best existing open-source and commercial models for many video generation and editing tasks.
arXiv Detail & Related papers (2025-06-02T15:42:06Z) - VEGGIE: Instructional Editing and Reasoning of Video Concepts with Grounded Generation [67.31149310468801]
We introduce VEGGIE, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions.<n> VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model.
arXiv Detail & Related papers (2025-03-18T15:31:12Z) - DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation [54.30327187663316]
DiTCtrl is a training-free multi-prompt video generation method under MM-DiT architectures for the first time.<n>We analyze MM-DiT's attention mechanism, finding that the 3D full attention behaves similarly to that of the cross/self-attention blocks in the UNet-like diffusion models.<n>Based on our careful design, the video generated by DiTCtrl achieves smooth transitions and consistent object motion given multiple sequential prompts.
arXiv Detail & Related papers (2024-12-24T18:51:19Z) - VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z) - VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning on Language-Video Foundation Models [43.46536102838717]
VideoDreamer is a novel framework for customized multi-subject text-to-video generation.<n>It can generate temporally consistent text-guided videos that faithfully preserve the visual features of the given multiple subjects.
arXiv Detail & Related papers (2023-11-02T04:38:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.