VACE: All-in-One Video Creation and Editing
- URL: http://arxiv.org/abs/2503.07598v2
- Date: Tue, 11 Mar 2025 06:44:25 GMT
- Title: VACE: All-in-One Video Creation and Editing
- Authors: Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, Yu Liu
- Abstract summary: We introduce VACE, which enables users to perform Video tasks within an All-in-one framework for Creation and Editing.
- Score: 18.809248697934397
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion Transformer has demonstrated powerful capability and scalability in generating high-quality images and videos. Further pursuing the unification of generation and editing tasks has yielded significant progress in the domain of image content creation. However, due to the intrinsic demands for consistency across both temporal and spatial dynamics, achieving a unified approach for video synthesis remains challenging. We introduce VACE, which enables users to perform Video tasks within an All-in-one framework for Creation and Editing. These tasks include reference-to-video generation, video-to-video editing, and masked video-to-video editing. Specifically, we effectively integrate the requirements of various tasks by organizing video task inputs, such as editing, reference, and masking, into a unified interface referred to as the Video Condition Unit (VCU). Furthermore, by utilizing a Context Adapter structure, we inject different task concepts into the model using formalized representations of temporal and spatial dimensions, allowing it to handle arbitrary video synthesis tasks flexibly. Extensive experiments demonstrate that the unified model of VACE achieves performance on par with task-specific models across various subtasks. Simultaneously, it enables diverse applications through versatile task combinations. Project page: https://ali-vilab.github.io/VACE-Page/.
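The abstract describes the Video Condition Unit (VCU) only at a high level: a single interface that bundles editing frames, reference images, and masks. The sketch below is a minimal, hypothetical Python container illustrating that idea; the class name, field layout, and task-flag logic are assumptions for illustration, not the paper's actual API.

```python
# Hypothetical sketch of a VCU-style conditioning container, based only on the
# abstract's description; names, shapes, and the task-routing rules are assumptions.
from dataclasses import dataclass, field
from typing import List, Optional
import numpy as np


@dataclass
class VideoConditionUnit:
    """Bundles the conditioning inputs mentioned in the abstract into one interface."""
    prompt: str                                    # text describing the target video
    frames: Optional[np.ndarray] = None            # (T, H, W, 3) source frames for video-to-video editing
    masks: Optional[np.ndarray] = None             # (T, H, W) binary masks for masked editing
    references: List[np.ndarray] = field(default_factory=list)  # reference images for reference-to-video

    def task_flags(self) -> dict:
        """Derive which subtask is requested from which inputs are provided."""
        return {
            "reference_to_video": bool(self.references) and self.frames is None,
            "video_to_video": self.frames is not None and self.masks is None,
            "masked_video_to_video": self.frames is not None and self.masks is not None,
        }


# Example: a masked video-to-video editing request.
vcu = VideoConditionUnit(
    prompt="replace the red car with a blue bicycle",
    frames=np.zeros((16, 480, 832, 3), dtype=np.uint8),
    masks=np.zeros((16, 480, 832), dtype=np.uint8),
)
print(vcu.task_flags())
```

The point of the single container is that downstream components (such as the Context Adapter the abstract mentions) can dispatch on which conditioning inputs are present rather than exposing a separate interface per task.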
Related papers
- VEGGIE: Instructional Editing and Reasoning of Video Concepts with Grounded Generation [67.31149310468801]
We introduce VEGGIE, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions.
VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model.
arXiv Detail & Related papers (2025-03-18T15:31:12Z)
- Get In Video: Add Anything You Want to the Video [48.06070610416688]
Video editing increasingly demands the ability to incorporate specific real-world instances into existing footage.
Current approaches fail to capture the unique visual characteristics of particular subjects and ensure natural instance/scene interactions.
We introduce "Get-In-Video Editing", where users provide reference images to precisely specify visual elements they wish to incorporate into videos.
arXiv Detail & Related papers (2025-03-08T16:27:53Z)
- UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics [74.10447111842504]
UniReal is a unified framework designed to address various image generation and editing tasks.
Inspired by recent video generation models, we propose a unifying approach that treats image-level tasks as discontinuous video generation.
Although designed for image-level tasks, we leverage videos as a scalable source for universal supervision.
arXiv Detail & Related papers (2024-12-10T18:59:55Z)
- SPAgent: Adaptive Task Decomposition and Model Selection for General Video Generation and Editing [50.098005973600024]
We propose a novel video generation and editing system powered by our Semantic Planning Agent (SPAgent).
SPAgent bridges the gap between diverse user intents and the effective utilization of existing generative models.
Experimental results demonstrate that the SPAgent effectively coordinates models to generate or edit videos.
arXiv Detail & Related papers (2024-11-28T08:07:32Z)
- OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z)
- Streaming Video Model [90.24390609039335]
We propose to unify video understanding tasks into one streaming video architecture, referred to as the Streaming Vision Transformer (S-ViT).
S-ViT first produces frame-level features with a memory-enabled temporally-aware spatial encoder to serve frame-based video tasks.
The efficiency and efficacy of S-ViT are demonstrated by its state-of-the-art accuracy in sequence-based action recognition.
arXiv Detail & Related papers (2023-03-30T08:51:49Z)
- PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data [85.48684148629634]
We propose an approach to leverage synthetic scene data for improving video understanding.
We present a multi-task prompt learning approach for video transformers.
We show strong performance improvements on multiple video understanding tasks and datasets.
arXiv Detail & Related papers (2022-12-08T18:55:31Z)