Related papers: AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks

AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks

URL: http://arxiv.org/abs/2403.14468v4
Date: Sun, 03 Nov 2024 21:16:54 GMT
Title: AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks
Authors: Max Ku, Cong Wei, Weiming Ren, Harry Yang, Wenhu Chen,
Abstract summary: We introduce AnyV2V, a novel tuning-free paradigm designed to simplify video editing. AnyV2V can leverage any existing image editing tools to support an extensive array of video editing tasks. Our evaluation shows that AnyV2V achieved CLIP-scores comparable to other baseline methods.
Score: 41.640692114423544
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In the dynamic field of digital content creation using generative models, state-of-the-art video editing models still do not offer the level of quality and control that users desire. Previous works on video editing either extended from image-based generative models in a zero-shot manner or necessitated extensive fine-tuning, which can hinder the production of fluid video edits. Furthermore, these methods frequently rely on textual input as the editing guidance, leading to ambiguities and limiting the types of edits they can perform. Recognizing these challenges, we introduce AnyV2V, a novel tuning-free paradigm designed to simplify video editing into two primary steps: (1) employing an off-the-shelf image editing model to modify the first frame, (2) utilizing an existing image-to-video generation model to generate the edited video through temporal feature injection. AnyV2V can leverage any existing image editing tools to support an extensive array of video editing tasks, including prompt-based editing, reference-based style transfer, subject-driven editing, and identity manipulation, which were unattainable by previous methods. AnyV2V can also support any video length. Our evaluation shows that AnyV2V achieved CLIP-scores comparable to other baseline methods. Furthermore, AnyV2V significantly outperformed these baselines in human evaluations, demonstrating notable improvements in visual consistency with the source video while producing high-quality edits across all editing tasks.

Related papers

InsViE-1M: Effective Instruction-based Video Editing with Elaborate Dataset Construction [10.855393943204728]
We present a high-quality Instruction-based Video Editing dataset with 1M triplets, namely InsViE-1M. We first curate high-resolution and high-quality source videos and images, then design an effective editing-filtering pipeline to construct high-quality editing triplets for model training.
arXiv Detail & Related papers (2025-03-26T07:30:58Z)
V2Edit: Versatile Video Diffusion Editor for Videos and 3D Scenes [29.80140472486948]
V$2$Edit is a training-free framework for instruction-guided video and 3D scene editing. We introduce a progressive strategy that decomposes complex editing tasks into simpler subtasks. We extend V$2$Edit to 3D scene editing via a "render-edit-reconstruct" process, enabling high-quality, 3D-consistent edits.
arXiv Detail & Related papers (2025-03-13T17:59:55Z)
DIVE: Taming DINO for Subject-Driven Video Editing [49.090071984272576]
DINO-guided Video Editing (DIVE) is a framework designed to facilitate subject-driven editing in source videos. DIVE employs DINO features to align with the motion trajectory of the source video. For precise subject editing, DIVE incorporates the DINO features of reference images into a pretrained text-to-image model.
arXiv Detail & Related papers (2024-12-04T14:28:43Z)
AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea [88.79769371584491]
We present AnyEdit, a comprehensive multi-modal instruction editing dataset. We ensure the diversity and quality of the AnyEdit collection through three aspects: initial data diversity, adaptive editing process, and automated selection of editing results. Experiments on three benchmark datasets show that AnyEdit consistently boosts the performance of diffusion-based editing models.
arXiv Detail & Related papers (2024-11-24T07:02:56Z)
StableV2V: Stablizing Shape Consistency in Video-to-Video Editing [11.09708780767668]
We present a shape-consistent video editing method, namely StableV2V, in this paper. Our method decomposes the entire editing pipeline into several sequential procedures, where it edits the first video frame, then establishes an alignment between the delivered motions and user prompts, and eventually propagates the edited contents to all other frames based on such alignment. Experimental results and analyses illustrate the outperforming performance, visual consistency, and inference efficiency of our method compared to existing state-of-the-art studies.
arXiv Detail & Related papers (2024-11-17T11:48:01Z)
RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives [58.15403987979496]
This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework. Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content. The proposed framework demonstrates impressive versatile capabilities in video-to-paragraph generation, video content editing, and can be incorporated into other SoTA video generative models for further enhancement.
arXiv Detail & Related papers (2024-05-28T17:46:36Z)
I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models [18.36472998650704]
We introduce a novel and generic solution that extends the applicability of image editing tools to videos by propagating edits from a single frame to the entire video using a pre-trained image-to-video model. Our method, dubbed I2VEdit, adaptively preserves the visual and motion integrity of the source video depending on the extent of the edits.
arXiv Detail & Related papers (2024-05-26T11:47:40Z)
GenVideo: One-shot Target-image and Shape Aware Video Editing using T2I Diffusion Models [2.362412515574206]
We propose "GenVideo" for editing videos leveraging target-image aware T2I models. Our approach handles edits with target objects of varying shapes and sizes while maintaining the temporal consistency of the edit.
arXiv Detail & Related papers (2024-04-18T23:25:27Z)
Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [65.268245109828]
Ground-A-Video is a video-to-video translation framework for multi-attribute video editing. It attains temporally consistent editing of input videos in a training-free manner. Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
arXiv Detail & Related papers (2023-10-02T11:28:37Z)
Edit-A-Video: Single Video Editing with Object-Aware Consistency [49.43316939996227]
We propose a video editing framework given only a pretrained TTI model and a single text, video> pair, which we term Edit-A-Video. The framework consists of two stages: (1) inflating the 2D model into the 3D model by appending temporal modules tuning and on the source video (2) inverting the source video into the noise and editing with target text prompt and attention map injection. We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.
arXiv Detail & Related papers (2023-03-14T14:35:59Z)
Video-P2P: Video Editing with Cross-attention Control [68.64804243427756]
Video-P2P is a novel framework for real-world video editing with cross-attention control. Video-P2P works well on real-world videos for generating new characters while optimally preserving their original poses and scenes.
arXiv Detail & Related papers (2023-03-08T17:53:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.