RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models
- URL: http://arxiv.org/abs/2312.04524v1
- Date: Thu, 7 Dec 2023 18:43:45 GMT
- Title: RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models
- Authors: Ozgur Kara, Bariscan Kurtkaya, Hidir Yesiltepe, James M. Rehg, Pinar Yanardag
- Abstract summary: RAVE is a zero-shot video editing method that leverages pre-trained text-to-image diffusion models without additional training.
It produces high-quality videos while preserving original motion and semantic structure.
RAVE is capable of a wide range of edits, from local attribute modifications to shape transformations.
- Score: 19.792535444735957
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in diffusion-based models have demonstrated significant
success in generating images from text. However, video editing models have not
yet reached the same level of visual quality and user control. To address this,
we introduce RAVE, a zero-shot video editing method that leverages pre-trained
text-to-image diffusion models without additional training. RAVE takes an input
video and a text prompt to produce high-quality videos while preserving the
original motion and semantic structure. It employs a novel noise shuffling
strategy, leveraging spatio-temporal interactions between frames, to produce
temporally consistent videos faster than existing methods. It is also efficient
in terms of memory requirements, allowing it to handle longer videos. RAVE is
capable of a wide range of edits, from local attribute modifications to shape
transformations. In order to demonstrate the versatility of RAVE, we create a
comprehensive video evaluation dataset ranging from object-focused scenes to
complex human activities like dancing and typing, and dynamic scenes featuring
swimming fish and boats. Our qualitative and quantitative experiments highlight
the effectiveness of RAVE in diverse video editing scenarios compared to
existing methods. Our code, dataset and videos can be found in
https://rave-video.github.io.
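The noise shuffling strategy described in the abstract can be pictured with a short sketch. The code below is an illustrative reconstruction, not the authors' implementation: the function name `shuffled_denoising`, the stand-in `denoise_grid` callable, the 3x3 grid size, and the tensor layout are assumptions made for the example; RAVE's actual grid construction, conditioning, and attention handling follow the paper and the linked repository.
```python
# Hypothetical sketch of RAVE-style randomized noise shuffling (not the authors' code).
# Assumptions: frames are already encoded to latents of shape [F, C, H, W], and
# `denoise_grid` stands in for one denoising step of a pre-trained text-to-image
# diffusion model applied to a tiled grid image.
import torch

def shuffled_denoising(latents, denoise_grid, timesteps, grid=3):
    """Denoise video latents by tiling randomly shuffled subsets of frames into grids.

    latents:      [F, C, H, W] noisy frame latents (F assumed divisible by grid*grid).
    denoise_grid: callable(grid_latent [C, grid*H, grid*W], t) -> denoised grid latent.
    timesteps:    iterable of diffusion timesteps, from noisy to clean.
    """
    f, c, h, w = latents.shape
    per_grid = grid * grid
    assert f % per_grid == 0, "sketch assumes F is a multiple of grid*grid"

    for t in timesteps:
        # Randomly permute frames so each grid mixes frames from across the video,
        # letting the model's self-attention share spatio-temporal information.
        perm = torch.randperm(f)
        shuffled = latents[perm]

        out = torch.empty_like(shuffled)
        for start in range(0, f, per_grid):
            chunk = shuffled[start:start + per_grid]                 # [per_grid, C, H, W]
            tiles = chunk.reshape(grid, grid, c, h, w)
            grid_latent = tiles.permute(2, 0, 3, 1, 4).reshape(c, grid * h, grid * w)
            denoised = denoise_grid(grid_latent, t)                  # one T2I denoising step
            tiles = denoised.reshape(c, grid, h, grid, w).permute(1, 3, 0, 2, 4)
            out[start:start + per_grid] = tiles.reshape(per_grid, c, h, w)

        # Undo the shuffle so frame order is preserved for the next step.
        latents[perm] = out
    return latents

# Example with a stand-in denoiser that just scales the latents (illustration only):
frames = torch.randn(9, 4, 64, 64)
result = shuffled_denoising(frames, lambda g, t: 0.9 * g, timesteps=range(10, 0, -1))
```
Drawing a fresh permutation at every denoising step is what spreads information across all frames over the course of sampling, rather than only within fixed groups of frames.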
Related papers
- WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z)
- COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video.
We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing.
COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z)
- BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models [40.73982918337828]
We propose a training-free general-purpose video synthesis framework, coined BIVDiff, which bridges specific image diffusion models and general text-to-video foundation diffusion models.
Specifically, we first use a specific image diffusion model (e.g., ControlNet and Instruct Pix2Pix) for frame-wise video generation, then perform Mixed Inversion on the generated video, and finally input the inverted latents into the video diffusion models.
arXiv Detail & Related papers (2023-12-05T14:56:55Z)
- VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models [96.55004961251889]
Video Instruction Diffusion (VIDiff) is a unified foundation model designed for a wide range of video tasks.
Our model can edit and translate the desired results within seconds based on user instructions.
We provide convincing generative results for diverse input videos and written instructions, both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-11-30T18:59:52Z)
- LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation [44.220329202024494]
We present a few-shot-based tuning framework, LAMP, which enables a text-to-image diffusion model to Learn A specific Motion Pattern with 8-16 videos on a single GPU.
Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation.
To capture features along the temporal dimension, we expand the pretrained 2D convolution layers of the T2I model into our novel temporal-spatial motion learning layers (a generic sketch of this kind of layer inflation appears after this list).
arXiv Detail & Related papers (2023-10-16T19:03:19Z)
- Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [65.268245109828]
Ground-A-Video is a video-to-video translation framework for multi-attribute video editing.
It attains temporally consistent editing of input videos in a training-free manner.
Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
arXiv Detail & Related papers (2023-10-02T11:28:37Z)
- TokenFlow: Consistent Diffusion Features for Consistent Video Editing [27.736354114287725]
We present a framework that harnesses the power of a text-to-image diffusion model for the task of text-driven video editing.
Our method generates a high-quality video that adheres to the target text, while preserving the spatial layout and motion of the input video.
Our framework does not require any training or fine-tuning, and can work in conjunction with any off-the-shelf text-to-image editing method.
arXiv Detail & Related papers (2023-07-19T18:00:03Z)
- Style-A-Video: Agile Diffusion for Arbitrary Text-based Video Style Transfer [13.098901971644656]
This paper proposes a zero-shot video stylization method named Style-A-Video.
It uses a generative pre-trained transformer together with an image latent diffusion model to achieve concise text-controlled video stylization.
Tests show that we can attain superior content preservation and stylistic performance while incurring lower resource consumption than previous solutions.
arXiv Detail & Related papers (2023-05-09T14:03:27Z)
- Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models [68.31777975873742]
Recent attempts at video editing require significant text-to-video data and computation resources for training.
We propose vid2vid-zero, a simple yet effective method for zero-shot video editing.
Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos.
arXiv Detail & Related papers (2023-03-30T17:59:25Z)
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z)
- Pix2Video: Video Editing using Image Diffusion [43.07444438561277]
We investigate how to use pre-trained image models for text-guided video editing.
Our method works in two simple steps: first, we use a pre-trained structure-guided (e.g., depth) image diffusion model to perform text-guided edits on an anchor frame.
We demonstrate that realistic text-guided video edits are possible, without any compute-intensive preprocessing or video-specific finetuning.
arXiv Detail & Related papers (2023-03-22T16:36:10Z)
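Several entries above adapt a pre-trained text-to-image model to video by adding temporal layers on top of its 2D convolutions; LAMP's temporal-spatial motion learning layers, referenced earlier, are one variant. The sketch below is a generic pseudo-3D inflation written for illustration only: the class name `TemporalSpatialConv`, the temporal kernel size of 3, and the identity (Dirac) initialization are assumptions, not the layer design used in LAMP.
```python
# Hypothetical sketch of inflating a pretrained 2D conv with a temporal conv,
# in the spirit of temporal-spatial motion learning layers (not the paper's code).
import torch
import torch.nn as nn

class TemporalSpatialConv(nn.Module):
    """Wrap a pretrained spatial Conv2d and add a temporal Conv1d over frames."""

    def __init__(self, spatial_conv: nn.Conv2d, kernel_t: int = 3):
        super().__init__()
        self.spatial = spatial_conv  # pretrained weights, frozen or fine-tuned
        ch = spatial_conv.out_channels
        self.temporal = nn.Conv1d(ch, ch, kernel_t, padding=kernel_t // 2)
        # Initialize the temporal conv as an identity so the inflated model
        # starts out behaving exactly like the pretrained image model.
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, frames, channels, height, width]
        b, f, c, h, w = x.shape
        y = self.spatial(x.reshape(b * f, c, h, w))             # per-frame spatial conv
        _, c2, h2, w2 = y.shape
        y = y.reshape(b, f, c2, h2, w2).permute(0, 3, 4, 2, 1)  # [b, h, w, c, f]
        y = self.temporal(y.reshape(b * h2 * w2, c2, f))        # mix along the time axis
        return y.reshape(b, h2, w2, c2, f).permute(0, 4, 3, 1, 2)

# Usage (illustrative only): inflate a single conv and run a dummy video batch.
inflated = TemporalSpatialConv(nn.Conv2d(4, 8, 3, padding=1))
video = torch.randn(1, 16, 4, 32, 32)  # [batch, frames, channels, H, W]
print(inflated(video).shape)           # torch.Size([1, 16, 8, 32, 32])
```
Initializing the temporal convolution as an identity is a common trick: the inflated network initially reproduces the pretrained image model and only learns motion during fine-tuning.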
This list is automatically generated from the titles and abstracts of the papers on this site.