Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image
Diffusion Models
- URL: http://arxiv.org/abs/2310.01107v2
- Date: Sat, 24 Feb 2024 19:39:37 GMT
- Title: Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image
Diffusion Models
- Authors: Hyeonho Jeong and Jong Chul Ye
- Abstract summary: Ground-A-Video is a video-to-video translation framework for multi-attribute video editing.
It attains temporally consistent editing of input videos in a training-free manner.
Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
- Score: 65.268245109828
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent endeavors in video editing have showcased promising results in
single-attribute editing or style transfer tasks, either by training
text-to-video (T2V) models on text-video data or adopting training-free
methods. However, when confronted with the complexities of multi-attribute
editing scenarios, they exhibit shortcomings such as omitting or overlooking
intended attribute changes, modifying the wrong elements of the input video,
and failing to preserve regions of the input video that should remain intact.
To address this, here we present a novel grounding-guided video-to-video
translation framework called Ground-A-Video for multi-attribute video editing.
Ground-A-Video attains temporally consistent multi-attribute editing of input
videos in a training-free manner without the aforementioned shortcomings.
Central to our method is the introduction of Cross-Frame Gated Attention, which
incorporates grounding information into the latent representations in a
temporally consistent fashion, along with Modulated Cross-Attention and optical
flow guided inverted latents smoothing. Extensive experiments and applications
demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline
methods in terms of edit-accuracy and frame consistency. Further results and
code are available at our project page (http://ground-a-video.github.io).
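The abstract only names the mechanisms, so the sketch below gives a rough, illustrative picture of how a GLIGEN-style gated attention layer could be extended across frames so that per-frame grounding tokens (box and text embeddings) are injected into the video latents in a temporally shared way. This is not the authors' implementation; the class name, tensor shapes, and zero-initialized gate are assumptions made purely for the example.

```python
import torch
import torch.nn as nn

class CrossFrameGatedAttention(nn.Module):
    """Illustrative gated attention over visual + grounding tokens pooled from all frames."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Zero-initialized gate: the layer starts as an identity mapping, so
        # grounding guidance is blended in without retraining the base model.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, visual: torch.Tensor, grounding: torch.Tensor) -> torch.Tensor:
        # visual:    (frames, patches, dim) per-frame latent tokens
        # grounding: (frames, boxes, dim)   per-frame grounding tokens
        f, p, d = visual.shape
        # Pool visual and grounding tokens from every frame into one sequence,
        # so each frame attends to the groundings of all frames (cross-frame).
        tokens = torch.cat([visual, grounding], dim=1).reshape(1, -1, d)
        attended, _ = self.attn(tokens, tokens, tokens)
        attended = attended.reshape(f, -1, d)[:, :p, :]  # keep visual positions only
        visual = visual + torch.tanh(self.gate) * attended
        return visual + torch.tanh(self.gate) * self.ff(self.norm(visual))

# Example: 8 frames, 256 latent patches per frame, 4 grounding boxes, latent dim 320.
layer = CrossFrameGatedAttention(dim=320, num_heads=8)
edited = layer(torch.randn(8, 256, 320), torch.randn(8, 4, 320))
print(edited.shape)  # torch.Size([8, 256, 320])
```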
Related papers
- I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models [18.36472998650704]
We introduce a novel and generic solution that extends the applicability of image editing tools to videos by propagating edits from a single frame to the entire video using a pre-trained image-to-video model.
Our method, dubbed I2VEdit, adaptively preserves the visual and motion integrity of the source video depending on the extent of the edits.
arXiv Detail & Related papers (2024-05-26T11:47:40Z)
- FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing [65.60744699017202]
We introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing.
Our method, FLATTEN, enforces the patches on the same flow path across different frames to attend to each other in the attention module.
Results on existing text-to-video editing benchmarks show that our proposed method achieves the new state-of-the-art performance.
arXiv Detail & Related papers (2023-10-09T17:59:53Z)
- StableVideo: Text-driven Consistency-aware Diffusion Video Editing [24.50933856309234]
Diffusion-based methods can generate realistic images and videos, but they struggle to edit existing objects in a video while preserving their appearance over time.
This paper introduces temporal dependency to existing text-driven diffusion models, which allows them to generate consistent appearance for the edited objects.
We build up a text-driven video editing framework based on this mechanism, namely StableVideo, which can achieve consistency-aware video editing.
arXiv Detail & Related papers (2023-08-18T14:39:16Z)
- VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing [18.24307442582304]
We introduce VidEdit, a novel method for zero-shot text-based video editing.
Our experiments show that VidEdit outperforms state-of-the-art methods on DAVIS dataset.
arXiv Detail & Related papers (2023-06-14T19:15:49Z)
- InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z)
- Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models [68.31777975873742]
Recent attempts at video editing require significant text-to-video data and computation resources for training.
We propose vid2vid-zero, a simple yet effective method for zero-shot video editing.
Experiments and analyses show promising results in editing attributes, subjects, places, etc., in real-world videos.
arXiv Detail & Related papers (2023-03-30T17:59:25Z)
- FateZero: Fusing Attentions for Zero-shot Text-based Video Editing [104.27329655124299]
We propose FateZero, a zero-shot text-based editing method for real-world videos that requires no per-prompt training or user-specific mask.
Our method is the first one to show the ability of zero-shot text-driven video style and local attribute editing from the trained text-to-image model.
arXiv Detail & Related papers (2023-03-16T17:51:13Z)
- Edit-A-Video: Single Video Editing with Object-Aware Consistency [49.43316939996227]
We propose a video editing framework given only a pretrained TTI model and a single <text, video> pair, which we term Edit-A-Video.
The framework consists of two stages: (1) inflating the 2D model into a 3D model by appending temporal modules and tuning on the source video, and (2) inverting the source video into noise and editing with the target text prompt and attention map injection.
We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.
arXiv Detail & Related papers (2023-03-14T14:35:59Z)