VidToMe: Video Token Merging for Zero-Shot Video Editing
- URL: http://arxiv.org/abs/2312.10656v2
- Date: Tue, 19 Dec 2023 13:54:15 GMT
- Title: VidToMe: Video Token Merging for Zero-Shot Video Editing
- Authors: Xirui Li, Chao Ma, Xiaokang Yang, Ming-Hsuan Yang
- Abstract summary: We propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames.
Our method improves temporal coherence and reduces memory consumption in self-attention computations.
- Score: 100.79999871424931
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have made significant advances in generating high-quality
images, but their application to video generation has remained challenging due
to the complexity of temporal motion. Zero-shot video editing offers a solution
by utilizing pre-trained image diffusion models to translate source videos into
new ones. Nevertheless, existing methods struggle to maintain strict temporal
consistency and efficient memory consumption. In this work, we propose a novel
approach to enhance temporal consistency in generated videos by merging
self-attention tokens across frames. By aligning and compressing temporally
redundant tokens across frames, our method improves temporal coherence and
reduces memory consumption in self-attention computations. The merging strategy
matches and aligns tokens according to the temporal correspondence between
frames, facilitating natural temporal consistency in generated video frames. To
manage the complexity of video processing, we divide videos into chunks and
develop intra-chunk local token merging and inter-chunk global token merging,
ensuring both short-term video continuity and long-term content consistency.
Our video editing approach seamlessly extends the advancements in image editing
to video editing, rendering favorable results in temporal consistency over
state-of-the-art methods.
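To make the core idea of the abstract concrete, below is a minimal, self-contained sketch (not the authors' released code) of merging temporally redundant self-attention tokens between two frames, in the spirit of ToMe-style bipartite matching. Tensor shapes, the merge count `r`, the function name `merge_tokens_across_frames`, and the simple averaging rule are illustrative assumptions; the paper additionally applies such pairing within chunks (local merging) and against a shared reference across chunks (global merging), which this sketch omits.

```python
# Sketch only: pairwise temporal token merging under the assumptions stated above.
import torch
import torch.nn.functional as F

def merge_tokens_across_frames(src: torch.Tensor, dst: torch.Tensor, r: int):
    """src: tokens of the current frame, (N, C); dst: tokens of a reference frame, (M, C).
    Merges the r src tokens most similar to some dst token, so fewer tokens enter self-attention."""
    # Cosine similarity between every src token and every dst token.
    sim = F.normalize(src, dim=-1) @ F.normalize(dst, dim=-1).T   # (N, M)
    best_sim, best_dst = sim.max(dim=-1)                          # best reference match per src token
    merge_idx = best_sim.topk(r).indices                          # the r most redundant src tokens
    keep_mask = torch.ones(src.shape[0], dtype=torch.bool)
    keep_mask[merge_idx] = False

    # Fold each merged src token into its matched reference token (simple average;
    # duplicate matches are resolved naively here for brevity).
    dst = dst.clone()
    dst[best_dst[merge_idx]] = 0.5 * (dst[best_dst[merge_idx]] + src[merge_idx])
    return src[keep_mask], dst

# Usage with illustrative sizes: 256 tokens of 64 channels per frame, merging 64 of them.
frame_prev = torch.randn(256, 64)
frame_curr = torch.randn(256, 64)
kept, merged_ref = merge_tokens_across_frames(frame_curr, frame_prev, r=64)
print(kept.shape, merged_ref.shape)  # torch.Size([192, 64]) torch.Size([256, 64])
```

Because the kept token set shrinks with each merge, the quadratic self-attention cost and memory drop accordingly, while matched tokens share representations across frames, which is what encourages temporal consistency.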
Related papers
- Anchored Diffusion for Video Face Reenactment [17.343307538702238]
We introduce Anchored Diffusion, a novel method for synthesizing relatively long and seamless videos.
We train our model on video sequences with random non-uniform temporal spacing and incorporate temporal information via external guidance.
During inference, we leverage the transformer architecture to modify the diffusion process, generating a batch of non-uniform sequences anchored to a common frame.
arXiv Detail & Related papers (2024-07-21T13:14:17Z)
- COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video.
We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing.
COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z)
- Lumiere: A Space-Time Diffusion Model for Video Generation [75.54967294846686]
We introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once.
This is in contrast to existing video models, which synthesize distant keyframes followed by temporal super-resolution.
By deploying both spatial and (importantly) temporal down- and up-sampling, our model learns to directly generate a full-frame-rate, low-resolution video.
arXiv Detail & Related papers (2024-01-23T18:05:25Z)
- LatentWarp: Consistent Diffusion Latents for Zero-Shot Video-to-Video Translation [21.815083817914843]
We propose a new zero-shot video-to-video translation framework, named LatentWarp.
Our approach is simple: to constrain the query tokens to be temporally consistent, we further incorporate a warping operation in the latent space.
Experimental results demonstrate the superiority of LatentWarp in achieving video-to-video translation with temporal coherence.
arXiv Detail & Related papers (2023-11-01T08:02:57Z)
- VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing [18.24307442582304]
We introduce VidEdit, a novel method for zero-shot text-based video editing.
Our experiments show that VidEdit outperforms state-of-the-art methods on the DAVIS dataset.
arXiv Detail & Related papers (2023-06-14T19:15:49Z)
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z)
- Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing videos of arbitrary length, from a few frames to even infinite, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z)
- Video Demoireing with Relation-Based Temporal Consistency [68.20281109859998]
Moire patterns, appearing as color distortions, severely degrade image and video quality when filming a screen with digital cameras.
We study how to remove such undesirable moire patterns in videos, namely video demoireing.
arXiv Detail & Related papers (2022-04-06T17:45:38Z)