LatentWarp: Consistent Diffusion Latents for Zero-Shot Video-to-Video
Translation
- URL: http://arxiv.org/abs/2311.00353v1
- Date: Wed, 1 Nov 2023 08:02:57 GMT
- Title: LatentWarp: Consistent Diffusion Latents for Zero-Shot Video-to-Video
Translation
- Authors: Yuxiang Bao, Di Qiu, Guoliang Kang, Baochang Zhang, Bo Jin, Kaiye
Wang, Pengfei Yan
- Abstract summary: We propose a new zero-shot video-to-video translation framework, named \textit{LatentWarp}.
Our approach is simple: to constrain the query tokens to be temporally consistent, we incorporate a warping operation in the latent space.
Experiment results demonstrate the superiority of \textit{LatentWarp} in achieving video-to-video translation with temporal coherence.
- Score: 21.815083817914843
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Leveraging the generative ability of image diffusion models offers great
potential for zero-shot video-to-video translation. The key lies in how to
maintain temporal consistency across generated video frames by image diffusion
models. Previous methods typically adopt cross-frame attention, \emph{i.e.,}
sharing the \textit{key} and \textit{value} tokens across attentions of
different frames, to encourage the temporal consistency. However, in those
works, temporal inconsistency issue may not be thoroughly solved, rendering the
fidelity of generated videos limited.%The current state of the art cross-frame
attention method aims at maintaining fine-grained visual details across frames,
but it is still challenged by the temporal coherence problem. In this paper, we
find the bottleneck lies in the unconstrained query tokens and propose a new
zero-shot video-to-video translation framework, named \textit{LatentWarp}. Our
approach is simple: to constrain the query tokens to be temporally consistent,
we incorporate a warping operation in the latent space. Specifically, based on
the optical flow obtained from the original video, we warp the generated latent
features of the last frame to align
with the current frame during the denoising process. As a result, the
corresponding regions across the adjacent frames can share closely-related
query tokens and attention outputs, which can further improve latent-level
consistency to enhance visual temporal coherence of generated videos. Extensive
experiment results demonstrate the superiority of \textit{LatentWarp} in
achieving video-to-video translation with temporal coherence.
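The mechanism the abstract describes is easy to make concrete. Below is a minimal PyTorch sketch of the two ingredients, assuming optical flow has already been estimated from the original video with an off-the-shelf estimator and resized to the latent resolution; the function names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def warp_latent(prev_latent: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp the previous frame's latent (B, C, H, W) toward the current frame
    using backward optical flow (B, 2, H, W) given in latent-pixel units."""
    B, _, H, W = prev_latent.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=prev_latent.device, dtype=prev_latent.dtype),
        torch.arange(W, device=prev_latent.device, dtype=prev_latent.dtype),
        indexing="ij",
    )
    # Displace the base sampling grid by the flow, then normalize to [-1, 1]
    # as required by grid_sample.
    grid_x = 2.0 * (xs + flow[:, 0]) / (W - 1) - 1.0
    grid_y = 2.0 * (ys + flow[:, 1]) / (H - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2), (x, y) order
    return F.grid_sample(prev_latent, grid, align_corners=True)

def cross_frame_attention(q_cur: torch.Tensor, k_ref: torch.Tensor,
                          v_ref: torch.Tensor) -> torch.Tensor:
    """Cross-frame attention: the current frame's queries attend to a
    reference frame's shared keys/values. All shapes are (B, N, D)."""
    scale = q_cur.shape[-1] ** -0.5
    attn = torch.softmax(q_cur @ k_ref.transpose(-2, -1) * scale, dim=-1)
    return attn @ v_ref
```

At each denoising step, the previous generated frame's latent would be warped with warp_latent and blended into the current frame's latent over flow-covered regions, so the query tokens projected from it stay closely related to those of the previous frame; occluded regions, where the flow is invalid, fall back to the current latent.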
Related papers
- FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance [47.88160253507823]
We introduce FancyVideo, an innovative video generator that improves the existing text-control mechanism with a Cross-frame Textual Guidance Module (CTGM).
CTGM incorporates a Temporal Information Injector (TII), a Temporal Affinity Refiner (TAR), and a Temporal Feature Booster (TFB) at the beginning, middle, and end of cross-attention, respectively.
arXiv Detail & Related papers (2024-08-15T14:47:44Z)
- Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models [64.2445487645478]
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio.
We present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation.
arXiv Detail & Related papers (2024-07-11T17:34:51Z)
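For intuition, a uni-directional temporal attention of the kind Live2Diff targets can be sketched as temporal self-attention with a causal mask over the frame axis, so no frame ever attends to the future. This is an illustrative reading of the idea, not the paper's implementation:

```python
import torch

def unidirectional_temporal_attention(x: torch.Tensor) -> torch.Tensor:
    """x: (B, T, N, D) -- batch, frames, tokens per frame, channels.
    Each spatial token attends over the frame axis, but only to the
    current and earlier frames, never to future ones."""
    B, T, N, D = x.shape
    q = k = v = x.permute(0, 2, 1, 3)             # (B, N, T, D): time is the sequence axis
    scores = q @ k.transpose(-2, -1) * D ** -0.5  # (B, N, T, T)
    future = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))  # mask out future frames
    out = torch.softmax(scores, dim=-1) @ v
    return out.permute(0, 2, 1, 3)                # back to (B, T, N, D)
```

The causal structure is what makes streaming possible: a new frame can be denoised as it arrives, without waiting for frames that have not been captured yet.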
- FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation [85.29772293776395]
We introduce FRESCO, which combines intra-frame correspondence with inter-frame correspondence to establish a more robust spatial-temporal constraint.
This enhancement ensures a more consistent transformation of semantically similar content across frames.
Our approach involves an explicit update of features to achieve high spatial-temporal consistency with the input video.
arXiv Detail & Related papers (2024-03-19T17:59:18Z)
- Human Video Translation via Query Warping [38.9185553719231]
We present QueryWarp, a novel framework for temporally coherent human motion video translation.
We use appearance flows to warp the previous frame's query tokens, aligning them with the current frame's queries.
This query warping imposes explicit constraints on the outputs of self-attention layers, effectively guaranteeing temporally coherent translation.
arXiv Detail & Related papers (2024-02-19T12:28:45Z)
- VidToMe: Video Token Merging for Zero-Shot Video Editing [100.79999871424931]
We propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames.
Our method improves temporal coherence and reduces memory consumption in self-attention computations.
arXiv Detail & Related papers (2023-12-17T09:05:56Z)
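As a rough illustration of cross-frame token merging (a simplified stand-in for VidToMe's actual matching scheme), each token of the current frame can be paired with its most similar token in the previous frame, and sufficiently similar pairs averaged before self-attention:

```python
import torch
import torch.nn.functional as F

def merge_tokens_across_frames(cur: torch.Tensor, prev: torch.Tensor,
                               threshold: float = 0.9):
    """Greedy cross-frame merge: cur (N, D) and prev (M, D) are the
    self-attention tokens of two adjacent frames. Each current token is
    matched to its most similar previous token; pairs above the cosine
    similarity threshold are averaged into one shared token."""
    sim = F.normalize(cur, dim=-1) @ F.normalize(prev, dim=-1).T  # (N, M)
    best_sim, best_idx = sim.max(dim=-1)
    mask = best_sim > threshold          # which current tokens get merged
    merged = cur.clone()
    merged[mask] = 0.5 * (cur[mask] + prev[best_idx[mask]])
    return merged, mask
```

Sharing one merged token across near-duplicate positions is what yields both the coherence gain and the reduced self-attention cost.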
- RIGID: Recurrent GAN Inversion and Editing of Real Face Videos [73.97520691413006]
GAN inversion is indispensable for applying the powerful editability of GANs to real images.
Existing methods invert video frames individually, often leading to undesired inconsistent results over time.
We propose a unified recurrent framework, named Recurrent vIdeo GAN Inversion and eDiting (RIGID).
Our framework learns the inherent coherence between input frames in an end-to-end manner.
arXiv Detail & Related papers (2023-08-11T12:17:24Z)
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z)
- Progressive Temporal Feature Alignment Network for Video Inpainting [51.26380898255555]
Video inpainting aims to fill spatio-temporal "corrupted regions" with plausible content.
Current methods achieve this goal through attention, flow-based warping, or 3D temporal convolution.
We propose 'Progressive Temporal Feature Alignment Network', which progressively enriches features extracted from the current frame with the warped feature from neighbouring frames.
arXiv Detail & Related papers (2021-04-08T04:50:33Z)