Human Video Translation via Query Warping
        - URL: http://arxiv.org/abs/2402.12099v2
 - Date: Tue, 21 May 2024 04:04:01 GMT
 - Title: Human Video Translation via Query Warping
 - Authors: Haiming Zhu, Yangyang Xu, Shengfeng He
 - Abstract summary: We present QueryWarp, a novel framework for temporally coherent human motion video translation.
We use appearance flows to warp the previous frame's query token, aligning it with the current frame's query.
This query warping imposes explicit constraints on the outputs of self-attention layers, effectively guaranteeing temporally coherent translation.
 - Score: 38.9185553719231
 - License: http://creativecommons.org/licenses/by-nc-sa/4.0/
 - Abstract:   In this paper, we present QueryWarp, a novel framework for temporally coherent human motion video translation. Existing diffusion-based video editing approaches rely solely on key and value tokens to ensure temporal consistency, which sacrifices the preservation of local and structural regions. In contrast, we aim to consider complementary query priors by constructing the temporal correlations among query tokens from different frames. Initially, we extract appearance flows from source poses to capture continuous human foreground motion. Subsequently, during the denoising process of the diffusion model, we employ appearance flows to warp the previous frame's query token, aligning it with the current frame's query. This query warping imposes explicit constraints on the outputs of self-attention layers, effectively guaranteeing temporally coherent translation. We perform experiments on various human motion video translation tasks, and the results demonstrate that our QueryWarp framework surpasses state-of-the-art methods both qualitatively and quantitatively.
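The core operation described in the abstract can be illustrated with a minimal sketch: warp the previous frame's query feature map with an appearance flow, blend it with the current frame's queries, and run self-attention on the result. This is not the authors' released code; the helper warp_with_flow, the blending weight alpha, and all tensor shapes are assumptions made for illustration only.

```python
# Illustrative sketch of query warping inside a self-attention layer (assumed
# shapes and names; not the QueryWarp authors' implementation).
import torch
import torch.nn.functional as F


def warp_with_flow(feat, flow):
    """Backward-warp a feature map (B, C, H, W) using a dense flow field (B, 2, H, W)."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device, dtype=feat.dtype),
        torch.arange(w, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    # Shift pixel coordinates by the flow, then normalize to [-1, 1] for grid_sample.
    gx = 2.0 * (xs.unsqueeze(0) + flow[:, 0]) / max(w - 1, 1) - 1.0
    gy = 2.0 * (ys.unsqueeze(0) + flow[:, 1]) / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)            # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)


def query_warped_self_attention(q_cur, q_prev, k, v, flow, alpha=0.5):
    """Blend warped previous-frame queries into the current queries before attention.

    q_cur, q_prev: (B, C, H, W) query features of the current / previous frame.
    k, v:          (B, N, C)    key / value tokens of the current frame (N = H * W).
    flow:          (B, 2, H, W) appearance flow from the previous to the current frame.
    alpha:         blending weight between warped and current queries (assumed value).
    """
    q_warp = warp_with_flow(q_prev, flow)           # align previous queries to this frame
    q = alpha * q_warp + (1.0 - alpha) * q_cur      # constrain the current frame's queries
    b, c, h, w = q.shape
    q = q.flatten(2).transpose(1, 2)                # (B, N, C)
    attn = torch.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)
    return attn @ v                                  # temporally constrained attention output
```

In the full framework this operation would sit inside the self-attention layers of the diffusion model during denoising; the sketch only shows the tensor-level warping and blending step under the stated assumptions.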
 
       
      
        Related papers
        - Emergent Temporal Correspondences from Video Diffusion Transformers [30.83001895223298]
We introduce DiffTrack, the first quantitative analysis framework designed to answer how temporal correspondences emerge in video diffusion transformers.
Our analysis reveals that query-key similarities in specific, but not all, layers play a critical role in temporal matching.
We extend our findings to motion-enhanced video generation with a novel guidance method that improves temporal consistency of generated videos without additional training.
arXiv  Detail & Related papers  (2025-06-20T17:59:55Z) - Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion [116.40704026922671]
First-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation.
We propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency.
arXiv  Detail & Related papers  (2025-01-15T18:59:15Z) - Discrete to Continuous: Generating Smooth Transition Poses from Sign Language Observation [45.214169930573775]
We propose a conditional diffusion model to synthesize contextually smooth transition frames.
Our approach transforms the unsupervised problem of transition frame generation into a supervised training task.
Experiments on the PHOENIX14T, USTC-CSL100, and USTC-500 datasets demonstrate the effectiveness of our method.
arXiv  Detail & Related papers  (2024-11-25T15:06:49Z) - Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models [64.2445487645478]
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio.
We present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation.
arXiv  Detail & Related papers  (2024-07-11T17:34:51Z) - Contextualized Diffusion Models for Text-Guided Image and Video Generation [67.69171154637172]
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing.
We propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample.
We generalize our model to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing.
arXiv  Detail & Related papers  (2024-02-26T15:01:16Z) - LatentWarp: Consistent Diffusion Latents for Zero-Shot Video-to-Video Translation [21.815083817914843]
We propose a new zero-shot video-to-video translation framework, named LatentWarp.
Our approach is simple: to constrain the query tokens to be temporally consistent, we further incorporate a warping operation in the latent space.
Experimental results demonstrate the superiority of LatentWarp in achieving video-to-video translation with temporal coherence.
arXiv  Detail & Related papers  (2023-11-01T08:02:57Z) - RIGID: Recurrent GAN Inversion and Editing of Real Face Videos [73.97520691413006]
GAN inversion is indispensable for applying the powerful editability of GAN to real images.
Existing methods invert video frames individually, often leading to undesired inconsistent results over time.
We propose a unified recurrent framework, named Recurrent vIdeo GAN Inversion and eDiting (RIGID).
Our framework learns the inherent coherence between input frames in an end-to-end manner.
arXiv  Detail & Related papers  (2023-08-11T12:17:24Z) - Counterfactual Cross-modality Reasoning for Weakly Supervised Video Moment Localization [67.88493779080882]
Video moment localization aims to retrieve the target segment of an untrimmed video according to the natural language query.
Recent works contrast the cross-modality similarities driven by reconstructing masked queries.
We propose a novel counterfactual cross-modality reasoning method.
arXiv  Detail & Related papers  (2023-08-10T15:45:45Z) - Towards Tokenized Human Dynamics Representation [41.75534387530019]
We study how to segment and cluster videos into recurring temporal patterns in a self-supervised way.
We evaluate the frame-wise representation learning step by Kendall's Tau and the lexicon building step by normalized mutual information and language entropy.
On the AIST++ and PKU-MMD datasets, actons bring significant performance improvements compared to several baselines.
arXiv  Detail & Related papers  (2021-11-22T18:59:58Z) - Motion-blurred Video Interpolation and Extrapolation [72.3254384191509]
We present a novel framework for deblurring, interpolating and extrapolating sharp frames from a motion-blurred video in an end-to-end manner.
To ensure temporal coherence across predicted frames and address potential temporal ambiguity, we propose a simple, yet effective flow-based rule.
arXiv  Detail & Related papers  (2021-03-04T12:18:25Z) 
        This list is automatically generated from the titles and abstracts of the papers in this site.
       
     