DiffDub: Person-generic Visual Dubbing Using Inpainting Renderer with
Diffusion Auto-encoder
- URL: http://arxiv.org/abs/2311.01811v2
- Date: Fri, 12 Jan 2024 10:25:33 GMT
- Title: DiffDub: Person-generic Visual Dubbing Using Inpainting Renderer with
Diffusion Auto-encoder
- Authors: Tao Liu, Chenpeng Du, Shuai Fan, Feilong Chen, Kai Yu
- Abstract summary: We propose DiffDub: Diffusion-based dubbing.
We first craft the Diffusion auto-encoder by an inpainting renderer incorporating a mask to delineate editable zones and unaltered regions.
To tackle these issues, we employ versatile strategies, including data augmentation and supplementary eye guidance.
- Score: 21.405442790474268
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Generating high-quality, person-generic visual dubbing remains a
challenge. Recent work has adopted a two-stage paradigm that decouples
rendering from lip synchronization, using an intermediate representation as a
conduit. Still, previous methods rely on rough landmarks or are confined to a
single speaker, which limits their performance. In this paper, we propose
DiffDub: Diffusion-based dubbing. We first craft the Diffusion auto-encoder
with an inpainting renderer that incorporates a mask to delineate editable
zones and unaltered regions. This allows seamless filling of the lower-face
region while preserving the remaining parts. Throughout our experiments, we
encountered several challenges. First, the semantic encoder lacks robustness,
restricting its ability to capture high-level features. In addition, the
modeling ignored facial positioning, causing mouth or nose jitter across
frames. To tackle these issues, we employ versatile strategies, including data
augmentation and supplementary eye guidance. Moreover, we encapsulate a
conformer-based reference encoder and a motion generator fortified by a
cross-attention mechanism. This enables our model to learn person-specific
textures with varying references and reduces reliance on paired audio-visual
data. Our experiments show that our approach outpaces existing methods by
considerable margins and delivers seamless, intelligible videos in
person-generic and multilingual scenarios.
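The two mechanisms the abstract describes, masked inpainting of the lower-face region and cross-attention between motion and reference features, can be sketched roughly as follows. This is a minimal PyTorch sketch under assumed tensor shapes; the names masked_composite and ReferenceCrossAttention are hypothetical illustrations, not the authors' code.

```python
# Minimal sketch of two ideas from the abstract: (1) composing an inpainted
# prediction with the original frame so only the masked lower-face region is
# edited, and (2) letting motion features attend to reference-frame embeddings
# via cross-attention. Shapes, dimensions, and names are illustrative
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn


def masked_composite(generated: torch.Tensor,
                     original: torch.Tensor,
                     mask: torch.Tensor) -> torch.Tensor:
    """Fill the editable (mask == 1) zone with generated pixels, keep the rest.

    generated, original: (B, C, H, W); mask: (B, 1, H, W) with 1 = editable.
    """
    return mask * generated + (1.0 - mask) * original


class ReferenceCrossAttention(nn.Module):
    """Motion tokens query reference-frame tokens (hypothetical dimensions)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, motion_tokens: torch.Tensor,
                reference_tokens: torch.Tensor) -> torch.Tensor:
        # motion_tokens: (B, T, dim) from an audio-driven motion generator;
        # reference_tokens: (B, N, dim) from a reference encoder.
        out, _ = self.attn(motion_tokens, reference_tokens, reference_tokens)
        return out


if __name__ == "__main__":
    frame = torch.rand(1, 3, 256, 256)
    prediction = torch.rand(1, 3, 256, 256)
    mask = torch.zeros(1, 1, 256, 256)
    mask[..., 128:, :] = 1.0  # treat the lower half as the editable zone
    print(masked_composite(prediction, frame, mask).shape)
```

Keeping the unmasked pixels untouched is what lets the renderer edit only the mouth area while the rest of the face stays exactly as in the source frame.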
Related papers
- Efficient Video Face Enhancement with Enhanced Spatial-Temporal Consistency [36.939731355462264]
This study proposes a novel and efficient blind video face enhancement method.
It restores high-quality videos from their compressed low-quality versions with an effective de-flickering mechanism.
Experiments conducted on the VFHQ-Test dataset demonstrate that our method surpasses the current state-of-the-art blind face video restoration and de-flickering methods on both efficiency and effectiveness.
arXiv Detail & Related papers (2024-11-25T15:14:36Z)
- FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio clip.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
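The "trainable adapters with frozen Latent Diffusion Models" idea suggests a parameter-efficient setup along the following lines; the bottleneck Adapter below is a generic illustration under assumed dimensions, not the paper's actual modules.

```python
# Generic adapter-on-frozen-backbone sketch (illustrative assumptions, not the
# paper's code): the diffusion backbone's weights stay frozen and only the
# small adapter modules receive gradients during training.
import torch.nn as nn


class Adapter(nn.Module):
    """Residual bottleneck adapter over a frozen feature stream."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))


def freeze_backbone_train_adapters(backbone: nn.Module,
                                   adapters: nn.ModuleList) -> None:
    for p in backbone.parameters():
        p.requires_grad_(False)  # frozen diffusion backbone weights
    for p in adapters.parameters():
        p.requires_grad_(True)   # only the adapters are optimized
```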
arXiv Detail & Related papers (2024-03-04T09:59:48Z)
- CLR-Face: Conditional Latent Refinement for Blind Face Restoration Using
Score-Based Diffusion Models [57.9771859175664]
Recent generative-prior-based methods have shown promising blind face restoration performance.
Generating fine-grained facial details faithful to inputs remains a challenging problem.
We introduce a diffusion-based-prior inside a VQGAN architecture that focuses on learning the distribution over uncorrupted latent embeddings.
arXiv Detail & Related papers (2024-02-08T23:51:49Z)
- Video Infringement Detection via Feature Disentanglement and Mutual
Information Maximization [51.206398602941405]
We propose to disentangle an original high-dimensional feature into multiple sub-features.
On top of the disentangled sub-features, we learn an auxiliary feature to enhance the sub-features.
Our method achieves 90.1% TOP-100 mAP on the large-scale SVD dataset and also sets the new state-of-the-art on the VCSL benchmark dataset.
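The disentanglement step above might look roughly like the following projection into sub-features; the class name and dimensions are assumptions, and the auxiliary feature plus the mutual-information objective from the paper are omitted.

```python
# Rough sketch (assumed names and sizes) of splitting one high-dimensional
# feature into several sub-features with separate projection heads.
import torch
import torch.nn as nn


class FeatureDisentangler(nn.Module):
    def __init__(self, in_dim: int = 1024, sub_dim: int = 256, num_sub: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(in_dim, sub_dim) for _ in range(num_sub)]
        )

    def forward(self, feat: torch.Tensor) -> list[torch.Tensor]:
        # feat: (B, in_dim) -> list of num_sub tensors of shape (B, sub_dim)
        return [head(feat) for head in self.heads]
```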
arXiv Detail & Related papers (2023-09-13T10:53:12Z)
- Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation [92.55296042611886]
We propose a framework called "Reuse and Diffuse", dubbed VidRD, to produce more frames following the frames already generated by an LDM.
We also propose a set of strategies for composing video-text data that involve diverse content from multiple existing datasets.
arXiv Detail & Related papers (2023-09-07T08:12:58Z)
- Redundancy-aware Transformer for Video Question Answering [71.98116071679065]
We propose a novel transformer-based architecture, that aims to model VideoQA in a redundancy-aware manner.
To address the neighboring-frame redundancy, we introduce a video encoder structure that emphasizes the object-level change in neighboring frames.
As for the cross-modal redundancy, we equip our fusion module with a novel adaptive sampling, which explicitly differentiates the vision-language interactions.
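One plausible reading of the adaptive sampling above is a relevance-based top-k selection of video tokens with respect to the text query; the sketch below is an assumption for illustration rather than the paper's module.

```python
# Hedged sketch (not the paper's module): score video tokens against a text
# embedding and keep only the top-k most relevant ones, pruning redundant
# cross-modal interactions.
import torch


def adaptive_sample(video_tokens: torch.Tensor,  # (B, N, D)
                    text_embed: torch.Tensor,    # (B, D)
                    k: int = 16) -> torch.Tensor:
    scores = torch.einsum("bnd,bd->bn", video_tokens, text_embed)  # relevance
    idx = scores.topk(k, dim=1).indices                            # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, video_tokens.size(-1))  # (B, k, D)
    return torch.gather(video_tokens, 1, idx)
```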
arXiv Detail & Related papers (2023-08-07T03:16:24Z)
- TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at
Scale [59.01246141215051]
We analyze the factor that leads to degradation from the perspective of language supervision.
We propose a tunable-free pre-training strategy to retain the generalization ability of the text encoder.
We produce a series of models, dubbed TVTSv2, with up to one billion parameters.
arXiv Detail & Related papers (2023-05-23T15:44:56Z)
- FateZero: Fusing Attentions for Zero-shot Text-based Video Editing [104.27329655124299]
We propose FateZero, a zero-shot text-based editing method for real-world videos that requires neither per-prompt training nor a user-specific mask.
Our method is the first to demonstrate zero-shot text-driven video style and local attribute editing from a trained text-to-image model.
arXiv Detail & Related papers (2023-03-16T17:51:13Z)
- MILES: Visual BERT Pre-training with Injected Language Semantics for
Video-text Retrieval [43.2299969152561]
Our method outperforms state-of-the-art methods for text-to-video retrieval on four datasets with both zero-shot and fine-tune evaluation protocols.
arXiv Detail & Related papers (2022-04-26T16:06:31Z)
- Temporally coherent video anonymization through GAN inpainting [0.0]
This work tackles the problem of temporally coherent face anonymization in natural video streams.
We propose JaGAN, a two-stage system starting with detecting and masking out faces with black image patches in all individual frames of the video.
Our initial experiments reveal that image-based generative models are not capable of inpainting patches with temporally coherent appearance across neighboring video frames.
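The first stage described above, blacking out detected faces in each frame before generative inpainting, could be sketched as follows; the bounding-box format and function name are assumptions, not JaGAN's code.

```python
# Illustrative sketch (assumed box format): black out detected face boxes in
# every frame so a generative model can later inpaint them.
import numpy as np


def mask_faces(frames: np.ndarray, boxes_per_frame) -> np.ndarray:
    """frames: (T, H, W, 3) uint8; boxes_per_frame: list of [(x1, y1, x2, y2), ...]."""
    masked = frames.copy()
    for t, boxes in enumerate(boxes_per_frame):
        for x1, y1, x2, y2 in boxes:
            masked[t, y1:y2, x1:x2, :] = 0  # black image patch over the face
    return masked
```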
arXiv Detail & Related papers (2021-06-04T08:19:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.