Towards Realistic Visual Dubbing with Heterogeneous Sources
- URL: http://arxiv.org/abs/2201.06260v1
- Date: Mon, 17 Jan 2022 07:57:24 GMT
- Title: Towards Realistic Visual Dubbing with Heterogeneous Sources
- Authors: Tianyi Xie, Liucheng Liao, Cheng Bi, Benlai Tang, Xiang Yin, Jianfei
Yang, Mingjie Wang, Jiali Yao, Yang Zhang, Zejun Ma
- Abstract summary: Few-shot visual dubbing involves synchronizing the lip movements with arbitrary speech input for any talking head.
We propose a simple yet efficient two-stage framework with greater flexibility for mining heterogeneous data.
Our method makes it possible to train the two sub-networks on independent corpora.
- Score: 22.250010330418398
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of few-shot visual dubbing focuses on synchronizing the lip
movements with arbitrary speech input for any talking head video. Despite
moderate improvements in current approaches, they commonly require high-quality
homologous video and audio sources, and therefore fail to leverage
heterogeneous data sufficiently. In practice, it may be intractable to collect
perfectly homologous data in some cases, for example, audio-corrupted or
picture-blurry videos. To exploit this kind of data and support high-fidelity
few-shot visual dubbing, we propose a simple yet efficient two-stage framework
with greater flexibility for mining heterogeneous data. Specifically, our
two-stage paradigm employs facial landmarks as an intermediate prior of the
latent representations and disentangles lip-movement prediction from the core
task of realistic talking head generation. As a result, the two sub-networks
can be trained on independent corpora, making it easy to exploit more readily
available heterogeneous data. Moreover, thanks to this disentanglement, our
framework allows further fine-tuning for a given talking head, leading to
better speaker-identity preservation in the final synthesized results. The
proposed method can also transfer appearance features from other speakers to
the target speaker. Extensive experimental results demonstrate that our method
outperforms the state of the art in generating highly realistic videos
synchronized with the input speech.
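The abstract describes the two-stage design only at a high level. The following minimal PyTorch sketch, written purely for illustration, shows how facial landmarks can act as the interface that decouples an audio-to-landmark predictor from a landmark-to-frame renderer; all module names, feature dimensions, and architectures below are assumptions and do not correspond to the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of the two-stage idea: stage 1
# maps audio features to facial landmarks, stage 2 renders frames from the
# landmarks plus a reference image of the target speaker. The landmark
# interface is what lets the two sub-networks be trained on different corpora.
import torch
import torch.nn as nn


class AudioToLandmarks(nn.Module):
    """Stage 1: predict per-frame 2D landmarks from audio features."""

    def __init__(self, audio_dim=80, hidden=256, n_landmarks=68):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_landmarks * 2)  # (x, y) per landmark

    def forward(self, audio_feats):             # (B, T, audio_dim)
        h, _ = self.rnn(audio_feats)
        lm = self.head(h)                       # (B, T, n_landmarks * 2)
        return lm.view(*lm.shape[:2], -1, 2)    # (B, T, n_landmarks, 2)


class LandmarksToFrames(nn.Module):
    """Stage 2: render frames conditioned on landmarks and a reference image."""

    def __init__(self, n_landmarks=68, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.ref_enc = nn.Sequential(            # identity/appearance encoder
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.decoder = nn.Sequential(            # fuses appearance + landmarks
            nn.Linear(64 + n_landmarks * 2, 512), nn.ReLU(),
            nn.Linear(512, 3 * img_size * img_size), nn.Tanh(),
        )

    def forward(self, landmarks, ref_img):       # (B, T, L, 2), (B, 3, H, W)
        B, T = landmarks.shape[:2]
        app = self.ref_enc(ref_img)               # (B, 64)
        app = app.unsqueeze(1).expand(B, T, -1)   # broadcast identity over time
        cond = torch.cat([app, landmarks.flatten(2)], dim=-1)
        frames = self.decoder(cond)               # (B, T, 3*H*W)
        return frames.view(B, T, 3, self.img_size, self.img_size)


if __name__ == "__main__":
    stage1 = AudioToLandmarks()
    stage2 = LandmarksToFrames()
    audio = torch.randn(2, 25, 80)    # 2 clips, 25 frames of mel-style features
    ref = torch.randn(2, 3, 64, 64)   # reference image of the target speaker
    frames = stage2(stage1(audio), ref)
    print(frames.shape)               # torch.Size([2, 25, 3, 64, 64])
```

Because the renderer never sees audio, it can in principle be fine-tuned on its own with a few frames of the target speaker, which is consistent with the identity-preserving fine-tuning step mentioned in the abstract.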
Related papers
- High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish a less ambiguous mapping from audio to the landmark motion of the lip and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z)
- DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding [52.84475402151201]
We present a vision-guided speaker embedding extractor using a self-supervised pre-trained model and prompt tuning technique.
We further develop a diffusion-based video-to-speech synthesis model, called DiffV2S, conditioned on these speaker embeddings and the visual representation extracted from the input video.
Our experimental results show that DiffV2S achieves state-of-the-art performance compared to previous video-to-speech synthesis techniques.
arXiv Detail & Related papers (2023-08-15T14:07:41Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors [18.904856604045264]
We introduce a simple and novel framework for one-shot audio-driven talking head generation.
We probabilistically sample all the holistic lip-irrelevant facial motions to semantically match the input audio.
Thanks to the probabilistic nature of the diffusion prior, one big advantage of our framework is that it can synthesize diverse facial motion sequences.
arXiv Detail & Related papers (2022-12-07T17:55:41Z)
- Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection [15.884911752869437]
We present a novel approach for synthetic speech detection, exploiting the combination of two high-level semantic properties of the human voice.
On one side, we focus on speaker identity cues and represent them as speaker embeddings extracted using a state-of-the-art method for the automatic speaker verification task.
On the other side, voice prosody, intended as variations in rhythm, pitch or accent in speech, is extracted through a specialized encoder.
arXiv Detail & Related papers (2022-10-31T11:03:03Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
- WavThruVec: Latent speech representation as intermediate features for neural speech synthesis [1.1470070927586016]
WavThruVec is a two-stage architecture that resolves the bottleneck by using high-dimensional Wav2Vec 2.0 embeddings as an intermediate speech representation.
We show that the proposed model not only matches the quality of state-of-the-art neural models, but also presents useful properties enabling tasks like voice conversion or zero-shot synthesis.
arXiv Detail & Related papers (2022-03-31T10:21:08Z)
- Multimodal Attention Fusion for Target Speaker Extraction [108.73502348754842]
We propose a novel attention mechanism for multi-modal fusion and its training methods.
Our proposals improve the signal-to-distortion ratio (SDR) by 1.0 dB over conventional fusion mechanisms on simulated data.
arXiv Detail & Related papers (2021-02-02T05:59:35Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)