FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis
- URL: http://arxiv.org/abs/2207.03800v1
- Date: Fri, 8 Jul 2022 10:10:39 GMT
- Title: FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis
- Authors: Yongqi Wang and Zhou Zhao
- Abstract summary: We propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audio from unconstrained talking videos with low latency.
Experiments show that our model achieves a $19.76\times$ speedup for audio generation compared with the current autoregressive model on input sequences of 3 seconds.
- Score: 77.06890315052563
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unconstrained lip-to-speech synthesis aims to generate
corresponding speech from silent videos of talking faces with no restriction
on head poses or vocabulary. Current works mainly use sequence-to-sequence
models to solve this problem, either in an autoregressive architecture or a
flow-based non-autoregressive architecture. However, these models suffer from
several drawbacks: 1) Instead of directly generating audio, they use a
two-stage pipeline that first generates mel-spectrograms and then
reconstructs audio from the spectrograms. This causes cumbersome deployment
and degradation of speech quality due to error propagation; 2) The audio
reconstruction algorithm used by these models limits the inference speed and
audio quality, while neural vocoders are not available for these models since
their output spectrograms are not accurate enough; 3) The autoregressive
model suffers from high inference latency, while the flow-based model has
high memory occupancy: neither of them is efficient enough in both time and
memory usage. To tackle these problems, we propose FastLTS, a
non-autoregressive end-to-end model which can directly synthesize
high-quality speech audio from unconstrained talking videos with low latency,
and which has a relatively small model size. Moreover, departing from the
widely used 3D-CNN visual frontend for lip movement encoding, we are the
first to propose a transformer-based visual frontend for this task.
Experiments show that our model achieves a $19.76\times$ speedup for audio
waveform generation compared with the current autoregressive model on input
sequences of 3 seconds, and obtains superior audio quality.
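The abstract does not spell out the architecture, so the following is only a minimal, hypothetical PyTorch sketch of the general recipe it describes: a transformer visual frontend encodes the lip frames in parallel (non-autoregressively), and transposed convolutions upsample the frame features directly to a waveform with no mel-spectrogram stage. Every module name, dimension, and upsampling factor here is an illustrative assumption, not FastLTS's actual design.
```python
# Hypothetical sketch of a non-autoregressive lip-to-speech pipeline
# (not the actual FastLTS code): a transformer encodes all lip frames at
# once, then transposed convolutions upsample directly to a waveform.
import torch
import torch.nn as nn

class LipToSpeechSketch(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        # Per-frame embedding stands in for a transformer visual frontend;
        # 96x96 grayscale mouth crops are an assumed input format.
        self.frame_embed = nn.Linear(96 * 96, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        # 16000 Hz / 25 fps = 640 samples per video frame; 5 * 8 * 16 = 640x.
        self.upsampler = nn.Sequential(
            nn.ConvTranspose1d(d_model, 128, 10, stride=5, padding=3, output_padding=1),
            nn.GELU(),
            nn.ConvTranspose1d(128, 64, 16, stride=8, padding=4),
            nn.GELU(),
            nn.ConvTranspose1d(64, 1, 32, stride=16, padding=8),
            nn.Tanh(),  # waveform in [-1, 1]
        )

    def forward(self, frames):                      # frames: (batch, T, 96, 96)
        x = self.frame_embed(frames.flatten(2))     # (batch, T, d_model)
        x = self.encoder(x)                         # all frames in parallel
        wav = self.upsampler(x.transpose(1, 2))     # (batch, 1, T * 640)
        return wav.squeeze(1)

wav = LipToSpeechSketch()(torch.rand(1, 75, 96, 96))  # 3 s at 25 fps -> (1, 48000)
```
With 75 frames of 25 fps video (3 seconds), one forward pass emits 48,000 samples of 16 kHz audio; generating the whole waveform in a single pass, rather than step by step, is what the non-autoregressive design buys.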
Related papers
- Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference [10.909997817643905]
We present the Low Frame-rate Speech Codec (LFSC): a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression at 1.89 kbps and 21.5 frames per second.
We demonstrate that our novel codec can make the inference of LLM-based text-to-speech models around three times faster while improving intelligibility and producing quality comparable to previous models.
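As a rough illustration of the finite scalar quantization mentioned above (the generic technique, not LFSC's implementation): each latent channel is squashed into a bounded range and rounded to a small fixed set of values, with a straight-through estimator so gradients pass through the rounding. The channel and level counts below are arbitrary assumptions.
```python
# Minimal sketch of finite scalar quantization (FSQ); illustrative only.
import torch

def fsq(z: torch.Tensor, levels: int = 7) -> torch.Tensor:
    """Quantize each channel of z to `levels` (odd) evenly spaced values in [-1, 1]."""
    half = (levels - 1) / 2
    bounded = torch.tanh(z) * half                        # squash into [-half, half]
    quantized = torch.round(bounded)                      # snap to the integer grid
    quantized = bounded + (quantized - bounded).detach()  # straight-through gradient
    return quantized / half                               # rescale to [-1, 1]

codes = fsq(torch.randn(2, 8, 50))  # (batch, channels, frames); 7^8 codewords per frame
```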
arXiv Detail & Related papers (2024-09-18T16:39:10Z)
- Autoregressive Diffusion Transformer for Text-to-Speech Synthesis [39.32761051774537]
We propose encoding audio as vector sequences in continuous space $\mathbb{R}^d$ and autoregressively generating these sequences.
High-bitrate continuous speech representation enables almost flawless reconstruction, allowing our model to achieve nearly perfect speech editing.
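The diffusion head that samples each vector is not detailed in this summary; the sketch below only shows the autoregressive loop over continuous vectors in $\mathbb{R}^d$, with the transformer's last hidden state standing in (as a loud simplification) for the per-step generator.
```python
# Illustrative autoregressive loop over continuous latent vectors (not the
# paper's model): a causal transformer consumes the sequence so far and the
# next vector is appended; the paper would sample it with a diffusion head.
import torch
import torch.nn as nn

d = 64
layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)

@torch.no_grad()
def generate(prompt: torch.Tensor, steps: int) -> torch.Tensor:
    seq = prompt                                       # (1, T0, d) continuous prompt
    for _ in range(steps):
        T = seq.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = backbone(seq, mask=causal)                 # causal self-attention
        seq = torch.cat([seq, h[:, -1:, :]], dim=1)    # append the next vector
    return seq

out = generate(torch.randn(1, 4, d), steps=10)         # (1, 14, d)
```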
arXiv Detail & Related papers (2024-06-08T18:57:13Z)
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching.
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
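Rectified flow matching itself is a generic technique; a minimal sketch of it (not Frieren's code, which conditions on video features) looks like this: train a velocity field on straight-line paths between noise and data, then sample by integrating the learned ODE with a few Euler steps.
```python
# Minimal rectified flow matching on toy 16-dim data; illustrative only.
import torch
import torch.nn as nn

v = nn.Sequential(nn.Linear(16 + 1, 64), nn.SiLU(), nn.Linear(64, 16))  # velocity net

def rf_loss(x1: torch.Tensor) -> torch.Tensor:
    x0 = torch.randn_like(x1)                  # noise endpoint
    t = torch.rand(x1.size(0), 1)              # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * x1                 # straight-line interpolation
    target = x1 - x0                           # constant velocity along the line
    return ((v(torch.cat([xt, t], dim=-1)) - target) ** 2).mean()

@torch.no_grad()
def sample(n: int, steps: int = 4) -> torch.Tensor:
    x = torch.randn(n, 16)
    for i in range(steps):                     # Euler steps along dx/dt = v(x, t)
        t = torch.full((n, 1), i / steps)
        x = x + v(torch.cat([x, t], dim=-1)) / steps
    return x
```
The straight (rectified) paths are what make very few sampling steps viable, which is where the efficiency claim comes from.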
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
- GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation [71.73912454164834]
A modern talking face generation method is expected to achieve the goals of generalized audio-lip synchronization, good video quality, and high system efficiency.
NeRF has become a popular technique in this field since it can achieve high-fidelity and 3D-consistent talking face generation from a few-minute-long training video.
We propose GeneFace++ to handle these challenges by utilizing the pitch contour as an auxiliary feature and introducing a temporal loss in the facial motion prediction process.
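The abstract does not specify the form of the temporal loss; one common choice for stabilizing per-frame motion prediction, shown purely as an assumption, is to match frame-to-frame differences of the predicted motion against those of the ground truth.
```python
# Hypothetical temporal-consistency loss for facial motion sequences;
# not necessarily the loss used by GeneFace++.
import torch

def temporal_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred, target: (batch, frames, motion_dim) facial motion sequences."""
    d_pred = pred[:, 1:] - pred[:, :-1]     # predicted frame-to-frame change
    d_tgt = target[:, 1:] - target[:, :-1]  # ground-truth frame-to-frame change
    return (d_pred - d_tgt).abs().mean()    # penalize jitter relative to truth
```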
arXiv Detail & Related papers (2023-05-01T12:24:09Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- RAVE: A variational autoencoder for fast and high-quality neural audio synthesis [2.28438857884398]
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis.
We show that our model is the first able to generate 48kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU.
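As a toy illustration of a variational autoencoder over raw waveforms (the real RAVE adds multiband decomposition and an adversarial fine-tuning stage, both omitted here; all sizes below are assumptions):
```python
# Toy audio VAE; strided convolutions encode, transposed convolutions decode.
import torch
import torch.nn as nn

class AudioVAE(nn.Module):
    def __init__(self, latent: int = 16):
        super().__init__()
        self.enc = nn.Conv1d(1, 2 * latent, kernel_size=1024, stride=512, padding=256)
        self.dec = nn.ConvTranspose1d(latent, 1, kernel_size=1024, stride=512, padding=256)

    def forward(self, wav):                                   # wav: (batch, 1, samples)
        mu, logvar = self.enc(wav).chunk(2, dim=1)            # posterior parameters
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        recon = torch.tanh(self.dec(z))                       # waveform in [-1, 1]
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        return recon, kl

recon, kl = AudioVAE()(torch.rand(1, 1, 16384) * 2 - 1)  # shapes round-trip: (1, 1, 16384)
```
Decoding a whole waveform in one feed-forward pass, rather than sample by sample, is what makes faster-than-real-time CPU synthesis plausible.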
arXiv Detail & Related papers (2021-11-09T09:07:30Z)
- WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis [80.60577805727624]
WaveGrad 2 is a non-autoregressive generative model for text-to-speech synthesis.
It can generate high fidelity audio, approaching the performance of a state-of-the-art neural TTS system.
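"Iterative refinement" here is diffusion-style sampling: start from Gaussian noise and repeatedly denoise, conditioned (in the actual model) on text features. A generic, heavily simplified sketch of such a sampling loop, with a toy noise predictor and an assumed schedule:
```python
# Generic diffusion-style iterative refinement loop; illustrative only,
# not WaveGrad 2's network or noise schedule.
import torch
import torch.nn as nn

eps_net = nn.Linear(1 + 1, 1)  # toy predictor: (sample value, noise level) -> noise

@torch.no_grad()
def refine(n: int, steps: int = 6) -> torch.Tensor:
    betas = torch.linspace(1e-4, 0.05, steps)           # assumed variance schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(n, 1)                               # start from pure noise
    for t in reversed(range(steps)):                    # coarse-to-fine refinement
        level = alpha_bar[t].sqrt() * torch.ones(n, 1)  # continuous noise-level input
        eps = eps_net(torch.cat([x, level], dim=-1))    # predict the noise component
        x = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x
```
Because each refinement step processes the whole signal at once, the model stays non-autoregressive: the number of sequential operations scales with the step count, not the number of audio samples.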
arXiv Detail & Related papers (2021-06-17T17:09:21Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
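A minimal sketch of the adversarial setup this describes, with toy dense networks standing in for the paper's video encoder-decoder and audio discriminator, whose details are not in this summary:
```python
# Toy GAN losses for video-to-speech; illustrative stand-ins only.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 16000), nn.Tanh())
D = nn.Sequential(nn.Linear(16000, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()

video_feat = torch.randn(4, 512)           # stand-in for encoded raw video
real_audio = torch.rand(4, 16000) * 2 - 1  # stand-in for ground-truth waveforms

fake_audio = G(video_feat)                 # generator: video features -> waveform
d_loss = bce(D(real_audio), torch.ones(4, 1)) + \
         bce(D(fake_audio.detach()), torch.zeros(4, 1))  # discriminator objective
g_loss = bce(D(fake_audio), torch.ones(4, 1))            # generator tries to fool D
```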
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.