Multi-rate attention architecture for fast streamable Text-to-speech
spectrum modeling
- URL: http://arxiv.org/abs/2104.00705v1
- Date: Thu, 1 Apr 2021 18:15:30 GMT
- Title: Multi-rate attention architecture for fast streamable Text-to-speech
spectrum modeling
- Authors: Qing He, Zhiping Xiu, Thilo Koehler, Jilong Wu
- Abstract summary: High quality text-to-speech (TTS) systems use a two-stage architecture, with a spectrum model stage that generates spectral frames and a vocoder stage that generates the actual audio.
While these models can produce high quality speech, they often incur an O($L$) increase in both latency and real-time factor (RTF) with respect to input length $L$.
We propose a multi-rate attention architecture that breaks the latency and RTF bottlenecks by computing a compact representation during encoding and generating the attention vector in a streaming manner during decoding.
- Score: 5.080331097831114
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Typical high quality text-to-speech (TTS) systems today use a two-stage
architecture, with a spectrum model stage that generates spectral frames and a
vocoder stage that generates the actual audio. High-quality spectrum models
usually incorporate the encoder-decoder architecture with self-attention or
bi-directional long short-term memory (BLSTM) units. While these models can produce
high quality speech, they often incur an O($L$) increase in both latency and
real-time factor (RTF) with respect to input length $L$. In other words, longer
inputs lead to longer delays and slower synthesis, limiting the models' use in
real-time applications. In this paper, we propose a multi-rate attention
architecture that breaks the latency and RTF bottlenecks by computing a compact
representation during encoding and recurrently generating the attention vector
in a streaming manner during decoding. The proposed architecture achieves high
audio quality (MOS of 4.31 compared to groundtruth 4.48), low latency, and low
RTF at the same time. Meanwhile, both latency and RTF of the proposed system
stay constant regardless of input lengths, making it ideal for real-time
applications.
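The constant latency and RTF follow from decoupling per-frame attention cost from the input length: the encoder output is summarized once into a fixed-size representation, and each decoding step attends over that summary while updating a recurrent state. Below is a minimal sketch of this pattern; the pooling-based summarizer, module names, and dimensions are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: constant-cost streaming attention over a fixed-size
# encoder summary. All names/shapes are illustrative assumptions.
import torch
import torch.nn as nn

class StreamingAttentionStep(nn.Module):
    """One decoder step attending over a compact encoder summary, so the
    per-frame cost depends on n_slots, not on the input length T."""
    def __init__(self, enc_dim: int, dec_dim: int, n_slots: int = 32):
        super().__init__()
        self.n_slots = n_slots
        self.query = nn.Linear(dec_dim, enc_dim)   # maps state -> query
        self.rnn = nn.GRUCell(enc_dim, dec_dim)    # recurrent state update

    def summarize(self, enc: torch.Tensor) -> torch.Tensor:
        """Compress a (T, enc_dim) encoder output into n_slots slots by
        average-pooling over time (one simple compact representation;
        computed once, before streaming starts)."""
        pooled = torch.nn.functional.adaptive_avg_pool1d(
            enc.t().unsqueeze(0), self.n_slots)    # (1, enc_dim, n_slots)
        return pooled.squeeze(0).t()               # (n_slots, enc_dim)

    def forward(self, summary: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        """One streaming step: attention over n_slots entries (constant
        cost), then a GRU state update. Returns the new decoder state."""
        q = self.query(h)                            # (enc_dim,)
        weights = torch.softmax(summary @ q, dim=0)  # (n_slots,)
        context = weights @ summary                  # (enc_dim,)
        return self.rnn(context.unsqueeze(0), h.unsqueeze(0)).squeeze(0)

# Per-frame cost is unchanged whether T is 50 or 5000.
enc = torch.randn(500, 256)                        # encoder output, T = 500
step = StreamingAttentionStep(enc_dim=256, dec_dim=128)
summary = step.summarize(enc)                      # computed once
h = torch.zeros(128)
for _ in range(10):                                # generate 10 frames
    h = step(summary, h)
```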
Related papers
- RT-LA-VocE: Real-Time Low-SNR Audio-Visual Speech Enhancement [36.10772098876638]
We propose RT-LA-VocE, which re-designs every component of LA-VocE to perform causal real-time inference with a 40ms input frame.
We show that our algorithm achieves state-of-the-art results in all real-time scenarios.
arXiv Detail & Related papers (2024-07-10T16:49:23Z)
- Cross-layer scheme for low latency multiple description video streaming over Vehicular Ad-hoc NETworks (VANETs) [2.2124180701409233]
The new state-of-the-art video coding standard, HEVC, is very promising for real-time video streaming.
We propose an original cross-layer system in order to enhance received video quality in vehicular communications.
arXiv Detail & Related papers (2023-11-05T14:34:58Z)
- FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder With Multiple STFTs [1.8047694351309207]
FastFit is a novel neural vocoder architecture that replaces the U-Net encoder with multiple short-time Fourier transforms (STFTs).
We show that FastFit achieves nearly twice the generation speed of baseline vocoders while maintaining high sound quality.
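As a rough illustration of the multiple-STFT idea, the sketch below extracts log-magnitude spectrograms at several resolutions with torch.stft; the window sizes and log-magnitude features are assumptions, not necessarily FastFit's exact front end.

```python
# Sketch: multi-resolution STFT features for a mono waveform.
# FFT sizes here are illustrative assumptions.
import torch

def multi_stft_features(wav: torch.Tensor, fft_sizes=(512, 1024, 2048)):
    """Return log-magnitude spectrograms of a (num_samples,) waveform at
    several time-frequency resolutions."""
    feats = []
    for n_fft in fft_sizes:
        spec = torch.stft(wav, n_fft=n_fft, hop_length=n_fft // 4,
                          window=torch.hann_window(n_fft),
                          return_complex=True)    # (n_fft//2+1, n_frames)
        feats.append(torch.log1p(spec.abs()))
    return feats

feats = multi_stft_features(torch.randn(16000))   # 1 s of audio at 16 kHz
print([tuple(f.shape) for f in feats])
```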
arXiv Detail & Related papers (2023-05-18T09:05:17Z)
- Framewise WaveGAN: High Speed Adversarial Vocoder in Time Domain with Very Low Computational Complexity [23.49462995118466]
Framewise WaveGAN vocoder achieves higher quality than auto-regressive maximum-likelihood vocoders such as LPCNet at a very low complexity of 1.2 GFLOPS.
This makes GAN vocoders more practical on edge and low-power devices.
arXiv Detail & Related papers (2022-12-08T19:38:34Z)
- High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed-up the training by using a single multiscale spectrogram adversary.
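A quantized latent space of this kind is commonly built with residual vector quantization, where each codebook stage encodes the residual left by the previous one. The sketch below shows that generic pattern; the codebook sizes and dimensions are illustrative assumptions, not this codec's configuration.

```python
# Sketch: residual vector quantization (RVQ) of one latent frame.
# Codebook count/size and latent dimension are illustrative assumptions.
import torch

def rvq_encode(z: torch.Tensor, codebooks):
    """Quantize latent z with a stack of codebooks, one index per stage;
    each stage quantizes the residual of the previous stage."""
    indices, residual = [], z
    for cb in codebooks:                           # cb: (codebook_size, dim)
        dists = torch.cdist(residual.unsqueeze(0), cb).squeeze(0)
        i = int(torch.argmin(dists))
        indices.append(i)
        residual = residual - cb[i]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct the latent as the sum of the selected code vectors."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))

dim, n_stages = 64, 4
books = [torch.randn(256, dim) for _ in range(n_stages)]
z = torch.randn(dim)
codes = rvq_encode(z, books)                       # one index per stage
z_hat = rvq_decode(codes, books)
print(codes, float((z - z_hat).norm()))            # residual error left over
```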
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
- FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis [77.06890315052563]
We propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audio from unconstrained talking videos with low latency.
Experiments show that our model achieves a $19.76\times$ speedup for audio generation compared with the current autoregressive model on 3-second input sequences.
arXiv Detail & Related papers (2022-07-08T10:10:39Z)
- TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding [60.292702363839716]
Current SOTA backbone networks for speaker embedding are designed to aggregate multi-scale features from an utterance with multi-branch network architectures for speaker representation.
We propose an effective temporal multi-scale (TMS) model in which multi-scale branches can be designed efficiently within a speaker embedding network with almost no increase in computational cost.
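One generic way to get multi-scale temporal features at almost no extra cost is to split channels across parallel depthwise convolutions with different dilations, as sketched below; this is an illustrative pattern, not the TMS design itself.

```python
# Sketch: cheap temporal multi-scale block via channel splitting and
# depthwise dilated convolutions. Dilations/channels are assumptions.
import torch
import torch.nn as nn

class MultiScaleTemporalBlock(nn.Module):
    def __init__(self, channels: int, dilations=(1, 2, 4, 8)):
        super().__init__()
        assert channels % len(dilations) == 0
        c = channels // len(dilations)
        # each branch covers a different temporal receptive field
        self.branches = nn.ModuleList(
            nn.Conv1d(c, c, kernel_size=3, padding=d, dilation=d, groups=c)
            for d in dilations)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, channels, time) -> same shape, multi-scale features."""
        chunks = torch.chunk(x, len(self.branches), dim=1)
        return torch.cat([b(ch) for b, ch in zip(self.branches, chunks)], dim=1)

block = MultiScaleTemporalBlock(64)
print(block(torch.randn(2, 64, 100)).shape)        # torch.Size([2, 64, 100])
```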
arXiv Detail & Related papers (2022-03-17T05:49:35Z)
- PINs: Progressive Implicit Networks for Multi-Scale Neural Representations [68.73195473089324]
We propose a progressive positional encoding, exposing a hierarchical structure to incremental sets of frequency encodings.
Our model accurately reconstructs scenes with wide frequency bands and learns a scene representation at progressive levels of detail.
Experiments on several 2D and 3D datasets show improvements in reconstruction accuracy, representational capacity and training speed compared to baselines.
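A progressive positional encoding can be sketched as Fourier features whose higher-frequency bands fade in as training advances; the cosine ramp schedule below is an assumption for illustration, not necessarily the paper's schedule.

```python
# Sketch: Fourier-feature encoding whose bands are gated by a progress
# parameter alpha. The fade-in schedule is an illustrative assumption.
import torch

def progressive_encoding(x: torch.Tensor, n_bands: int, alpha: float):
    """Encode coordinates x of shape (N, D) in [0, 1]; alpha in
    [0, n_bands] controls how many frequency bands are active."""
    feats = []
    for j in range(n_bands):
        # cosine ramp: band j fades in smoothly as alpha passes j
        gate = 0.5 * (1 - torch.cos(
            torch.clamp(torch.tensor(alpha - j), 0.0, 1.0) * torch.pi))
        angle = (2.0 ** j) * torch.pi * x
        feats += [gate * torch.sin(angle), gate * torch.cos(angle)]
    return torch.cat(feats, dim=-1)                # (N, 2 * n_bands * D)

coords = torch.rand(8, 2)
early = progressive_encoding(coords, n_bands=6, alpha=1.5)  # low bands only
late = progressive_encoding(coords, n_bands=6, alpha=6.0)   # all bands on
print(early.shape, late.shape)                     # both (8, 24)
```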
arXiv Detail & Related papers (2022-02-09T20:33:37Z)
- High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency [3.119625275101153]
The system is composed of an autoregressive attention-based sequence-to-sequence acoustic model and the LPCNet vocoder for waveform generation.
The full end-to-end system can generate almost natural quality speech, which is verified by listening tests.
arXiv Detail & Related papers (2021-11-17T11:46:43Z)
- WNARS: WFST based Non-autoregressive Streaming End-to-End Speech Recognition [59.975078145303605]
We propose a novel framework, namely WNARS, using hybrid CTC-attention AED models and weighted finite-state transducers.
On the AISHELL-1 task, WNARS achieves a character error rate of 5.22% with 640 ms latency, which is, to the best of our knowledge, state-of-the-art performance for online ASR.
arXiv Detail & Related papers (2021-04-08T07:56:03Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
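Time-restricted self-attention amounts to a banded attention mask that limits each frame to a bounded left and right context, capping lookahead for streaming; the context sizes below are illustrative assumptions.

```python
# Sketch: banded mask for time-restricted self-attention.
# left/right context sizes are illustrative assumptions.
import torch

def time_restricted_mask(T: int, left: int, right: int) -> torch.Tensor:
    """Boolean (T, T) mask; entry [q, k] is True if query frame q may
    attend to key frame k."""
    idx = torch.arange(T)
    rel = idx.unsqueeze(0) - idx.unsqueeze(1)      # rel[q, k] = k - q
    return (rel >= -left) & (rel <= right)

mask = time_restricted_mask(T=6, left=2, right=1)
print(mask.int())
# Each row t attends only to frames t-2 .. t+1, so latency is bounded by
# the right context instead of the full utterance length.
```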
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.