High Quality Streaming Speech Synthesis with Low,
Sentence-Length-Independent Latency
- URL: http://arxiv.org/abs/2111.09052v1
- Date: Wed, 17 Nov 2021 11:46:43 GMT
- Title: High Quality Streaming Speech Synthesis with Low,
Sentence-Length-Independent Latency
- Authors: Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos,
Aimilios Chalamandaris, Georgia Maniati, Panos Kakoulidis, Spyros Raptis,
June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis
- Abstract summary: The system is composed of an autoregressive attention-based sequence-to-sequence acoustic model and the LPCNet vocoder for waveform generation.
The full end-to-end system can generate almost natural quality speech, as verified by listening tests.
- Score: 3.119625275101153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents an end-to-end text-to-speech system with low latency on a
CPU, suitable for real-time applications. The system is composed of an
autoregressive attention-based sequence-to-sequence acoustic model and the
LPCNet vocoder for waveform generation. An acoustic model architecture that
adopts modules from both the Tacotron 1 and 2 models is proposed, while
stability is ensured by using a recently proposed purely location-based
attention mechanism, suitable for arbitrary sentence length generation. During
inference, the decoder is unrolled and acoustic feature generation is performed
in a streaming manner, allowing for a nearly constant latency that is
independent of the sentence length. Experimental results show that the
acoustic model can produce feature sequences with minimal latency, about 31
times faster than real time on a computer CPU and 6.5 times faster on a mobile CPU,
enabling it to meet the conditions required for real-time applications on both
devices. The full end-to-end system can generate almost natural quality speech,
which is verified by listening tests.
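To make the streaming behavior concrete, below is a minimal sketch (not the authors' code) of an unrolled decoder whose acoustic frames are vocoded in fixed-size chunks, so the latency to first audio stays nearly constant regardless of sentence length; `acoustic_model` and `vocoder` are hypothetical stand-ins for the seq2seq model and LPCNet.

```python
# Minimal sketch of streaming synthesis, assuming hypothetical
# `acoustic_model` / `vocoder` objects (not the authors' actual classes).
from typing import Iterator

import numpy as np


def stream_tts(text: str, acoustic_model, vocoder,
               chunk_size: int = 32) -> Iterator[np.ndarray]:
    """Yield waveform segments while the sentence is still being decoded."""
    encoder_out = acoustic_model.encode(text)        # runs once per sentence
    state = acoustic_model.initial_decoder_state()
    frames, done = [], False
    while not done:
        # One unrolled autoregressive step: emit a single acoustic frame.
        frame, state, done = acoustic_model.decode_step(encoder_out, state)
        frames.append(frame)
        if len(frames) == chunk_size or done:
            # Vocode per chunk instead of per utterance, so playback can
            # begin after the first chunk: latency is set by chunk_size,
            # not by sentence length.
            yield vocoder.synthesize(np.stack(frames))
            frames = []
```

A consumer would simply iterate, e.g. `for segment in stream_tts(...)`, starting playback as soon as the first chunk arrives, which is what makes the latency sentence-length independent.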
Related papers
- Robust Wake-Up Word Detection by Two-stage Multi-resolution Ensembles [48.208214762257136]
It employs two models: a lightweight on-device model for real-time processing of the audio stream and a verification model on the server side.
To protect privacy, audio features are sent to the cloud instead of raw audio.
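As a hedged sketch of this two-stage scheme (the names, threshold, and feature choice below are assumptions, not the paper's): the on-device model scores every chunk, and only a local detection triggers feature extraction and a server round-trip, so raw audio never leaves the device.

```python
import numpy as np


def wake_word_check(chunk: np.ndarray, device_model, server_verify,
                    threshold: float = 0.8) -> bool:
    """Two-stage wake word detection: cheap local screen, remote verify."""
    if device_model(chunk) < threshold:   # lightweight on-device model
        return False                      # raw audio never leaves the device
    # Send compact spectral features (a placeholder here) instead of the
    # waveform, to protect privacy.
    feats = np.log1p(np.abs(np.fft.rfft(chunk)))
    return server_verify(feats)           # heavier server-side model
```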
arXiv Detail & Related papers (2023-10-17T16:22:18Z)
- FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis [77.06890315052563]
We propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audios from unconstrained talking videos with low latency.
Experiments show that our model achieves a $19.76\times$ speedup for audio generation compared with the current autoregressive model on input sequences of 3 seconds.
arXiv Detail & Related papers (2022-07-08T10:10:39Z)
- FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis [90.3069686272524]
This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis.
FastDiff employs a stack of time-aware location-variable convolutions of diverse receptive field patterns to efficiently model long-term time dependencies.
Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms.
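As a rough, simplified sketch of a location-variable convolution (FastDiff's actual time-aware module conditions the predicted kernels on the diffusion step and speech features, which is omitted here): each local segment of the signal is convolved with a kernel predicted specifically for that segment.

```python
import torch
import torch.nn.functional as F


def location_variable_conv(x: torch.Tensor, kernels: torch.Tensor,
                           hop: int) -> torch.Tensor:
    """x: (B, C_in, T); kernels: (B, T // hop, C_out, C_in, K), K odd."""
    batch = x.shape[0]
    segments = []
    for i in range(x.shape[-1] // hop):
        seg = x[:, :, i * hop:(i + 1) * hop]   # one local segment
        k = kernels[:, i]                      # the kernel predicted for it
        # Fold the batch into the channel dim so each batch item is
        # convolved with its own kernel via a grouped convolution.
        seg = seg.reshape(1, -1, hop)
        w = k.reshape(-1, k.shape[2], k.shape[3])
        out = F.conv1d(seg, w, padding=k.shape[3] // 2, groups=batch)
        segments.append(out.reshape(batch, -1, hop))
    return torch.cat(segments, dim=-1)


# e.g. 2 items, 1 input channel, 8 output channels, kernel size 3, hop 16:
# y = location_variable_conv(torch.randn(2, 1, 64),
#                            torch.randn(2, 4, 8, 1, 3), hop=16)
```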
arXiv Detail & Related papers (2022-04-21T07:49:09Z)
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
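A hedged sketch of lottery-ticket-style iterative pruning with fine-tuning is shown below; the pruning rate, number of rounds, and rewinding schedule are assumptions, not the paper's exact LTH-IF recipe.

```python
# Iterative magnitude pruning with weight rewinding, in the spirit of the
# lottery ticket hypothesis. `train_fn` is a hypothetical training callback.
import copy

import torch
import torch.nn.utils.prune as prune


def lth_iterative_prune(model: torch.nn.Module, train_fn,
                        rounds: int = 5, rate: float = 0.2) -> torch.nn.Module:
    init_state = copy.deepcopy(model.state_dict())   # the "winning ticket" init
    for _ in range(rounds):
        train_fn(model)                              # train / fine-tune
        for module in model.modules():               # prune smallest weights
            if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
                prune.l1_unstructured(module, name="weight", amount=rate)
        # Rewind surviving weights to their initial values, keeping the masks.
        with torch.no_grad():
            for name, param in model.named_parameters():
                if name.endswith("weight_orig"):
                    param.copy_(init_state[name.replace("_orig", "")])
    return model
```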
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
- PickNet: Real-Time Channel Selection for Ad Hoc Microphone Arrays [15.788867107071244]
PickNet is a neural network model for real-time channel selection for an ad hoc microphone array consisting of multiple recording devices like cell phones.
The proposed model yielded significant gains in word error rate with limited computational cost over systems using a block-online beamformer and a single distant microphone.
arXiv Detail & Related papers (2022-01-24T10:52:43Z)
- RAVE: A variational autoencoder for fast and high-quality neural audio synthesis [2.28438857884398]
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis.
We show that our model is the first able to generate 48kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU.
arXiv Detail & Related papers (2021-11-09T09:07:30Z)
- Dissecting User-Perceived Latency of On-Device E2E Speech Recognition [34.645194215436966]
We show that factors affecting token emission latency and endpointing behavior significantly impact user-perceived latency (UPL).
We achieve the best trade-off between latency and word error rate when performing ASR jointly with endpointing and using the recently proposed alignment regularization.
arXiv Detail & Related papers (2021-04-06T00:55:11Z)
- Multi-rate attention architecture for fast streamable Text-to-speech spectrum modeling [5.080331097831114]
High quality text-to-speech (TTS) systems use a two-stage architecture, with a spectrum model stage that generates spectral frames and a vocoder stage that generates the actual audio.
While these models can produce high quality speech, they often incur an $O(L)$ increase in both latency and real-time factor (RTF) with respect to the input length $L$.
We propose a multi-rate architecture that breaks the latency bottlenecks by encoding a compact representation during streaming.
arXiv Detail & Related papers (2021-04-01T18:15:30Z)
- VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition [60.462770498366524]
We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user.
We show that such a model can be quantized as an 8-bit integer model and run in real time.
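For illustration, PyTorch's dynamic quantization produces this kind of 8-bit integer model; the tiny feed-forward stand-in below is an assumption, not the actual VoiceFilter-Lite architecture or recipe.

```python
import torch

# Stand-in model: a small mask-estimation network over 257-bin spectra.
model = torch.nn.Sequential(
    torch.nn.Linear(257, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 257),
)

# Weights are stored as int8; activations are quantized on the fly,
# which is what enables real-time on-device inference.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

print(quantized)  # Linear layers replaced by their dynamic-quantized versions
```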
arXiv Detail & Related papers (2020-09-09T14:26:56Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.