Compact Neural TTS Voices for Accessibility
- URL: http://arxiv.org/abs/2501.17332v1
- Date: Tue, 28 Jan 2025 22:51:14 GMT
- Title: Compact Neural TTS Voices for Accessibility
- Authors: Kunal Jain, Eoin Murphy, Deepanshu Gupta, Jonathan Dyke, Saumya Shah, Vasileios Tsiaras, Petko Petkov, Alistair Conkie
- Abstract summary: Cloud-based neural TTS systems provide significantly better audio quality and naturalness but regress in terms of latency and responsiveness.
More recently, neural TTS models have been made deployable on handheld devices.
We describe a high-quality compact neural TTS system achieving latency on the order of 15 ms with low disk footprint.
- Score: 1.5558822250482192
- Abstract: Contemporary text-to-speech solutions for accessibility applications can typically be classified into two categories: (i) device-based statistical parametric speech synthesis (SPSS) or unit selection (USEL) and (ii) cloud-based neural TTS. SPSS and USEL offer low latency and low disk footprint at the expense of naturalness and audio quality. Cloud-based neural TTS systems provide significantly better audio quality and naturalness but regress in terms of latency and responsiveness, rendering these impractical for real-world applications. More recently, neural TTS models were made deployable to run on handheld devices. Nevertheless, latency remains higher than SPSS and USEL, while disk footprint prohibits pre-installation for multiple voices at once. In this work, we describe a high-quality compact neural TTS system achieving latency on the order of 15 ms with low disk footprint. The proposed solution is capable of running on low-power devices.
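As a hedged illustration of how the paper's headline figures (latency on the order of 15 ms, small disk footprint) are typically measured, the sketch below times a toy compact acoustic model and reports its serialized size. `TinyAcousticModel` and all of its dimensions are hypothetical placeholders, not the paper's architecture.

```python
# A minimal sketch (assumed, not the paper's code): measure inference latency
# and on-disk size for a toy compact acoustic model.
import os
import time
import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):
    """Hypothetical stand-in for a compact acoustic model (phonemes -> mel frames)."""
    def __init__(self, n_symbols=64, mel_dim=80, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, hidden)
        self.conv = nn.Conv1d(hidden, hidden, kernel_size=5, padding=2)
        self.proj = nn.Linear(hidden, mel_dim)

    def forward(self, phonemes):                   # (B, T) int64 phoneme ids
        x = self.embed(phonemes).transpose(1, 2)   # (B, hidden, T)
        x = torch.relu(self.conv(x)).transpose(1, 2)
        return self.proj(x)                        # (B, T, mel_dim)

model = TinyAcousticModel().eval()
torch.save(model.state_dict(), "tiny_acoustic.pt")
print(f"disk footprint: {os.path.getsize('tiny_acoustic.pt') / 1e6:.2f} MB")

phonemes = torch.randint(0, 64, (1, 50))           # a short phoneme sequence
with torch.no_grad():
    model(phonemes)                                # warm-up run
    t0 = time.perf_counter()
    model(phonemes)
    print(f"latency: {(time.perf_counter() - t0) * 1e3:.2f} ms")
```

The point is the measurement pattern (warm-up, then a timed forward pass, plus the serialized state dict as the footprint proxy), not the toy model's numbers.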
Related papers
- SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR aims to provide a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech [4.91849983180793]
We propose a lightweight Text-to-Speech (TTS) system based on deep convolutional neural networks.
Our model consists of two stages: Text2Spectrum and SSRN.
Experiments show that our model reduces training time and parameter count while preserving the quality and naturalness of the synthesized speech; a toy sketch of the two-stage layout follows below.
arXiv Detail & Related papers (2024-03-13T01:27:57Z)
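A toy sketch of the two-stage layout the EM-TTS summary names, assuming the conventional split of Text2Spectrum (text to coarse mel frames) followed by SSRN (spectrogram super-resolution); all layer shapes here are illustrative assumptions, not the paper's configuration.

```python
# Assumed two-stage convolutional TTS skeleton: Text2Spectrum then SSRN.
import torch
import torch.nn as nn

class Text2Spectrum(nn.Module):
    """Convolutional text encoder that emits coarse mel frames."""
    def __init__(self, n_symbols=64, mel_dim=80, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, hidden)
        self.convs = nn.Sequential(
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, mel_dim, 3, padding=1),
        )

    def forward(self, text):                       # (B, T) token ids
        return self.convs(self.embed(text).transpose(1, 2))  # (B, mel, T)

class SSRN(nn.Module):
    """Upsamples coarse mels in time and maps them to full spectrograms."""
    def __init__(self, mel_dim=80, spec_dim=513):
        super().__init__()
        self.up = nn.ConvTranspose1d(mel_dim, mel_dim, 4, stride=4)
        self.out = nn.Conv1d(mel_dim, spec_dim, 3, padding=1)

    def forward(self, mel):
        return self.out(torch.relu(self.up(mel)))

text = torch.randint(0, 64, (1, 30))
spec = SSRN()(Text2Spectrum()(text))               # (1, 513, 120)
```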
- sVAD: A Robust, Low-Power, and Light-Weight Voice Activity Detection with Spiking Neural Networks [51.516451451719654]
Spiking Neural Networks (SNNs) are known to be biologically plausible and power-efficient.
This paper introduces a novel SNN-based Voice Activity Detection model, referred to as sVAD.
It provides effective auditory feature representation through SincNet and 1D convolution, and improves noise robustness with attention mechanisms; a simplified non-spiking sketch follows below.
arXiv Detail & Related papers (2024-03-09T02:55:44Z)
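A simplified, non-spiking sketch of the sVAD front end described above: a SincNet-style learnable band-pass layer, a 1D convolution, and attention pooling to a speech probability. The actual model is a spiking network; every size and name here is an assumption.

```python
# Conventional-ANN approximation of the sVAD front end (SNN omitted).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincBandpass(nn.Module):
    """Band-pass filters parameterized by learnable low cutoffs and bandwidths."""
    def __init__(self, n_filters=16, kernel_size=101, sr=16000):
        super().__init__()
        self.low = nn.Parameter(torch.linspace(50.0, 4000.0, n_filters))
        self.band = nn.Parameter(torch.full((n_filters,), 500.0))
        t = (torch.arange(kernel_size) - kernel_size // 2) / sr
        self.register_buffer("t", t)

    def forward(self, wav):                        # (B, 1, N)
        high = self.low + self.band.abs()
        # band-pass = difference of two sinc low-pass filters
        filt = (2 * high[:, None] * torch.sinc(2 * high[:, None] * self.t)
                - 2 * self.low[:, None] * torch.sinc(2 * self.low[:, None] * self.t))
        return F.conv1d(wav, filt[:, None, :], padding=filt.shape[-1] // 2)

class AttnVAD(nn.Module):
    def __init__(self, n_filters=16, hidden=32):
        super().__init__()
        self.front = SincBandpass(n_filters)
        self.conv = nn.Conv1d(n_filters, hidden, 5, stride=4)
        self.score = nn.Linear(hidden, 1)          # attention weights over time
        self.clf = nn.Linear(hidden, 1)            # speech / non-speech logit

    def forward(self, wav):
        h = torch.relu(self.conv(self.front(wav))).transpose(1, 2)  # (B, T, H)
        w = torch.softmax(self.score(h), dim=1)
        return torch.sigmoid(self.clf((w * h).sum(dim=1)))

speech_prob = AttnVAD()(torch.randn(1, 1, 16000))  # one second at 16 kHz
```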
- Fluid Batching: Exit-Aware Preemptive Serving of Early-Exit Neural Networks on Edge NPUs [74.83613252825754]
"smart ecosystems" are being formed where sensing happens concurrently rather than standalone.
This is shifting the on-device inference paradigm towards deploying neural processing units (NPUs) at the edge.
We propose a novel early-exit scheduling approach that allows preemption at run time to account for the dynamicity introduced by the arrival and exiting processes; an illustrative early-exit sketch follows below.
arXiv Detail & Related papers (2022-09-27T15:04:01Z)
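An illustrative sketch of the early-exit mechanism the scheduling builds on: inference stops at the first intermediate classifier whose confidence crosses a threshold, and those exit points are natural places for a preemptive scheduler to intervene. This is not the paper's scheduler; names and sizes are assumptions.

```python
# Assumed early-exit network; exit points double as preemption points.
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    def __init__(self, dim=64, n_classes=10, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_blocks))
        self.exits = nn.ModuleList(nn.Linear(dim, n_classes) for _ in range(n_blocks))

    def forward(self, x, threshold=0.9):
        probs = None
        for i, (block, exit_head) in enumerate(zip(self.blocks, self.exits)):
            x = torch.relu(block(x))
            probs = torch.softmax(exit_head(x), dim=-1)
            # Exit point: a cheap place to check confidence, and where a
            # scheduler could suspend this request in favor of another.
            if probs.max().item() >= threshold:
                return probs, i                    # early exit taken
        return probs, len(self.blocks) - 1         # ran to the final exit

probs, exit_idx = EarlyExitNet()(torch.randn(1, 64))
```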
- Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge [3.612475016403612]
Bunched LPCNet2 delivers high quality for cloud servers and low complexity for low-resource edge devices.
Experiments demonstrate that Bunched LPCNet2 generates satisfactory speech quality with a model footprint of 1.1 MB while operating faster than real time on a Raspberry Pi 3B; a toy illustration of sample bunching follows below.
arXiv Detail & Related papers (2022-03-27T23:56:52Z)
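A toy illustration of the sample "bunching" that gives Bunched LPCNet its efficiency: each recurrent step emits a bunch of S samples instead of one, dividing per-sample network cost by S. The real vocoder couples this with linear prediction and LPCNet's other machinery, all omitted from this assumed sketch.

```python
# Assumed bunched autoregressive generator: S samples per RNN step.
import torch
import torch.nn as nn

class BunchedAR(nn.Module):
    def __init__(self, bunch=4, hidden=64):
        super().__init__()
        self.bunch = bunch
        self.rnn = nn.GRUCell(bunch, hidden)
        self.out = nn.Linear(hidden, bunch)        # S samples per step

    @torch.no_grad()
    def generate(self, n_samples):
        h = torch.zeros(1, self.rnn.hidden_size)
        prev = torch.zeros(1, self.bunch)
        chunks = []
        for _ in range(n_samples // self.bunch):   # S-times fewer RNN steps
            h = self.rnn(prev, h)
            prev = torch.tanh(self.out(h))
            chunks.append(prev)
        return torch.cat(chunks, dim=1)

wav = BunchedAR().generate(16000)                  # one second at 16 kHz
```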
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate the design of a compact audio-visual wake word spotting (WWS) system that utilizes visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF); a sketch of the prune-and-fine-tune loop follows below.
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
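A hedged sketch of the iterative prune-and-fine-tune loop the summary describes, using PyTorch's magnitude pruning utilities; the two-layer model and random training data are placeholders for the paper's audio-visual WWS networks.

```python
# Iterative magnitude pruning with fine-tuning between rounds (LTH-IF idea).
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def fine_tune(steps=100):
    """Placeholder fine-tuning loop on random data."""
    for _ in range(steps):
        x = torch.randn(8, 40)
        y = torch.randint(0, 2, (8,))
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

for _round in range(5):                            # prune 20% per round
    for module in model:
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.2)
    fine_tune()

# Report the final sparsity of the first layer's weight mask.
mask = model[0].weight_mask
print(f"sparsity: {1 - mask.mean().item():.2f}")   # ~0.67 after five rounds
```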
- High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency [3.119625275101153]
The system is composed of an autoregressive attention-based sequence-to-sequence acoustic model and the LPCNet vocoder for waveform generation.
The full end-to-end system generates almost natural-quality speech, as verified by listening tests; a sketch of the chunked streaming idea follows below.
arXiv Detail & Related papers (2021-11-17T11:46:43Z)
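A sketch of the streaming principle behind the latency claim: flush audio as soon as a small buffer of acoustic frames is ready, so time-to-first-audio stays flat regardless of sentence length. Both generators below are random placeholders standing in for the paper's acoustic model and LPCNet.

```python
# Chunked streaming synthesis: audio is playable before synthesis finishes.
import torch

def acoustic_frames(n_frames, mel_dim=80):
    """Placeholder autoregressive acoustic model: yields one mel frame at a time."""
    for _ in range(n_frames):
        yield torch.randn(mel_dim)

def vocode(frames):
    """Placeholder vocoder: 200 waveform samples per mel frame."""
    return torch.randn(len(frames) * 200)

def stream_tts(n_frames, chunk=10):
    buf = []
    for frame in acoustic_frames(n_frames):
        buf.append(frame)
        if len(buf) == chunk:                      # flush early, don't wait
            yield vocode(buf)
            buf = []
    if buf:
        yield vocode(buf)                          # tail chunk

first_audio = next(stream_tts(n_frames=400))       # ready after 10 frames
```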
- A Survey on Neural Speech Synthesis [110.39292386792555]
Text to speech (TTS) is a hot research topic in speech, language, and machine learning communities.
We conduct a comprehensive survey on neural TTS, aiming to provide a good understanding of current research and future trends.
We focus on the key components in neural TTS, including text analysis, acoustic models and vocoders, and several advanced topics, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS, etc.
arXiv Detail & Related papers (2021-06-29T16:50:51Z)
- LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search [127.56834100382878]
We propose LightSpeech to automatically design more lightweight and efficient TTS models based on FastSpeech.
Experiments show that the model discovered by our method achieves a 15x model compression ratio and a 6.5x inference speedup on CPU with on-par voice quality; a toy sketch of the search idea follows below.
arXiv Detail & Related papers (2021-02-08T07:45:06Z)
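A toy sketch of the neural architecture search idea: sample candidate architectures from a space of lightweight layer choices and rank them by an objective. The depthwise-separable search space and size-only objective are illustrative assumptions; the paper's actual search operates on FastSpeech components.

```python
# Random-search NAS over an assumed space of depthwise-separable conv kernels.
import random
import torch.nn as nn

KERNELS = [3, 5, 7]

def sample_arch(n_layers=4):
    return [random.choice(KERNELS) for _ in range(n_layers)]

def build(arch, dim=128):
    layers = []
    for k in arch:
        layers += [
            nn.Conv1d(dim, dim, k, padding=k // 2, groups=dim),  # depthwise
            nn.Conv1d(dim, dim, 1),                              # pointwise
            nn.ReLU(),
        ]
    return nn.Sequential(*layers)

def n_params(model):
    return sum(p.numel() for p in model.parameters())

# Proxy objective: pick the smallest candidate; a real search would also
# score synthesis quality on a validation set.
best = min((sample_arch() for _ in range(20)), key=lambda a: n_params(build(a)))
```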
- Enhancing Speech Intelligibility in Text-To-Speech Synthesis using Speaking Style Conversion [17.520533341887642]
We propose a novel transfer learning approach using Tacotron and WaveRNN based TTS synthesis.
The proposed speech system exploits two modification strategies: (a) Lombard speaking style data and (b) Spectral Shaping and Dynamic Range Compression (SSDRC).
Intelligibility enhancement, as quantified by the Intelligibility in Bits measure, shows that the proposed Lombard-SSDRC TTS system achieves significant relative improvements of 110% to 130% in speech-shaped noise (SSN) and 47% to 140% in competing-speaker noise (CSN); a rough sketch of the SSDRC stages follows below.
arXiv Detail & Related papers (2020-08-13T10:51:56Z)
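A rough sketch of the two SSDRC stages named above, assuming a simple frequency-domain shaping gain and a static compression knee; the actual processing chain and its parameters differ from this illustration.

```python
# Assumed SSDRC-style chain: spectral shaping, then dynamic range compression.
import numpy as np

def ssdrc(wav, sr=16000):
    # Spectral shaping: amplify 1-4 kHz, where intelligibility cues concentrate.
    spec = np.fft.rfft(wav)
    freqs = np.fft.rfftfreq(len(wav), d=1 / sr)
    gain = np.where((freqs >= 1000) & (freqs <= 4000), 2.0, 1.0)
    shaped = np.fft.irfft(spec * gain, n=len(wav))
    # Dynamic range compression: soften peaks above a threshold so the signal
    # can be amplified overall without clipping.
    env = np.abs(shaped)
    compressed = np.where(env > 0.5, 0.5 + 0.25 * (env - 0.5), env)
    return np.sign(shaped) * compressed

out = ssdrc(0.3 * np.random.randn(16000))          # one second of noise input
```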