Text-to-speech for the hearing impaired
- URL: http://arxiv.org/abs/2012.02174v2
- Date: Mon, 22 Mar 2021 12:33:59 GMT
- Title: Text-to-speech for the hearing impaired
- Authors: Josef Schlittenlacher, Thomas Baer
- Abstract summary: Text-to-speech (TTS) systems can compensate for a hearing loss at the source rather than correcting for it at the receiving end.
We propose an algorithm that restores loudness to normal perception at a high resolution in time, frequency and level.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-speech (TTS) systems offer the opportunity to compensate for a
hearing loss at the source rather than correcting for it at the receiving end.
This removes limitations such as time constraints for algorithms that amplify a
sound in a hearing aid and can lead to higher speech quality. We propose an
algorithm that restores loudness to normal perception at a high resolution in
time, frequency and level, and embed it in a TTS system that uses Tacotron2 and
WaveGlow to produce individually amplified speech. Subjective evaluations of
speech quality showed that the proposed algorithm led to high-quality audio
with sound quality similar to original or linearly amplified speech but
considerably higher speech intelligibility in noise. Transfer learning led to a
quick adaptation of the produced spectra from original speech to individually
amplified speech, resulted in high speech quality and intelligibility, and thus
gives us a way to train an individual TTS system efficiently.
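The central idea, restoring loudness per time-frequency bin before synthesis, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the example audiogram, the recruitment factor alpha, and the level-dependent gain rule are all assumptions standing in for the paper's loudness model.

```python
import numpy as np
from scipy.signal import stft, istft

FS = 22050                                                  # sample rate (Hz), assumed
AUDIOGRAM_F = np.array([250, 500, 1000, 2000, 4000, 8000])  # audiogram frequencies (Hz)
AUDIOGRAM_HL = np.array([20, 30, 40, 50, 60, 65])           # example hearing loss (dB HL)

def restore_loudness(x, alpha=0.3):
    """Amplify each time-frequency bin of x so the elevated hearing threshold
    is compensated; alpha mimics loudness recruitment by shrinking the gain
    for bins that are already loud. A toy stand-in for the paper's model."""
    f, _, X = stft(x, fs=FS, nperseg=512)
    hl = np.interp(f, AUDIOGRAM_F, AUDIOGRAM_HL)    # dB of loss per frequency bin
    level = 20 * np.log10(np.abs(X) + 1e-12)        # bin level in dB (re full scale)
    # Full compensation for near-threshold bins, less for already-loud bins.
    gain_db = hl[:, None] * np.clip(1 - alpha * (level + 60) / 60, 0, 1)
    Y = X * 10 ** (gain_db / 20)
    _, y = istft(Y, fs=FS, nperseg=512)
    return y
```

In the paper this kind of individual amplification is applied at the source, to the Tacotron2/WaveGlow output, so it can run without the latency constraints of a hearing aid.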
Related papers
- An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS [43.84833978193758]
Zero-shot text-to-speech (TTS) systems are capable of synthesizing any speaker's voice from a short audio prompt.
The quality of the generated speech significantly deteriorates when the audio prompt contains noise.
In this paper, we explore various strategies to enhance the quality of audio generated from noisy audio prompts.
arXiv Detail & Related papers (2024-06-09T08:51:50Z)
- Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters [47.75276947690528]
The zero-shot text-to-speech (TTS) method can reproduce speaker characteristics very accurately.
However, this approach suffers from degradation in speech synthesis quality when the reference speech contains noise.
In this paper, we propose a noise-robust zero-shot TTS method.
arXiv Detail & Related papers (2024-01-10T12:21:21Z)
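A bottleneck adapter of the kind mentioned in the entry above is a small residual module inserted into an otherwise frozen self-supervised encoder. The sketch below is a generic illustration assuming PyTorch; the dimensions and the single ReLU bottleneck are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter: a small learned update added to the
    output of a frozen encoder layer (hypothetical sizes)."""
    def __init__(self, d_model: int = 768, d_bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, d_model) hidden states from the frozen encoder
        return h + self.up(torch.relu(self.down(h)))
```

- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]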
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
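The residual vector quantizers mentioned in the NaturalSpeech 2 entry can be illustrated with a toy sketch: each stage quantizes the residual left by the previous stage. The codebooks below are random and the sizes are assumptions; this shows only the mechanism, not the paper's codec.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STAGES, K, D = 4, 256, 64                     # stages, codebook size, latent dim (assumed)
codebooks = rng.normal(size=(N_STAGES, K, D))   # random stand-in codebooks

def rvq_encode(z: np.ndarray):
    """Residual vector quantization of latent frames z with shape (T, D).
    Returns one code index per frame per stage, plus the quantized latents."""
    residual, codes = z.copy(), []
    for cb in codebooks:
        # nearest codeword for each frame's current residual
        idx = np.argmin(((residual[:, None, :] - cb[None]) ** 2).sum(-1), axis=1)
        codes.append(idx)
        residual -= cb[idx]                     # next stage quantizes what is left
    return np.stack(codes), z - residual        # (N_STAGES, T) codes, (T, D) quantized
```

- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]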
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- A$^3$T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing [31.666920933058144]
We propose our framework, Alignment-Aware Acoustic-Text Pretraining (A$^3$T), which reconstructs masked acoustic signals with text input and acoustic-text alignment during training.
Experiments show A$3$T outperforms SOTA models on speech editing, and improves multi-speaker speech synthesis without the external speaker verification model.
arXiv Detail & Related papers (2022-03-18T01:36:25Z)
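The alignment-aware masking that A$^3$T describes can be sketched as masking whole phoneme-aligned spans of acoustic frames, which the model would then reconstruct from text and the unmasked frames. The segment format and masking probability below are assumptions for illustration.

```python
import numpy as np

def mask_aligned_spans(mel, segments, p=0.3, seed=0):
    """mel: (T, n_mels) acoustic frames; segments: list of (start, end) frame
    spans, one per phoneme (the acoustic-text alignment). Masks each whole
    span with probability p; the original frames are the reconstruction target."""
    rng = np.random.default_rng(seed)
    mel = mel.copy()
    mask = np.zeros(len(mel), dtype=bool)
    for start, end in segments:
        if rng.random() < p:
            mel[start:end] = 0.0       # stand-in mask value
            mask[start:end] = True
    return mel, mask
```

- Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module [16.369219400819134]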
State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech.
When using reduced amounts of training data, standard TTS models suffer from speech quality and intelligibility degradations.
We propose a novel extremely low-resource TTS method called Voice Filter that uses as little as one minute of speech from a target speaker.
arXiv Detail & Related papers (2022-02-16T16:12:21Z)
- Optimization of a Real-Time Wavelet-Based Algorithm for Improving Speech Intelligibility [1.0554048699217666]
The discrete-time speech signal is split into frequency sub-bands via a multi-level discrete wavelet transform.
The sub-band gains are adjusted while keeping the overall signal energy unchanged.
The speech intelligibility under various background interference and simulated hearing loss conditions is enhanced.
arXiv Detail & Related papers (2022-02-05T13:03:57Z)
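The sub-band processing in the wavelet entry above maps directly onto a multi-level discrete wavelet transform. A minimal sketch, assuming the PyWavelets package; the wavelet, depth, and gain values are placeholders (one gain per sub-band, i.e. level + 1 values):

```python
import numpy as np
import pywt

def reweight_subbands(x, gains, wavelet="db4", level=5):
    """Split x into DWT sub-bands, scale each band by its gain, reconstruct,
    then renormalize so the overall signal energy is unchanged."""
    coeffs = pywt.wavedec(x, wavelet, level=level)   # [cA_L, cD_L, ..., cD_1]
    coeffs = [c * g for c, g in zip(coeffs, gains)]
    y = pywt.waverec(coeffs, wavelet)[: len(x)]
    return y * np.sqrt(np.sum(x ** 2) / (np.sum(y ** 2) + 1e-12))
```

- Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition [65.25325641528701]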
Motivated by the spectro-temporal level differences between disordered and normal speech that systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates and increased dysfluencies, novel spectro-temporal subspace basis embedding deep features derived by SVD decomposition of speech spectrum are proposed.
Experiments conducted on the UASpeech corpus suggest the proposed spectro-temporal deep feature adapted systems consistently outperformed baseline i-vector adaptation by up to 2.63% absolute (8.6% relative) reduction in word error rate (WER) with or without data augmentation.
arXiv Detail & Related papers (2022-01-14T16:56:43Z)
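The subspace basis embedding in the entry above can be illustrated by an SVD of a magnitude spectrogram; keeping the top-k left singular vectors as spectral bases is an assumption about how such features would be formed, not the paper's exact recipe.

```python
import numpy as np

def subspace_features(spec, k=8):
    """spec: magnitude spectrogram (n_freq, n_frames). Returns the top-k
    spectral bases (left singular vectors) and the per-frame coordinates
    of the spectrum in that subspace."""
    U, s, Vt = np.linalg.svd(spec, full_matrices=False)
    bases = U[:, :k]                    # spectro-temporal subspace basis
    embedding = s[:k, None] * Vt[:k]    # (k, n_frames) time-varying embedding
    return bases, embedding
```

- Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]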
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With Style-Adaptive Layer Normalization (SALN), our model effectively synthesizes speech in the style of the target speaker even from a single audio sample.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
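SALN, the style-adaptive layer normalization the Meta-StyleSpeech entry relies on, predicts the normalization gain and bias from a style vector. The sketch below assumes PyTorch and a single linear projection; shapes are (batch, time, hidden).

```python
import torch
import torch.nn as nn

class SALN(nn.Module):
    """Style-Adaptive Layer Norm: layer normalization whose gain and bias
    come from a style vector (hypothetical single-projection variant)."""
    def __init__(self, d_hidden: int, d_style: int):
        super().__init__()
        self.affine = nn.Linear(d_style, 2 * d_hidden)  # predicts gain and bias
        self.norm = nn.LayerNorm(d_hidden, elementwise_affine=False)

    def forward(self, h: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        # h: (B, T, d_hidden) hidden states, w: (B, d_style) style vector
        gamma, beta = self.affine(w).unsqueeze(1).chunk(2, dim=-1)
        return gamma * self.norm(h) + beta
```

- High Fidelity Speech Regeneration with Application to Speech Enhancement [96.34618212590301]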
We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time.
Inspired by voice conversion methods, we train the model to augment the speech characteristics while preserving the identity of the source.
arXiv Detail & Related papers (2021-01-31T10:54:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.