Adversarial Feature Learning and Unsupervised Clustering based Speech
Synthesis for Found Data with Acoustic and Textual Noise
- URL: http://arxiv.org/abs/2004.13595v1
- Date: Tue, 28 Apr 2020 15:32:45 GMT
- Title: Adversarial Feature Learning and Unsupervised Clustering based Speech
Synthesis for Found Data with Acoustic and Textual Noise
- Authors: Shan Yang, Yuxuan Wang, Lei Xie
- Abstract summary: Attention-based sequence-to-sequence (seq2seq) speech synthesis has achieved extraordinary performance.
A studio-quality corpus with manual transcription is necessary to train such seq2seq systems.
We propose an approach to build a high-quality and stable seq2seq-based speech synthesis system using challenging found data.
- Score: 18.135965605011105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention-based sequence-to-sequence (seq2seq) speech synthesis has achieved extraordinary performance, but a studio-quality corpus with manual transcription is necessary to train such seq2seq systems. In this paper, we propose an approach to build a high-quality and stable seq2seq-based speech synthesis system using challenging found data, where the training speech contains noisy interference (acoustic noise) and the texts are imperfect speech recognition transcripts (textual noise). To deal with text-side noise, we propose a VQVAE-based heuristic method that compensates for erroneous linguistic features with phonetic information learned directly from speech. As for speech-side noise, we propose to learn a noise-independent feature in the auto-regressive decoder through adversarial training and data augmentation, which removes the need for an extra speech enhancement model. Experiments show the effectiveness of the proposed approach in dealing with both text-side and speech-side noise. Surpassing a denoising approach based on a state-of-the-art speech enhancement model, our system built on noisy found data synthesizes clean, high-quality speech with a MOS close to that of the system built on the clean counterpart.
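The abstract names two concrete mechanisms, sketched below in minimal PyTorch. Everything here (module names, dimensions, labels) is an illustrative assumption, not the authors' actual architecture. First, a VQ-VAE-style bottleneck that quantizes frame-level phonetic features learned from speech, which could stand in for (or correct) erroneous ASR-derived linguistic features:

```python
# Minimal sketch of a VQ-VAE-style bottleneck over frame-level phonetic
# features. Codebook size, feature dimension, and the straight-through trick
# are assumptions; the paper's exact heuristic for combining these codes
# with the noisy linguistic features is not reproduced here.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=256, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        # z: (batch, frames, dim) continuous phonetic features from speech.
        codes = self.codebook.weight                       # (num_codes, dim)
        dist = torch.cdist(z, codes.unsqueeze(0).expand(z.size(0), -1, -1))
        idx = dist.argmin(dim=-1)                          # discrete phone-like codes
        quantized = self.codebook(idx)                     # (batch, frames, dim)
        # Straight-through estimator: values come from the codebook,
        # gradients flow back to z unchanged.
        return z + (quantized - z).detach(), idx
```

Second, a gradient-reversal adversary trained on clean utterances and their noise-augmented copies, pushing the decoder toward noise-independent features without an external speech enhancement model:

```python
# Minimal sketch of adversarial learning of noise-independent decoder
# features via a gradient-reversal layer. Sizes and the binary clean/noisy
# label are assumptions; the paper pairs this with data augmentation
# (adding noise to training speech) rather than speech enhancement.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)                # identity on the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None   # reversed, scaled gradient

class NoiseAdversary(nn.Module):
    """Predicts clean (0) vs. noise-augmented (1) from decoder features;
    the reversed gradient trains the decoder to hide that distinction."""
    def __init__(self, feat_dim=512, lam=1.0):
        super().__init__()
        self.lam = lam
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 2))

    def forward(self, decoder_feat):
        return self.net(GradReverse.apply(decoder_feat, self.lam))

# Usage: add this classifier's loss to the seq2seq reconstruction loss.
adversary = NoiseAdversary()
feats = torch.randn(8, 512, requires_grad=True)  # stand-in decoder features
labels = torch.randint(0, 2, (8,))               # 0 = clean, 1 = augmented
loss = F.cross_entropy(adversary(feats), labels)
loss.backward()
```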
Related papers
- Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space [10.875499903992782]
We conduct a set of experiments around zero-shot learning with synthetic speech data for the specific task of speech commands classification.
Our results on the Google Speech Commands dataset show that a simple ASR-based filtering method can have a big impact on the quality of the generated data (a toy sketch of such filtering follows this entry).
Despite the good quality of the generated speech data, we also show that synthetic and real speech can still be easily distinguished when using self-supervised (WavLM) features.
arXiv Detail & Related papers (2024-09-19T13:07:55Z)
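A toy version of the ASR-based filtering idea above: keep only synthetic samples whose ASR transcript matches the prompt they were generated from. The `transcribe` callable is a placeholder for any off-the-shelf ASR system, and exact-match is an assumed criterion; the paper's actual filter may be softer (e.g., confidence-based).

```python
# Hypothetical ASR-based filter for synthetic speech data.
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so formatting differences pass."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def filter_synthetic(samples, transcribe):
    """samples: iterable of (audio, prompt_text) pairs.
    transcribe: placeholder ASR function, audio -> text."""
    kept = []
    for audio, prompt in samples:
        if normalize(transcribe(audio)) == normalize(prompt):
            kept.append((audio, prompt))
    return kept
```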
- Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters [47.75276947690528]
The zero-shot text-to-speech (TTS) method can reproduce speaker characteristics very accurately.
However, this approach suffers from degradation in speech synthesis quality when the reference speech contains noise.
In this paper, we propose a noise-robust zero-shot TTS method.
arXiv Detail & Related papers (2024-01-10T12:21:21Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- A$^3$T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing [31.666920933058144]
We propose our framework, Alignment-Aware Acoustic-Text Pretraining (A$^3$T), which reconstructs masked acoustic signals with text input and acoustic-text alignment during training.
Experiments show A$^3$T outperforms SOTA models on speech editing and improves multi-speaker speech synthesis without an external speaker verification model.
arXiv Detail & Related papers (2022-03-18T01:36:25Z)
- Curriculum optimization for low-resource speech recognition [4.803994937990389]
We propose an automated curriculum learning approach to optimize the sequence of training examples.
We introduce a new difficulty measure called compression ratio that can be used as a scoring function for raw audio in various noise conditions (a toy sketch follows this entry).
arXiv Detail & Related papers (2022-02-17T19:47:50Z)
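A toy version of the compression-ratio difficulty score from the entry above. The entry does not name a compressor, so zlib over 16-bit PCM is an assumption:

```python
# Difficulty scoring by compressibility: noise-like audio compresses poorly,
# so a higher ratio marks a "harder" training example.
import zlib
import numpy as np

def compression_ratio(waveform: np.ndarray) -> float:
    """waveform: float samples in [-1, 1]. Returns compressed/raw size."""
    pcm = (waveform * 32767).astype(np.int16).tobytes()
    return len(zlib.compress(pcm)) / len(pcm)

# Curriculum ordering: train on the most compressible (cleanest) audio first.
# dataset.sort(key=lambda ex: compression_ratio(ex.waveform))
```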
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech can be re-synthesized by feeding the symbols to the synthesis model (a pipeline sketch follows this entry).
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
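A pipeline-level sketch of the separation-by-resynthesis idea above. Both models are hypothetical placeholders, since the summary does not specify architectures:

```python
# Separation by re-synthesis: predict one discrete-unit sequence per target
# speaker from the mixture, then re-synthesize each speaker from its units.
def separate_by_resynthesis(mixture, unit_predictor, synthesizer, num_speakers=2):
    outputs = []
    for slot in range(num_speakers):
        units = unit_predictor(mixture, speaker_slot=slot)  # discrete symbols
        outputs.append(synthesizer(units))                  # clean waveform
    return outputs
```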
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets (a simplified sketch follows this entry).
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
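The target-switching idea in the wav2vec-Switch entry above can be sketched at the loss level. The contrastive loss here is a simplified stand-in for wav2vec 2.0's actual objective, and all shapes are assumptions:

```python
# Simplified wav2vec-Switch-style loss: each view (clean / noise-augmented)
# must predict not only its own quantized targets but also the other view's,
# encouraging noise-invariant representations.
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, temperature=0.1):
    """context, targets: (frames, dim). The positive for each frame is the
    target at the same index; all other frames serve as negatives."""
    logits = F.cosine_similarity(context.unsqueeze(1),
                                 targets.unsqueeze(0), dim=-1)  # (T, T)
    labels = torch.arange(context.size(0))
    return F.cross_entropy(logits / temperature, labels)

def switched_loss(c_clean, c_noisy, q_clean, q_noisy):
    # Original pairs plus the "switched" pairs from the other view.
    return (contrastive_loss(c_clean, q_clean) + contrastive_loss(c_noisy, q_noisy)
            + contrastive_loss(c_clean, q_noisy) + contrastive_loss(c_noisy, q_clean))
```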
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.