Self-Supervised Learning for Speech Enhancement through Synthesis
- URL: http://arxiv.org/abs/2211.02542v1
- Date: Fri, 4 Nov 2022 16:06:56 GMT
- Title: Self-Supervised Learning for Speech Enhancement through Synthesis
- Authors: Bryce Irvin, Marko Stamenovic, Mikolaj Kegler, Li-Chia Yang
- Abstract summary: We propose a denoising vocoder (DeVo) approach, where a vocoder accepts noisy representations and learns to directly synthesize clean speech.
We demonstrate a causal version capable of running on streaming audio with 10ms latency and minimal performance degradation.
- Score: 5.924928860260821
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern speech enhancement (SE) networks typically implement noise suppression
through time-frequency masking, latent representation masking, or
discriminative signal prediction. In contrast, some recent works explore SE via
generative speech synthesis, where the system's output is synthesized by a
neural vocoder after an inherently lossy feature-denoising step. In this paper,
we propose a denoising vocoder (DeVo) approach, where a vocoder accepts noisy
representations and learns to directly synthesize clean speech. We leverage
rich representations from self-supervised learning (SSL) speech models to
discover relevant features. We conduct a candidate search across 15 potential
SSL front-ends and subsequently train our vocoder adversarially with the best
SSL configuration. Additionally, we demonstrate a causal version capable of
running on streaming audio with 10ms latency and minimal performance
degradation. Finally, we conduct both objective evaluations and subjective
listening studies to show our system improves objective metrics and outperforms
an existing state-of-the-art SE model subjectively.
Related papers
- Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech Inpainting [14.402357651227003]
We investigate the use of a speech SSL model for speech inpainting, that is reconstructing a missing portion of a speech signal from its surrounding context.
To that purpose, we combine an SSL encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, playing the role of a decoder.
arXiv Detail & Related papers (2024-05-30T14:41:39Z) - On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models [15.068637971987224]
We explore the latent space of frozen TTS models, which is composed of the latent bottleneck activations of the DDM's denoiser.
We identify that this space contains rich semantic information, and outline several novel methods for finding semantic directions within it, both supervised and unsupervised.
We demonstrate how these enable off-the-shelf audio editing, without any further training, architectural changes or data requirements.
arXiv Detail & Related papers (2024-02-19T16:22:21Z) - AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised Features for Audio-Visual Speech Enhancement [18.193191170754744]
We introduce AV2Wav, a re-synthesis-based audio-visual speech enhancement approach.
We use continuous rather than discrete representations to retain prosody and speaker information.
Our approach outperforms a masking-based baseline in terms of both automatic metrics and a human listening test.
arXiv Detail & Related papers (2023-09-14T21:07:53Z) - BEATs: Audio Pre-Training with Acoustic Tokenizers [77.8510930885778]
Self-supervised learning (SSL) has been witnessed in language, vision, speech, and audio domains over the past few years.
We propose BEATs, an iterative audio pre-training framework to learn Bidirectional representation from Audio Transformers.
In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask and label prediction manner.
Then, we train an acoustic tokenizer for the next iteration by distilling the semantic knowledge from the pre-trained or fine-tuned audio SSL model.
arXiv Detail & Related papers (2022-12-18T10:41:55Z) - Disentangled Feature Learning for Real-Time Neural Speech Coding [24.751813940000993]
In this paper, instead of blind end-to-end learning, we propose to learn disentangled features for real-time neural speech coding.
We find that the learned disentangled features show comparable performance on any-to-any voice conversion with modern self-supervised speech representation learning models.
arXiv Detail & Related papers (2022-11-22T02:50:12Z) - LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z) - SLICER: Learning universal audio representations using low-resource
self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z) - Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs
for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z) - DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain which iteratively converts the noise into mel-spectrogram conditioned on the music score.
The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work with a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z) - End-to-End Video-To-Speech Synthesis using Generative Adversarial
Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs)
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z) - Adversarial Feature Learning and Unsupervised Clustering based Speech
Synthesis for Found Data with Acoustic and Textual Noise [18.135965605011105]
Attention-based sequence-to-sequence (seq2seq) speech synthesis has achieved extraordinary performance.
A studio-quality corpus with manual transcription is necessary to train such seq2seq systems.
We propose an approach to build high-quality and stable seq2seq based speech synthesis system using challenging found data.
arXiv Detail & Related papers (2020-04-28T15:32:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.