Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition
- URL: http://arxiv.org/abs/2110.04934v1
- Date: Mon, 11 Oct 2021 00:08:48 GMT
- Title: Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition
- Authors: Yiming Wang, Jinyu Li, Heming Wang, Yao Qian, Chengyi Wang, Yu Wu
- Abstract summary: We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech so that each serves as an additional prediction target for the other.
- Score: 52.71604809100364
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of self-supervised learning (SSL) for automatic speech recognition
(ASR) is to learn good speech representations from a large amount of unlabeled
speech for the downstream ASR task. However, most SSL frameworks do not
consider noise robustness, which is crucial for real-world applications. In this
paper we propose wav2vec-Switch, a method to encode noise robustness into
contextualized representations of speech via contrastive learning.
Specifically, we feed original-noisy speech pairs simultaneously into the
wav2vec 2.0 network. In addition to the existing contrastive learning task, we
switch the quantized representations of the original and noisy speech as
additional prediction targets of each other. Doing so forces the network to
make consistent predictions for the original and noisy speech, and thus to
learn noise-robust contextualized representations. Our
experiments on synthesized and real noisy data show the effectiveness of our
method: it achieves a 2.9--4.9% relative word error rate (WER) reduction on
synthesized noisy LibriSpeech data with no degradation on the original data,
and a 5.7% reduction on real 1-channel noisy CHiME-4 data over a data
augmentation baseline, even when a strong language model is used for decoding.
Our results on CHiME-4
can match or even surpass those with well-designed speech enhancement
components.
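As a rough illustration of the switching idea, here is a minimal PyTorch-style sketch. The encoder, quantizer, and shapes are toy stand-ins, not the authors' implementation (wav2vec 2.0 actually uses a CNN feature encoder, a Transformer context network, and Gumbel-softmax quantization), but the four loss terms mirror the original-plus-switched targets described above.

```python
# Toy sketch of the wav2vec-Switch target-switching objective.
# All modules and shapes here are illustrative stand-ins.
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, temperature=0.1):
    """InfoNCE over time steps: each context vector must pick out
    its own quantized target among the other steps as distractors."""
    context = F.normalize(context, dim=-1)        # (T, D)
    targets = F.normalize(targets, dim=-1)        # (T, D)
    logits = context @ targets.t() / temperature  # (T, T)
    labels = torch.arange(context.size(0))        # positive is the diagonal
    return F.cross_entropy(logits, labels)

T, D = 50, 256
encoder = torch.nn.GRU(D, D, batch_first=True)   # stand-in context network

feats_clean = torch.randn(1, T, D)                       # original speech features
feats_noisy = feats_clean + 0.1 * torch.randn(1, T, D)   # same utterance + noise

ctx_clean, _ = encoder(feats_clean)
ctx_noisy, _ = encoder(feats_noisy)
q_clean = feats_clean.squeeze(0).detach()        # stand-in quantized targets
q_noisy = feats_noisy.squeeze(0).detach()

# Standard wav2vec 2.0-style terms: each view predicts its own targets.
loss = contrastive_loss(ctx_clean.squeeze(0), q_clean) \
     + contrastive_loss(ctx_noisy.squeeze(0), q_noisy)
# Switched terms: each view must also predict the *other* view's targets,
# which pushes the network toward noise-invariant predictions.
loss = loss + contrastive_loss(ctx_clean.squeeze(0), q_noisy) \
            + contrastive_loss(ctx_noisy.squeeze(0), q_clean)
loss.backward()
```

Because the same quantized targets must be predicted from both the clean and noisy views, representations that vary with the noise are penalized.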
Related papers
- Speech collage: code-switched audio generation by collaging monolingual corpora [50.356820349870986]
Speech Collage is a method that synthesizes code-switched (CS) data from monolingual corpora by splicing audio segments.
We investigate the impact of the generated data on speech recognition in two scenarios; a toy splicing sketch follows this entry.
arXiv Detail & Related papers (2023-09-27T14:17:53Z)
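A toy sketch of the splicing idea, assuming hypothetical word-to-waveform lookups and a simple linear crossfade; none of this is the paper's actual pipeline.

```python
# Toy collage-style code-switched data generation: splice audio segments
# for words drawn from two monolingual corpora. The segment dictionaries
# and crossfade are illustrative assumptions, not the paper's method.
import numpy as np

sr = 16000
# Hypothetical word -> waveform lookups built from aligned monolingual data.
english = {"hello": np.random.randn(sr // 2), "world": np.random.randn(sr // 2)}
spanish = {"hola": np.random.randn(sr // 2), "mundo": np.random.randn(sr // 2)}

def splice(segments, overlap=160):
    """Concatenate word segments with a short linear crossfade to
    soften the splice points."""
    out = segments[0].copy()
    ramp = np.linspace(0.0, 1.0, overlap)
    for seg in segments[1:]:
        out[-overlap:] = out[-overlap:] * (1 - ramp) + seg[:overlap] * ramp
        out = np.concatenate([out, seg[overlap:]])
    return out

# A code-switched utterance: "hello mundo" spliced from two corpora.
cs_audio = splice([english["hello"], spanish["mundo"]])
print(cs_audio.shape[0], "samples ~", cs_audio.shape[0] / sr, "s")
```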
- AV2Wav: Diffusion-Based Re-synthesis from Continuous Self-supervised Features for Audio-Visual Speech Enhancement [18.193191170754744]
We introduce AV2Wav, a re-synthesis-based audio-visual speech enhancement approach.
We use continuous rather than discrete representations to retain prosody and speaker information.
Our approach outperforms a masking-based baseline in terms of both automatic metrics and a human listening test.
arXiv Detail & Related papers (2023-09-14T21:07:53Z)
- Improving the Gap in Visual Speech Recognition Between Normal and Silent Speech Based on Metric Learning [11.50011780498048]
This paper presents a novel metric learning approach to address the performance gap between normal and silent speech in visual speech recognition (VSR).
We propose to leverage the shared literal content between normal and silent speech and present a metric learning approach based on visemes.
Our evaluation demonstrates that our method improves the accuracy of silent VSR, even when limited training data is available.
arXiv Detail & Related papers (2023-05-23T16:20:46Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Dual-Path Style Learning for End-to-End Noise-Robust Speech Recognition [26.77806246793544]
Speech enhancement (SE) is introduced as a front-end to reduce noise for ASR, but it also suppresses some important speech information.
We propose a dual-path style learning approach for end-to-end noise-robust speech recognition (DPSL-ASR).
Experiments show that the proposed approach achieves relative word error rate (WER) reductions of 10.6% and 8.6% over the best IFF-Net baseline.
arXiv Detail & Related papers (2022-03-28T15:21:57Z)
- Self-supervised Learning with Random-projection Quantizer for Speech Recognition [51.24368930992091]
We present a simple and effective self-supervised learning approach for speech recognition.
The approach learns a model to predict masked speech signals, in the form of discrete labels.
It achieves word error rates similar to previous self-supervised approaches with non-streaming models; a toy quantizer sketch follows this entry.
arXiv Detail & Related papers (2022-02-03T21:29:04Z)
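A minimal sketch of the random-projection quantizer idea: a frozen random projection plus a frozen random codebook turn each speech frame into a discrete label, and the model is trained to predict those labels at masked frames. The dimensions and names are assumptions, not the paper's exact configuration.

```python
# Toy random-projection quantizer for masked-prediction pretraining.
import torch

torch.manual_seed(0)
T, D_in, D_proj, V = 100, 80, 16, 512   # frames, feat dim, proj dim, codebook size

proj = torch.randn(D_in, D_proj)        # frozen: never updated during training
codebook = torch.randn(V, D_proj)       # frozen random codebook

feats = torch.randn(T, D_in)            # e.g. log-mel features of an utterance
z = feats @ proj                        # (T, D_proj) random projection
# Nearest codebook entry per frame -> discrete target label.
dists = torch.cdist(z, codebook)        # (T, V) Euclidean distances
labels = dists.argmin(dim=-1)           # (T,) pseudo-labels in [0, V)

# A masked-prediction model would now be trained with cross-entropy to
# predict `labels` at masked frames; the quantizer itself stays frozen.
print(labels[:10])
```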
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition [54.84624870942339]
MixSpeech is a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR).
We apply MixSpeech to two popular end-to-end speech recognition models, LAS (Listen, Attend and Spell) and Transformer.
Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation; a toy mixup sketch follows this entry.
arXiv Detail & Related papers (2021-02-25T03:40:43Z)
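A minimal sketch of mixup-style augmentation for ASR inputs: mix two utterances' features with a weight drawn from a Beta prior and weight the two transcripts' losses by the same coefficient. The model and frame-level targets are stand-ins (a real system would use a CTC or attention loss over token sequences); all names and shapes here are assumptions.

```python
# Toy MixSpeech-style mixup for ASR features.
import numpy as np
import torch
import torch.nn.functional as F

lam = np.random.beta(0.5, 0.5)          # mixing weight from a Beta prior

feats_a = torch.randn(1, 120, 80)       # (batch, frames, mel bins), utterance A
feats_b = torch.randn(1, 120, 80)       # utterance B, padded to the same length
mixed = lam * feats_a + (1 - lam) * feats_b

model = torch.nn.Linear(80, 30)         # stand-in for LAS / Transformer ASR
log_probs = model(mixed).log_softmax(-1)

def asr_loss(log_probs, transcript):
    """Stand-in per-frame loss; a real system would score the token
    sequence with CTC or attention-based cross-entropy."""
    return F.nll_loss(log_probs.view(-1, 30), transcript.view(-1))

tokens_a = torch.randint(0, 30, (1, 120))   # fake frame-level targets for A
tokens_b = torch.randint(0, 30, (1, 120))   # fake frame-level targets for B
# Losses against both transcripts, weighted by the mixing coefficient.
loss = lam * asr_loss(log_probs, tokens_a) \
     + (1 - lam) * asr_loss(log_probs, tokens_b)
loss.backward()
```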
- Adversarial Feature Learning and Unsupervised Clustering based Speech Synthesis for Found Data with Acoustic and Textual Noise [18.135965605011105]
Attention-based sequence-to-sequence (seq2seq) speech synthesis has achieved extraordinary performance.
A studio-quality corpus with manual transcription is necessary to train such seq2seq systems.
We propose an approach to build high-quality and stable seq2seq based speech synthesis system using challenging found data.
arXiv Detail & Related papers (2020-04-28T15:32:45Z)