Supervised Contrastive Learning for Accented Speech Recognition
- URL: http://arxiv.org/abs/2107.00921v1
- Date: Fri, 2 Jul 2021 09:23:33 GMT
- Title: Supervised Contrastive Learning for Accented Speech Recognition
- Authors: Tao Han, Hantao Huang, Ziang Yang, Wei Han
- Abstract summary: We study the supervised contrastive learning framework for accented speech recognition.
We show that contrastive learning can improve accuracy by 3.66% (zero-shot) and 3.78% (full-shot) on average.
- Score: 7.5253263976291676
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural network based speech recognition systems suffer from performance
degradation due to accented speech, especially unfamiliar accents. In this
paper, we study the supervised contrastive learning framework for accented
speech recognition. To build different views (similar "positive" data
samples) for contrastive learning, three data augmentation techniques are
investigated: noise injection, spectrogram augmentation, and
TTS-same-sentence generation. From experiments on the Common Voice dataset,
we show that contrastive learning helps to build data-augmentation-invariant
and pronunciation-invariant representations, which significantly outperform
traditional joint training methods in both zero-shot and full-shot settings.
Experiments show that contrastive learning can improve accuracy by 3.66%
(zero-shot) and 3.78% (full-shot) on average, compared to the joint training
method.
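Since the paper centers on the supervised contrastive framework, a minimal PyTorch sketch of the supervised contrastive (SupCon) loss may help. The temperature, projection dimension, and the toy batch of two augmented views per utterance are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """SupCon loss: pull together samples sharing a label (here, augmented
    views of the same utterance), push apart all others.
    embeddings: (N, D) projected features; labels: (N,) group ids."""
    z = F.normalize(embeddings, dim=1)                  # unit-norm features
    sim = z @ z.T / temperature                         # (N, N) logits
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, -1e9)              # drop self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    # mean log-probability over each anchor's positives
    loss = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss.mean()

# toy batch: two augmented views (e.g. noise-injected and SpecAugmented)
# of each of four utterances
feats = torch.randn(8, 128)
view_groups = torch.arange(4).repeat_interleave(2)
print(supervised_contrastive_loss(feats, view_groups).item())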
Related papers
- Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis [30.97784092953007]
This paper investigates the use of unsupervised text-to-speech synthesis (TTS) as a data augmentation method to improve accented speech recognition.
TTS systems are trained with a small amount of accented speech training data and their pseudo-labels rather than manual transcriptions.
This approach enables the use of accented speech data without manual transcriptions to perform data augmentation for accented speech recognition.
arXiv Detail & Related papers (2024-07-04T16:42:24Z)
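The pseudo-label-driven augmentation loop described in this entry can be summarized in a short sketch; `asr_model.transcribe` and `tts_model.synthesize` are hypothetical placeholder interfaces, not the paper's actual systems.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    audio: object        # raw accented speech, no manual transcription
    text: str = ""       # to be filled with a pseudo-label

def pseudo_label(asr_model, utterances):
    """Step 1: transcribe untranscribed accented audio with a seed ASR model
    (asr_model.transcribe is a hypothetical interface)."""
    for utt in utterances:
        utt.text = asr_model.transcribe(utt.audio)
    return utterances

# Step 2 (not shown): train a TTS model on the pseudo-labeled
# (text, accented audio) pairs, so it learns the accent with no
# manual transcriptions.

def synthesize_augmentations(tts_model, training_texts):
    """Step 3: generate accented speech for arbitrary ASR training text
    (tts_model.synthesize is likewise hypothetical)."""
    return [Utterance(audio=tts_model.synthesize(t), text=t)
            for t in training_texts]
```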
- Fuse after Align: Improving Face-Voice Association Learning via Multimodal Encoder [22.836016610542387]
This paper introduces a novel framework within an unsupervised setting for learning voice-face associations.
By employing a multimodal encoder after contrastive learning and addressing the problem through binary classification, we can learn the implicit information within the embeddings in a more effective and varied manner.
Empirical evidence demonstrates that our framework achieves state-of-the-art results in voice-face matching, verification, and retrieval tasks.
arXiv Detail & Related papers (2024-04-15T07:05:14Z)
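A minimal sketch of the fuse-then-classify idea above, assuming pre-aligned (contrastively trained) voice and face embeddings of equal dimension; the encoder depth, dimensions, and pooling are illustrative guesses, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FuseThenClassify(nn.Module):
    """Fuse pre-aligned voice/face embeddings with a small multimodal
    encoder, then score match / no-match as binary classification."""
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.classifier = nn.Linear(dim, 1)

    def forward(self, voice_emb, face_emb):
        # treat the two modality embeddings as a two-token sequence
        tokens = torch.stack([voice_emb, face_emb], dim=1)   # (B, 2, dim)
        fused = self.encoder(tokens).mean(dim=1)             # (B, dim)
        return self.classifier(fused).squeeze(-1)            # match logits

logits = FuseThenClassify()(torch.randn(8, 256), torch.randn(8, 256))
```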
- Accented Speech Recognition With Accent-specific Codebooks [53.288874858671576]
Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems.
Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR.
We propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks.
arXiv Detail & Related papers (2023-10-24T16:10:58Z)
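A minimal sketch of cross-attention over a trainable accent codebook, as the entry above describes; the codebook size, residual fusion, and dimensions are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class AccentCodebookAttention(nn.Module):
    """Each ASR encoder frame cross-attends to a trainable set of accent
    codebook vectors and absorbs the result via a residual connection."""
    def __init__(self, dim=256, codebook_size=32, heads=4):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, encoder_states):                    # (B, T, dim)
        codes = self.codebook.unsqueeze(0).expand(
            encoder_states.size(0), -1, -1)
        accent_info, _ = self.attn(encoder_states, codes, codes)
        return encoder_states + accent_info               # residual fusion

out = AccentCodebookAttention()(torch.randn(2, 50, 256))
```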
- Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training [102.18680666349806]
We propose a speed co-augmentation method that randomly changes the playback speeds of both audio and video data.
Experimental results show that the proposed method significantly improves the learned representations when compared to vanilla audio-visual contrastive learning.
arXiv Detail & Related papers (2023-09-25T08:22:30Z)
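A rough sketch of the speed co-augmentation above under simplifying assumptions: playback speeds are drawn independently per modality, audio is resampled by naive linear interpolation, and video is frame-subsampled; the paper's exact augmentation and any additional training signals are not reproduced.

```python
import random
import torch
import torch.nn.functional as F

def speed_co_augment(audio, video, speeds=(0.5, 1.0, 1.5, 2.0)):
    """audio: (1, samples); video: (frames, C, H, W). Each modality gets an
    independently drawn playback speed."""
    a_speed, v_speed = random.choice(speeds), random.choice(speeds)
    # audio: naive resampling to new length = old_length / speed
    new_len = max(1, int(audio.size(-1) / a_speed))
    audio = F.interpolate(audio.unsqueeze(0), size=new_len,
                          mode="linear", align_corners=False).squeeze(0)
    # video: keep every v_speed-th frame
    idx = torch.arange(0, video.size(0), v_speed).long()
    return audio, video[idx]

fast_audio, fast_video = speed_co_augment(torch.randn(1, 16000),
                                          torch.randn(24, 3, 96, 96))
```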
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
Our design is asymmetric with respect to the two modalities' pipelines, driven by the inherent differences between video and audio.
RAVEn surpasses all self-supervised methods on visual speech recognition.
arXiv Detail & Related papers (2022-12-12T21:04:06Z)
- On monoaural speech enhancement for automatic recognition of real noisy speech using mixture invariant training [33.79711018198589]
We extend the existing mixture invariant training criterion to exploit both unpaired clean speech and real noisy data.
It is found that unpaired clean speech is crucial for improving the quality of speech separated from real noisy speech.
The proposed method also performs remixing of processed and unprocessed signals to alleviate the processing artifacts.
arXiv Detail & Related papers (2022-05-03T19:37:58Z)
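For reference, a sketch of the baseline mixture invariant training (MixIT) criterion that the entry above extends: the separator's outputs for a mixture of two mixtures are assigned to the inputs under the best of all binary assignments. The MSE objective and shapes are simplifications; the unpaired-clean-speech extension and remixing step are omitted.

```python
import itertools
import torch

def mixit_loss(est_sources, mix1, mix2):
    """est_sources: (B, S, T) separator outputs for the input mix1 + mix2;
    mix1, mix2: (B, T). Each estimated source is assigned to one of the two
    mixtures; the best assignment's reconstruction error is kept."""
    B, S, T = est_sources.shape
    best = None
    for assign in itertools.product([0.0, 1.0], repeat=S):
        a = torch.tensor(assign, device=est_sources.device).view(1, S, 1)
        est1 = (est_sources * (1 - a)).sum(dim=1)   # sources sent to mix1
        est2 = (est_sources * a).sum(dim=1)         # sources sent to mix2
        err = ((est1 - mix1) ** 2).mean(-1) + ((est2 - mix2) ** 2).mean(-1)
        best = err if best is None else torch.minimum(best, err)
    return best.mean()

loss = mixit_loss(torch.randn(2, 4, 8000),
                  torch.randn(2, 8000), torch.randn(2, 8000))
```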
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Curriculum optimization for low-resource speech recognition [4.803994937990389]
We propose an automated curriculum learning approach to optimize the sequence of training examples.
We introduce a new difficulty measure called compression ratio that can be used as a scoring function for raw audio in various noise conditions.
arXiv Detail & Related papers (2022-02-17T19:47:50Z)
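The compression-ratio difficulty measure above lends itself to a compact sketch; the choice of zlib and 16-bit PCM quantization are assumptions here, as the paper's exact compressor is not specified in the summary.

```python
import zlib
import numpy as np

def compression_ratio(waveform: np.ndarray) -> float:
    """Difficulty score for curriculum ordering: clean, simple audio
    compresses well (low ratio); noisy audio compresses poorly."""
    pcm = (np.clip(waveform, -1.0, 1.0) * 32767).astype(np.int16).tobytes()
    return len(zlib.compress(pcm)) / len(pcm)

# order training utterances from easy to hard
utterances = [np.random.randn(16000) * 0.1 for _ in range(4)]
curriculum = sorted(utterances, key=compression_ratio)
```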
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
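A simplified sketch of the target-switching idea above: each view's contextualized representations must predict not only its own quantized targets but also the other view's. Drawing negatives from other time steps of the same utterance is a simplification of the full wav2vec 2.0 machinery.

```python
import torch
import torch.nn.functional as F

def contrastive(ctx, targets, temperature=0.1):
    """Per-frame contrastive loss; other time steps of the same utterance
    serve as negatives (a simplification of wav2vec 2.0's sampling)."""
    ctx = F.normalize(ctx, dim=-1)        # (B, T, D) context vectors
    tgt = F.normalize(targets, dim=-1)    # (B, T, D) quantized targets
    logits = torch.einsum("btd,bsd->bts", ctx, tgt) / temperature
    labels = torch.arange(ctx.size(1), device=ctx.device).repeat(ctx.size(0))
    return F.cross_entropy(logits.flatten(0, 1), labels)

def wav2vec_switch_loss(ctx_orig, ctx_noisy, q_orig, q_noisy):
    """Each view predicts its own quantized targets AND the other view's
    (the 'switch'), encouraging noise-invariant representations."""
    own = contrastive(ctx_orig, q_orig) + contrastive(ctx_noisy, q_noisy)
    switched = contrastive(ctx_orig, q_noisy) + contrastive(ctx_noisy, q_orig)
    return own + switched

loss = wav2vec_switch_loss(*(torch.randn(2, 50, 256) for _ in range(4)))
```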
- UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data [54.733889961024445]
We propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data.
We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on the public CommonVoice corpus.
arXiv Detail & Related papers (2021-01-19T12:53:43Z)
- Self-Supervised Learning from Contrastive Mixtures for Personalized Speech Enhancement [19.645016575334786]
This work explores how self-supervised learning can be universally used to discover speaker-specific features.
We develop a simple contrastive learning procedure which treats the abundant noisy data as makeshift training targets.
arXiv Detail & Related papers (2020-11-06T15:21:00Z)
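A small sketch of how "makeshift" contrastive pairs could be built from noisy data alone, per the entry above: two views of the same user recording are created with different added noises, so the recording itself serves as the pseudo-target. The SNR-based mixing is an illustrative assumption.

```python
import torch

def make_contrastive_mixtures(recording, noise_a, noise_b, snr_db=5.0):
    """Build two views of the same (already imperfect) user recording by
    adding two different noise clips; the recording itself serves as the
    makeshift training target."""
    def mix(x, n):
        n = n[: x.size(-1)]
        # scale noise so the signal-to-noise ratio is roughly snr_db
        gain = (x.pow(2).mean()
                / (n.pow(2).mean() * 10 ** (snr_db / 10))).sqrt()
        return x + gain * n
    return mix(recording, noise_a), mix(recording, noise_b), recording

view1, view2, target = make_contrastive_mixtures(
    torch.randn(16000), torch.randn(16000), torch.randn(16000))
```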
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.