Don't Stop Self-Supervision: Accent Adaptation of Speech Representations
via Residual Adapters
- URL: http://arxiv.org/abs/2307.00453v1
- Date: Sun, 2 Jul 2023 02:21:29 GMT
- Title: Don't Stop Self-Supervision: Accent Adaptation of Speech Representations
via Residual Adapters
- Authors: Anshu Bhatia, Sanchit Sinha, Saket Dingliwal, Karthik Gopalakrishnan,
Sravan Bodapati, Katrin Kirchhoff
- Abstract summary: Speech representations learned in a self-supervised fashion from massive unlabeled speech corpora have been adapted successfully toward several downstream tasks.
We propose and investigate self-supervised adaptation of speech representations to non-native accented speaker populations in a parameter-efficient way via training accent-specific adapters.
We obtain strong word error rate reductions (WERR) over HuBERT-large for all 4 accents, with a mean WERR of 22.7% with accent-specific adapters and a mean WERR of 25.1% if the entire encoder is accent-adapted.
- Score: 14.645374377673148
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech representations learned in a self-supervised fashion from massive
unlabeled speech corpora have been adapted successfully toward several
downstream tasks. However, such representations may be skewed toward canonical
data characteristics of such corpora and perform poorly on atypical, non-native
accented speaker populations. With the state-of-the-art HuBERT model as a
baseline, we propose and investigate self-supervised adaptation of speech
representations to such populations in a parameter-efficient way via training
accent-specific residual adapters. We experiment with 4 accents and choose
automatic speech recognition (ASR) as the downstream task of interest. We
obtain strong word error rate reductions (WERR) over HuBERT-large for all 4
accents, with a mean WERR of 22.7% with accent-specific adapters and a mean
WERR of 25.1% if the entire encoder is accent-adapted. While our experiments
use HuBERT as the base model and ASR as the downstream task, our proposed
approach is both model- and task-agnostic.
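To make the adapter idea concrete, below is a minimal sketch of a residual (bottleneck) adapter block of the kind commonly used for parameter-efficient adaptation; the bottleneck width, layer names, and zero-initialization are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a residual (bottleneck) adapter, assuming a
# Houlsby-style design; dimensions and placement are illustrative,
# not taken from the paper.
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Small bottleneck module added after a frozen encoder sub-layer."""

    def __init__(self, d_model: int = 1024, bottleneck: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, bottleneck)  # project down
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)    # project back up
        # Zero-init the up-projection so the adapter starts as an identity
        # and training begins from the frozen model's behavior.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection: the adapter learns an accent-specific
        # correction on top of the frozen representation.
        return x + self.up(self.act(self.down(self.norm(x))))
```

During continued self-supervised training on accented speech, only the adapter parameters would be updated while the pretrained encoder stays frozen. The reported WERR is the relative reduction, (WER_baseline - WER_adapted) / WER_baseline.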
Related papers
- Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
The first is speaker-regularized spectral basis embedding (SBE) features, which exploit a special regularization term to enforce speaker-feature homogeneity during adaptation.
The second is feature-based learning hidden unit contributions (f-LHUC) conditioned on VR-LH features, which are shown to be insensitive to speaker-level data quantity in test-time adaptation.
arXiv Detail & Related papers (2024-07-08T18:20:24Z)
- Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition [1.0690007351232649]
We propose a layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require any prior knowledge about the target accent.
Experiment results demonstrate that our proposed methods outperform the baseline with relative reductions of 22.1% and 17.2% in character error rate (CER) across multi-accent test datasets.
arXiv Detail & Related papers (2024-07-03T11:35:52Z)
- USAT: A Universal Speaker-Adaptive Text-to-Speech Approach [11.022840133207788]
The challenge of synthesizing lifelike speech for unseen, out-of-dataset speakers remains significant and unresolved.
Zero-shot approaches suffer from insufficient generalization performance to reproduce the voice of speakers with heavy accents.
Few-shot methods can reproduce highly varying accents but bring a significant storage burden and the risk of overfitting and catastrophic forgetting.
Our proposed framework unifies both zero-shot and few-shot speaker adaptation strategies, which we term "instant" and "fine-grained" adaptation based on their merits.
arXiv Detail & Related papers (2024-04-28T06:50:55Z)
- Accented Speech Recognition With Accent-specific Codebooks [53.288874858671576]
Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems.
Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR.
We propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks (a toy sketch of this mechanism appears after this list).
arXiv Detail & Related papers (2023-10-24T16:10:58Z)
- Speaker Identity Preservation in Dysarthric Speech Reconstruction by Adversarial Speaker Adaptation [59.41186714127256]
Dysarthric speech reconstruction (DSR) aims to improve the quality of dysarthric speech.
A speaker encoder (SE) optimized for speaker verification has been explored to control speaker identity.
We propose a novel multi-task learning strategy, i.e., adversarial speaker adaptation (ASA).
arXiv Detail & Related papers (2022-02-18T08:59:36Z)
- Residual Adapters for Parameter-Efficient ASR Adaptation to Atypical and Accented Speech [5.960279280033886]
We show that by adding a relatively small number of extra parameters to the encoder layers via so-called residual adapters, we can achieve similar adaptation gains compared to model fine-tuning.
We demonstrate this on two speech adaptation tasks (atypical and accented speech) and for two state-of-the-art ASR architectures.
arXiv Detail & Related papers (2021-09-14T20:04:47Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high out-of-vocabulary (OOV) rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- Black-box Adaptation of ASR for Accented Speech [52.63060669715216]
We introduce the problem of adapting a black-box, cloud-based ASR system to speech from a target accent.
We propose a novel coupling of an open-source accent-tuned local model with the black-box service.
Our fine-grained merging algorithm is better at fixing accent errors than existing word-level combination strategies.
arXiv Detail & Related papers (2020-06-24T07:07:49Z)
- Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement [102.48582597586233]
We present a U-Net-based attention model, U-Net_At, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z)
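As referenced in the codebook entry above, here is a toy sketch of cross-attention over a trainable accent codebook; the codebook size, model width, and residual fusion are illustrative assumptions, not details from that paper.

```python
# Toy sketch of cross-attention over a trainable accent codebook,
# loosely following the idea in "Accented Speech Recognition With
# Accent-specific Codebooks"; all sizes and the residual fusion
# are illustrative assumptions.
import torch
import torch.nn as nn

class CodebookCrossAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_codes: int = 64, n_heads: int = 8):
        super().__init__()
        # Learnable accent-specific code vectors serve as keys and values.
        self.codebook = nn.Parameter(torch.randn(n_codes, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model) encoder states attend to the codebook.
        codes = self.codebook.unsqueeze(0).expand(x.size(0), -1, -1)
        attended, _ = self.attn(query=x, key=codes, value=codes)
        return x + attended  # residual fusion back into the encoder stream
```

One accent-specific instance of such a module (or adapter) could be trained per target accent, which matches the per-accent adaptation setting of the main paper.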