Robust Automatic Speech Recognition via WavAugment Guided Phoneme
Adversarial Training
- URL: http://arxiv.org/abs/2307.12498v1
- Date: Mon, 24 Jul 2023 03:07:40 GMT
- Title: Robust Automatic Speech Recognition via WavAugment Guided Phoneme
Adversarial Training
- Authors: Gege Qi, Yuefeng Chen, Xiaofeng Mao, Xiaojun Jia, Ranjie Duan, Rong
Zhang, Hui Xue
- Abstract summary: We propose a novel WavAugment Guided Phoneme Adrial Training (wapat)
wapat use adversarial examples in phoneme space as augmentation to make the model invariant to minor fluctuations in phoneme representation.
In addition, wapat utilizes the phoneme representation of augmented samples to guide the generation of adversaries, which helps to find more stable and diverse gradient-directions.
- Score: 20.33516009339207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Developing a practically-robust automatic speech recognition (ASR) is
challenging since the model should not only maintain the original performance
on clean samples, but also achieve consistent efficacy under small volume
perturbations and large domain shifts. To address this problem, we propose a
novel WavAugment Guided Phoneme Adversarial Training (wapat). wapat use
adversarial examples in phoneme space as augmentation to make the model
invariant to minor fluctuations in phoneme representation and preserve the
performance on clean samples. In addition, wapat utilizes the phoneme
representation of augmented samples to guide the generation of adversaries,
which helps to find more stable and diverse gradient-directions, resulting in
improved generalization. Extensive experiments demonstrate the effectiveness of
wapat on End-to-end Speech Challenge Benchmark (ESB). Notably, SpeechLM-wapat
outperforms the original model by 6.28% WER reduction on ESB, achieving the new
state-of-the-art.
Related papers
- DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in EER metric on CN-Celeb evaluation set.
arXiv Detail & Related papers (2023-10-18T17:07:05Z) - High-Quality Automatic Voice Over with Accurate Alignment: Supervision
through Self-Supervised Discrete Speech Units [69.06657692891447]
We propose a novel AVO method leveraging the learning objective of self-supervised discrete speech unit prediction.
Experimental results show that our proposed method achieves remarkable lip-speech synchronization and high speech quality.
arXiv Detail & Related papers (2023-06-29T15:02:22Z) - DiffVoice: Text-to-Speech with Latent Diffusion [18.150627638754923]
We present DiffVoice, a novel text-to-speech model based on latent diffusion.
Subjective evaluations on LJSpeech and LibriTTS datasets demonstrate that our method beats the best publicly available systems in naturalness.
arXiv Detail & Related papers (2023-04-23T21:05:33Z) - Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition [66.94463981654216]
We propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive Visual Speech Recognition (VSR)
We finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters.
The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases.
arXiv Detail & Related papers (2023-02-16T06:01:31Z) - Representative Subset Selection for Efficient Fine-Tuning in
Self-Supervised Speech Recognition [6.450618373898492]
We consider the task of identifying an optimal subset of data for efficient fine-tuning in self-supervised speech models for ASR.
We present the COWERAGE algorithm for representative subset selection in self-supervised ASR.
arXiv Detail & Related papers (2022-03-18T10:12:24Z) - Data Augmentation based Consistency Contrastive Pre-training for
Automatic Speech Recognition [18.303072203996347]
Self-supervised acoustic pre-training has achieved amazing results on the automatic speech recognition (ASR) task.
Most of the successful acoustic pre-training methods use contrastive learning to learn the acoustic representations.
In this letter, we design a novel consistency contrastive learning (CCL) method by utilizing data augmentation for acoustic pre-training.
arXiv Detail & Related papers (2021-12-23T13:23:17Z) - A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) model that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z) - Characterizing Speech Adversarial Examples Using Self-Attention U-Net
Enhancement [102.48582597586233]
We present a U-Net based attention model, U-Net$_At$, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z) - Generating diverse and natural text-to-speech samples using a quantized
fine-grained VAE and auto-regressive prosody prior [53.69310441063162]
This paper proposes a sequential prior in a discrete latent space which can generate more naturally sounding samples.
We evaluate the approach using listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes.
arXiv Detail & Related papers (2020-02-06T12:35:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.