Fine-tuning of Pre-trained End-to-end Speech Recognition with Generative
Adversarial Networks
- URL: http://arxiv.org/abs/2103.13329v1
- Date: Wed, 10 Mar 2021 17:40:48 GMT
- Title: Fine-tuning of Pre-trained End-to-end Speech Recognition with Generative
Adversarial Networks
- Authors: Md Akmal Haidar and Mehdi Rezagholizadeh
- Abstract summary: Adversarial training of end-to-end (E2E) ASR systems using generative adversarial networks (GAN) has recently been explored.
We introduce a novel framework for fine-tuning a pre-trained ASR model using the GAN objective.
Our proposed approach outperforms baselines and conventional GAN-based adversarial models.
- Score: 10.723935272906461
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Adversarial training of end-to-end (E2E) ASR systems using generative
adversarial networks (GAN) has recently been explored for low-resource ASR
corpora. GANs help to learn the true data representation through a two-player
min-max game. However, training an E2E ASR model using a large ASR corpus with
a GAN framework has never been explored, because it might take an excessively
long time due to high-variance gradient updates and face convergence issues. In this
paper, we introduce a novel framework for fine-tuning a pre-trained ASR model
using the GAN objective where the ASR model acts as a generator and a
discriminator tries to distinguish the ASR output from the real data. Since the
ASR model is pre-trained, we hypothesize that the ASR model output (soft
distribution vectors) helps to get higher scores from the discriminator and
makes the task of the discriminator harder within our GAN framework, which in
turn improves the performance of the ASR model in the fine-tuning stage. Here,
the pre-trained ASR model is fine-tuned adversarially against the discriminator
using an additional adversarial loss. Experiments on the full LibriSpeech dataset
show that our proposed approach outperforms baselines and conventional
GAN-based adversarial models.
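As a rough illustration of the setup the abstract describes, the sketch below combines an ASR cross-entropy loss with an additional adversarial loss, treating the ASR model's soft distribution vectors as the generator output and one-hot transcripts as the real data. The abstract does not give the exact objective, so everything here is an assumption: the standard GAN log-loss, the toy linear discriminator, and the names `discriminator_score`, `generator_loss`, `discriminator_loss`, and `adv_weight` are hypothetical and not taken from the paper.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one frame of vocabulary logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def discriminator_score(frames, weights, bias):
    # Hypothetical toy discriminator (an assumption, not the paper's model):
    # mean-pool the per-token distributions, then apply one logistic unit.
    # Returns the estimated probability that the sequence is real data.
    vocab = len(weights)
    pooled = [sum(f[v] for f in frames) / len(frames) for v in range(vocab)]
    return sigmoid(sum(w * p for w, p in zip(weights, pooled)) + bias)

def cross_entropy(soft_frames, target_ids):
    # Average negative log-likelihood of the reference tokens.
    return -sum(math.log(f[t]) for f, t in zip(soft_frames, target_ids)) / len(target_ids)

def generator_loss(logits_seq, target_ids, weights, bias, adv_weight=0.1):
    # ASR loss plus a weighted adversarial term. The adversarial term is the
    # non-saturating GAN generator loss, -log D(G(x)); adv_weight is an
    # assumed hyperparameter.
    soft = [softmax(l) for l in logits_seq]
    asr_loss = cross_entropy(soft, target_ids)
    adv_loss = -math.log(discriminator_score(soft, weights, bias))
    return asr_loss + adv_weight * adv_loss

def discriminator_loss(logits_seq, target_ids, weights, bias):
    # Standard GAN discriminator loss: real = one-hot transcripts,
    # fake = soft distribution vectors produced by the ASR model.
    vocab = len(weights)
    soft = [softmax(l) for l in logits_seq]
    real = [one_hot(t, vocab) for t in target_ids]
    d_real = discriminator_score(real, weights, bias)
    d_fake = discriminator_score(soft, weights, bias)
    return -math.log(d_real) - math.log(1.0 - d_fake)
```

In a fine-tuning loop under these assumptions, the two losses would be minimized in alternation: `discriminator_loss` with respect to the discriminator parameters, and `generator_loss` with respect to the pre-trained ASR model's parameters.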
Related papers
- Transferable Adversarial Attacks against ASR [43.766547483367795]
We study the vulnerability of practical black-box attacks in cutting-edge automatic speech recognition models.
We propose a speech-aware gradient optimization approach (SAGO) for ASR, which forces mistranscription with minimal impact on human imperceptibility.
Our comprehensive experimental results reveal performance enhancements compared to baseline approaches across five models on two databases.
arXiv Detail & Related papers (2024-11-14T06:32:31Z)
- Crossmodal ASR Error Correction with Discrete Speech Units [16.58209270191005]
We propose a post-ASR processing approach for ASR Error Correction (AEC).
We explore pre-training and fine-tuning strategies and uncover an ASR domain discrepancy phenomenon.
We propose the incorporation of discrete speech units to align with and enhance the word embeddings for improving AEC quality.
arXiv Detail & Related papers (2024-05-26T19:58:38Z)
- DifAugGAN: A Practical Diffusion-style Data Augmentation for GAN-based
Single Image Super-resolution [88.13972071356422]
We propose a diffusion-style data augmentation scheme for GAN-based image super-resolution (SR) methods, known as DifAugGAN.
It involves adapting the diffusion process in generative diffusion models for improving the calibration of the discriminator during training.
Our DifAugGAN can be a Plug-and-Play strategy for current GAN-based SISR methods to improve the calibration of the discriminator and thus improve SR performance.
arXiv Detail & Related papers (2023-11-30T12:37:53Z)
- End-to-End Speech Recognition: A Survey [68.35707678386949]
The goal of this survey is to provide a taxonomy of E2E ASR models and corresponding improvements.
All relevant aspects of E2E ASR are covered in this work, accompanied by discussions of performance and deployment opportunities.
arXiv Detail & Related papers (2023-03-03T01:46:41Z)
- Watch What You Pretrain For: Targeted, Transferable Adversarial Examples
on Self-Supervised Speech Recognition models [27.414693266500603]
A targeted adversarial attack produces audio samples that can force an Automatic Speech Recognition system to output attacker-chosen text.
Recent work has shown that transferability against large ASR models is very difficult.
We show that modern ASR architectures, specifically ones based on Self-Supervised Learning, are in fact vulnerable to transferability.
arXiv Detail & Related papers (2022-09-17T15:01:26Z)
- Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
arXiv Detail & Related papers (2021-11-16T03:00:29Z)
- Self-Damaging Contrastive Learning [92.34124578823977]
Unlabeled data in reality is commonly imbalanced and shows a long-tail distribution.
This paper proposes a principled framework called Self-Damaging Contrastive Learning to automatically balance the representation learning without knowing the classes.
Our experiments show that SDCLR significantly improves not only overall accuracies but also balancedness.
arXiv Detail & Related papers (2021-06-06T00:04:49Z)
- Improving RNN Transducer Based ASR with Auxiliary Tasks [21.60022481898402]
End-to-end automatic speech recognition (ASR) models with a single neural network have recently demonstrated state-of-the-art results.
In this work, we examine ways in which recurrent neural network transducer (RNN-T) can achieve better ASR accuracy via performing auxiliary tasks.
arXiv Detail & Related papers (2020-11-05T21:46:32Z)
- Dual-mode ASR: Unify and Improve Streaming ASR with Full-context
Modeling [76.43479696760996]
We propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition.
We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training of full-context ASR.
arXiv Detail & Related papers (2020-10-12T21:12:56Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and the following LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)