Related papers: Fine-tuning of Pre-trained End-to-end Speech Recognition with Generative Adversarial Networks

URL: http://arxiv.org/abs/2103.13329v1
Date: Wed, 10 Mar 2021 17:40:48 GMT
Title: Fine-tuning of Pre-trained End-to-end Speech Recognition with Generative Adversarial Networks
Authors: Md Akmal Haidar and Mehdi Rezagholizadeh
Abstract summary: Adversarial training of end-to-end (E2E) ASR systems using generative adversarial networks (GAN) has recently been explored. We introduce a novel framework for fine-tuning a pre-trained ASR model using the GAN objective. Our proposed approach outperforms baselines and conventional GAN-based adversarial models.
Score: 10.723935272906461
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Adversarial training of end-to-end (E2E) ASR systems using generative adversarial networks (GAN) has recently been explored for low-resource ASR corpora. GANs help to learn the true data representation through a two-player min-max game. However, training an E2E ASR model using a large ASR corpus with a GAN framework has never been explored, because training might take an excessively long time due to high-variance gradient updates and face convergence issues. In this paper, we introduce a novel framework for fine-tuning a pre-trained ASR model using the GAN objective, where the ASR model acts as a generator and a discriminator tries to distinguish the ASR output from the real data. Since the ASR model is pre-trained, we hypothesize that the ASR model output (soft distribution vectors) helps to get higher scores from the discriminator and makes the task of the discriminator harder within our GAN framework, which in turn improves the performance of the ASR model in the fine-tuning stage. Here, the pre-trained ASR model is fine-tuned adversarially against the discriminator using an additional adversarial loss. Experiments on the full LibriSpeech dataset show that our proposed approach outperforms baselines and conventional GAN-based adversarial models.
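
Illustrative sketch (not from the paper): the abstract describes fine-tuning in which the ASR model's soft output is scored by a discriminator and an additional adversarial loss is added to the ASR loss. The toy Python below shows one way such a combined objective could be computed for a single output token. It assumes a standard non-saturating GAN loss, a linear discriminator, and a weighting factor `adv_weight`; all function names and parameters here are hypothetical, and the paper's actual architecture and objective may differ.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def discriminator_score(dist, weights, bias=0.0):
    # Hypothetical toy discriminator: sigmoid of a linear function of a
    # token distribution. Returns the probability that the input is "real".
    z = sum(w * p for w, p in zip(weights, dist)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(dist, target_index):
    # Ordinary supervised ASR loss for one token.
    return -math.log(dist[target_index])

def generator_loss(asr_logits, target_index, d_weights, adv_weight=0.1):
    # ASR loss plus an additional adversarial term (assumed form):
    # the generator is rewarded when the discriminator scores its
    # soft distribution vector as real.
    soft = softmax(asr_logits)
    asr_loss = cross_entropy(soft, target_index)
    adv_loss = -math.log(discriminator_score(soft, d_weights))
    return asr_loss + adv_weight * adv_loss

def discriminator_loss(asr_logits, target_index, d_weights):
    # The discriminator should score the one-hot reference as real
    # and the ASR soft output as fake.
    soft = softmax(asr_logits)
    one_hot = [1.0 if i == target_index else 0.0 for i in range(len(soft))]
    real = discriminator_score(one_hot, d_weights)
    fake = discriminator_score(soft, d_weights)
    return -math.log(real) - math.log(1.0 - fake)
```

With discriminator weights that favor the reference token, a confident, correct ASR output receives a discriminator score close to that of the one-hot reference, which is consistent with the abstract's hypothesis that a pre-trained generator makes the discriminator's task harder.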

Related papers

Transferable Adversarial Attacks against ASR [43.766547483367795]
We study the vulnerability of practical black-box attacks in cutting-edge automatic speech recognition models. We propose a speech-aware gradient optimization approach (SAGO) for ASR, which forces mistranscription with minimal impact on human imperceptibility. Our comprehensive experimental results reveal performance enhancements compared to baseline approaches across five models on two databases.
arXiv Detail & Related papers (2024-11-14T06:32:31Z)
Crossmodal ASR Error Correction with Discrete Speech Units [16.58209270191005]
We propose a post-ASR processing approach for ASR Error Correction (AEC). We explore pre-training and fine-tuning strategies and uncover an ASR domain discrepancy phenomenon. We propose the incorporation of discrete speech units to align with and enhance the word embeddings for improving AEC quality.
arXiv Detail & Related papers (2024-05-26T19:58:38Z)
DifAugGAN: A Practical Diffusion-style Data Augmentation for GAN-based Single Image Super-resolution [88.13972071356422]
We propose a diffusion-style data augmentation scheme for GAN-based image super-resolution (SR) methods, known as DifAugGAN. It involves adapting the diffusion process in generative diffusion models for improving the calibration of the discriminator during training. Our DifAugGAN can be a Plug-and-Play strategy for current GAN-based SISR methods to improve the calibration of the discriminator and thus improve SR performance.
arXiv Detail & Related papers (2023-11-30T12:37:53Z)
End-to-End Speech Recognition: A Survey [68.35707678386949]
The goal of this survey is to provide a taxonomy of E2E ASR models and corresponding improvements. All relevant aspects of E2E ASR are covered in this work, accompanied by discussions of performance and deployment opportunities.
arXiv Detail & Related papers (2023-03-03T01:46:41Z)
Watch What You Pretrain For: Targeted, Transferable Adversarial Examples on Self-Supervised Speech Recognition models [27.414693266500603]
A targeted adversarial attack produces audio samples that can force an Automatic Speech Recognition system to output attacker-chosen text. Recent work has shown that transferability against large ASR models is very difficult. We show that modern ASR architectures, specifically ones based on Self-Supervised Learning, are in fact vulnerable to transferability.
arXiv Detail & Related papers (2022-09-17T15:01:26Z)
Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS). ASR errors directly affect the quality of the output summary in the cascade approach. We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
arXiv Detail & Related papers (2021-11-16T03:00:29Z)
Self-Damaging Contrastive Learning [92.34124578823977]
Unlabeled data in reality is commonly imbalanced and shows a long-tail distribution. This paper proposes a principled framework called Self-Damaging Contrastive Learning to automatically balance the representation learning without knowing the classes. Our experiments show that SDCLR significantly improves not only overall accuracies but also balancedness.
arXiv Detail & Related papers (2021-06-06T00:04:49Z)
Improving RNN Transducer Based ASR with Auxiliary Tasks [21.60022481898402]
End-to-end automatic speech recognition (ASR) models with a single neural network have recently demonstrated state-of-the-art results. In this work, we examine ways in which recurrent neural network transducer (RNN-T) can achieve better ASR accuracy via performing auxiliary tasks.
arXiv Detail & Related papers (2020-11-05T21:46:32Z)
Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling [76.43479696760996]
We propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition. We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training of full-context ASR.
arXiv Detail & Related papers (2020-10-12T21:12:56Z)
Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU). We show that the error rates of off-the-shelf ASR and following LU systems can be reduced significantly by 14% relative with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)

This list is automatically generated from the titles and abstracts of the papers on this site.