Enhancing and Adversarial: Improve ASR with Speaker Labels
- URL: http://arxiv.org/abs/2211.06369v1
- Date: Fri, 11 Nov 2022 17:40:08 GMT
- Title: Enhancing and Adversarial: Improve ASR with Speaker Labels
- Authors: Wei Zhou, Haotian Wu, Jingjing Xu, Mohammad Zeineldeen, Christoph Lüscher, Ralf Schlüter, Hermann Ney
- Abstract summary: We propose a novel adaptive gradient reversal layer for stable and effective adversarial training without tuning effort.
Detailed analysis and experimental verification are conducted to show the optimal positions in the ASR neural network (NN) to apply speaker enhancing and adversarial training.
Our best speaker-based MTL achieves 7% relative improvement on the Switchboard Hub5'00 set.
- Score: 49.73714831258699
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: ASR can be improved by multi-task learning (MTL) with domain enhancing or
domain adversarial training, which are two opposite objectives with the aim to
increase/decrease domain variance towards domain-aware/agnostic ASR,
respectively. In this work, we study how to best apply these two opposite
objectives with speaker labels to improve conformer-based ASR. We also propose
a novel adaptive gradient reversal layer for stable and effective adversarial
training without tuning effort. Detailed analysis and experimental verification
are conducted to show the optimal positions in the ASR neural network (NN) to
apply speaker enhancing and adversarial training. We also explore their
combination for further improvement, achieving the same performance as
i-vectors plus adversarial training. Our best speaker-based MTL achieves 7%
relative improvement on the Switchboard Hub5'00 set. We also investigate the
effect of such speaker-based MTL with respect to a cleaner dataset and a weaker ASR NN.
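A gradient reversal layer (GRL) is the identity in the forward pass and negates (and scales) the gradient in the backward pass, so a speaker classifier attached behind it drives the shared encoder toward speaker-agnostic features. Below is a minimal PyTorch sketch of a standard GRL with a fixed scaling factor; the paper's contribution is an adaptive scaling scheme, which is not reproduced here, and the fixed `lambda_` is an illustrative assumption.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reversed, scaled gradient backward."""

    @staticmethod
    def forward(ctx, x, lambda_):
        ctx.lambda_ = lambda_
        return x.view_as(x)  # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Flip and scale the gradient flowing back into the encoder,
        # pushing it toward speaker-agnostic representations.
        return -ctx.lambda_ * grad_output, None

def grad_reverse(x, lambda_=1.0):
    return GradReverse.apply(x, lambda_)
```

For speaker *enhancing* MTL, the same auxiliary classifier is attached without the reversal, so its gradient instead encourages speaker-aware features; the abstract's combination experiments apply enhancing and adversarial branches at different positions in the network.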
Related papers
- Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion [93.32354378820648]
We introduce MVSD, a mutual learning framework based on diffusion models.
MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks.
Our framework can improve the performance of the reverberator and dereverberator.
arXiv Detail & Related papers (2024-07-15T00:47:56Z)
- ROSE: A Recognition-Oriented Speech Enhancement Framework in Air Traffic Control Using Multi-Objective Learning [6.60571587618006]
Radio speech echo is a specific phenomenon in the air traffic control (ATC) domain, which degrades speech quality and impacts automatic speech recognition (ASR) accuracy.
In this work, a time-domain recognition-oriented speech enhancement framework is proposed to improve speech intelligibility and advance ASR accuracy.
The framework serves as a plug-and-play tool in ATC scenarios and does not require additional retraining of the ASR model.
arXiv Detail & Related papers (2023-12-11T04:51:41Z)
- DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in the EER metric on the CN-Celeb evaluation set; a generic sketch of embedding-space augmentation follows.
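As a generic illustration only (not DASA's exact formulation), augmentation in speaker embedding space can be as cheap as adding class-conditional noise to each embedding; the helper below is a hypothetical sketch.

```python
import torch

def augment_embeddings(emb, class_std, scale=0.5):
    """Hypothetical embedding-space augmentation: emb is a (B, D) batch of
    speaker embeddings, class_std holds (B, D) per-class standard deviations."""
    noise = torch.randn_like(emb) * class_std  # class-conditional noise
    return emb + scale * noise  # extra training samples at negligible cost
```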
arXiv Detail & Related papers (2023-10-18T17:07:05Z)
- Gradient Remedy for Multi-Task Learning in End-to-End Noise-Robust Speech Recognition [23.042478625584653]
Gradient remedy (GR) is a simple yet effective approach to resolve interference between task gradients in noise-robust speech recognition.
We show that the proposed approach resolves the gradient interference well, achieving relative word error rate (WER) reductions of 9.3% and 11.1% over the multi-task learning baseline; a hedged sketch of gradient projection follows.
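One common way to resolve such interference, shown below as an assumption rather than the paper's exact gradient remedy, is to project a conflicting auxiliary gradient onto the normal plane of the main-task gradient, in the spirit of PCGrad.

```python
import torch

def project_conflict(g_aux, g_main):
    """Drop the component of g_aux that opposes g_main, so the two task
    updates no longer fight each other. Expects flattened 1-D gradients."""
    dot = torch.dot(g_aux, g_main)
    if dot < 0:  # gradients point in conflicting directions
        g_aux = g_aux - (dot / g_main.norm().pow(2)) * g_main
    return g_aux
```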
arXiv Detail & Related papers (2023-02-22T13:31:13Z)
- Effect and Analysis of Large-scale Language Model Rescoring on Competitive ASR Systems [30.873546090458678]
Large-scale language models (LLMs) have been successfully applied to ASR N-best rescoring.
In this study, we incorporate LLM rescoring into one of the most competitive ASR baselines: the Conformer-Transducer model.
arXiv Detail & Related papers (2022-04-01T05:20:55Z)
- Automated Audio Captioning using Transfer Learning and Reconstruction Latent Space Similarity Regularization [21.216783537997426]
We propose an architecture that is able to better leverage the acoustic features provided by PANNs for the Automated Audio Captioning Task.
We also introduce a novel self-supervised objective, Reconstruction Latent Space Similarity Regularization (RLSSR).
arXiv Detail & Related papers (2021-08-10T13:49:41Z)
- Transferable Deep Reinforcement Learning Framework for Autonomous Vehicles with Joint Radar-Data Communications [69.24726496448713]
We propose an intelligent optimization framework based on the Markov Decision Process (MDP) to help the AV make optimal decisions.
We then develop an effective learning algorithm leveraging recent advances in deep reinforcement learning to find the optimal policy for the AV.
We show that the proposed transferable deep reinforcement learning framework reduces the AV's obstacle miss-detection probability by up to 67% compared with conventional deep reinforcement learning approaches.
arXiv Detail & Related papers (2021-05-28T08:45:37Z)
- Improving RNN Transducer Based ASR with Auxiliary Tasks [21.60022481898402]
End-to-end automatic speech recognition (ASR) models with a single neural network have recently demonstrated state-of-the-art results.
In this work, we examine ways in which a recurrent neural network transducer (RNN-T) can achieve better ASR accuracy by performing auxiliary tasks; a hedged sketch of such loss weighting follows.
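Auxiliary losses are typically folded into training as a weighted sum with the main transducer loss; the weights and auxiliary objectives below are illustrative assumptions, not the paper's values.

```python
import torch

def mtl_loss(loss_rnnt, aux_losses, aux_weights):
    """Combine the main RNN-T loss with weighted auxiliary-task losses."""
    total = loss_rnnt
    for w, l in zip(aux_weights, aux_losses):
        total = total + w * l
    return total

# Example with two hypothetical auxiliary objectives weighted 0.3 and 0.1.
total = mtl_loss(torch.tensor(2.5), [torch.tensor(1.0), torch.tensor(0.8)], [0.3, 0.1])
```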
arXiv Detail & Related papers (2020-11-05T21:46:32Z)
- Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement [102.48582597586233]
We present a U-Net based attention model, U-Net_At, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)