Boosting Hybrid Autoregressive Transducer-based ASR with Internal Acoustic Model Training and Dual Blank Thresholding
- URL: http://arxiv.org/abs/2409.20313v1
- Date: Mon, 30 Sep 2024 14:14:45 GMT
- Title: Boosting Hybrid Autoregressive Transducer-based ASR with Internal Acoustic Model Training and Dual Blank Thresholding
- Authors: Takafumi Moriya, Takanori Ashihara, Masato Mimura, Hiroshi Sato, Kohei Matsuura, Ryo Masumura, Taichi Asami
- Abstract summary: We propose an internal acoustic model (IAM) training strategy to enhance HAT-based speech recognition.
IAM consists of encoder and joint networks that are fully shared and jointly trained with HAT.
- Score: 35.29443802809237
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: A hybrid autoregressive transducer (HAT) is a variant of the neural transducer that models blank and non-blank posterior distributions separately. In this paper, we propose a novel internal acoustic model (IAM) training strategy to enhance HAT-based speech recognition. IAM consists of encoder and joint networks, which are fully shared and jointly trained with HAT. This joint training not only improves HAT training efficiency but also encourages IAM and HAT to emit blanks synchronously, which skips the more expensive non-blank computation and enables more effective blank thresholding for faster decoding. Experiments demonstrate that the relative error reductions of HAT with IAM over the vanilla HAT are statistically significant. Moreover, we introduce dual blank thresholding, which combines HAT- and IAM-blank thresholding, together with a compatible decoding algorithm. This results in a 42-75% decoding speed-up with no major performance degradation.
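The speed-up comes from skipping the label (non-blank) computation whenever both blank branches are confident. Below is a minimal sketch of that control flow; the toy weights, thresholds, and greedy loop are our own assumptions for illustration, not the paper's implementation.

```python
import numpy as np

# Toy sketch of dual blank thresholding during greedy transducer decoding.
# All weights and thresholds are illustrative assumptions.

rng = np.random.default_rng(0)
VOCAB, DIM, T = 10, 16, 20

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stand-ins for the trained networks (random weights for the sketch).
W_hat_blank = rng.normal(size=DIM)        # HAT blank branch
W_iam_blank = rng.normal(size=DIM)        # IAM blank branch (shares the encoder)
W_label = rng.normal(size=(DIM, VOCAB))   # expensive non-blank joint network

def decode(enc, thr_hat=0.9, thr_iam=0.9):
    """Greedy decode with dual blank thresholding.

    If BOTH blank posteriors clear their thresholds, blank is emitted
    and the non-blank (label) computation is skipped for this frame.
    """
    hyp, skipped = [], 0
    for t in range(enc.shape[0]):
        h = enc[t]
        p_blank_hat = sigmoid(W_hat_blank @ h)
        p_blank_iam = sigmoid(W_iam_blank @ h)
        if p_blank_hat > thr_hat and p_blank_iam > thr_iam:
            skipped += 1          # blank emitted, label softmax avoided
            continue
        logits = h @ W_label      # the costly step that thresholding avoids
        hyp.append(int(np.argmax(logits)))
    return hyp, skipped

enc = rng.normal(size=(T, DIM))   # pretend encoder output
hyp, skipped = decode(enc)
print(f"hypothesis={hyp}, frames skipped={skipped}/{T}")
```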
Related papers
- Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment [81.84950252537618]
This paper reveals a unified game-theoretic connection between iterative BOND and self-play alignment.
We establish a novel framework, WIN rate Dominance (WIND), with a series of efficient algorithms for regularized win rate dominance optimization.
arXiv Detail & Related papers (2024-10-28T04:47:39Z)
- Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive ASR [17.950722198543897]
We present Hybrid-Autoregressive INference TrANsducers (HAINAN), a novel architecture for speech recognition.
HAINAN supports both autoregressive inference with all network components and non-autoregressive inference without the predictor.
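A toy illustration of what supporting both modes means in practice; the weights and predictor fusion below are assumptions, not HAINAN's actual architecture.

```python
import numpy as np

# Minimal sketch of the two inference modes described above:
# autoregressive decoding with the predictor, and non-autoregressive
# decoding that drops it. Shapes and weights are toy assumptions.

rng = np.random.default_rng(1)
VOCAB, DIM, T = 8, 12, 6
W_joint = rng.normal(size=(DIM, VOCAB))
W_pred = rng.normal(size=(VOCAB, DIM))   # predictor embedding of last token

def ar_decode(enc):
    """Autoregressive: each step conditions on the previous output."""
    prev, out = 0, []
    for t in range(enc.shape[0]):
        h = enc[t] + W_pred[prev]        # fuse predictor state
        prev = int(np.argmax(h @ W_joint))
        out.append(prev)
    return out

def nar_decode(enc):
    """Non-autoregressive: all frames scored in parallel, no predictor."""
    return [int(i) for i in np.argmax(enc @ W_joint, axis=-1)]

enc = rng.normal(size=(T, DIM))
print("AR :", ar_decode(enc))
print("NAR:", nar_decode(enc))
```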
arXiv Detail & Related papers (2024-10-03T15:38:20Z)
- Covariance-corrected Whitening Alleviates Network Degeneration on Imbalanced Classification [6.197116272789107]
Class imbalance is a critical issue in image classification that significantly affects the performance of deep recognition models.
We propose a novel framework called Whitening-Net to mitigate the resulting degenerate solutions.
In scenarios with extreme class imbalance, the batch covariance statistic exhibits significant fluctuations, impeding the convergence of the whitening operation.
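For intuition, here is a small sketch of batch feature whitening with a shrinkage-style covariance correction; the shrinkage used below is a generic stabilizer standing in for the paper's correction, not its actual method.

```python
import numpy as np

# Sketch of ZCA batch whitening with covariance shrinkage. Under extreme
# class imbalance the batch covariance estimate fluctuates; blending it
# toward the identity keeps the inverse square root well conditioned.
# The shrinkage correction is our assumption, not the paper's.

def whiten(x, shrink=0.1, eps=1e-5):
    """ZCA-whiten a (batch, dim) feature matrix."""
    x = x - x.mean(axis=0, keepdims=True)
    cov = (x.T @ x) / max(len(x) - 1, 1)
    cov = (1 - shrink) * cov + shrink * np.eye(cov.shape[0])  # correction
    vals, vecs = np.linalg.eigh(cov)
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return x @ inv_sqrt

rng = np.random.default_rng(2)
feats = rng.normal(size=(32, 8)) @ rng.normal(size=(8, 8))  # correlated
white = whiten(feats)
print(np.round(np.cov(white, rowvar=False), 2))  # approximately identity
```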
arXiv Detail & Related papers (2024-08-30T10:49:33Z)
- Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion [93.32354378820648]
We introduce MVSD, a mutual learning framework based on diffusion models.
MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks.
Our framework can improve the performance of the reverberator and dereverberator.
arXiv Detail & Related papers (2024-07-15T00:47:56Z)
- DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in the EER metric on the CN-Celeb evaluation set.
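A hedged sketch of what augmentation directly in embedding space can look like; the difficulty proxy and noise model below are our assumptions, not the DASA formulation.

```python
import numpy as np

# Sketch of augmenting a speaker embedding in embedding space: perturb
# it along the class's covariance directions, scaled by a difficulty
# weight. The distance-to-mean difficulty proxy is an assumption.

rng = np.random.default_rng(3)
DIM = 16
class_embs = rng.normal(size=(100, DIM))   # embeddings of one speaker
class_mean = class_embs.mean(axis=0)
class_cov = np.cov(class_embs, rowvar=False)

def augment(emb, strength=0.5):
    """Sample a semantic perturbation from N(0, class_cov), difficulty-scaled."""
    difficulty = np.linalg.norm(emb - class_mean)  # assumed difficulty proxy
    noise = rng.multivariate_normal(np.zeros(DIM), class_cov)
    return emb + strength * difficulty * noise

print(augment(class_embs[0])[:4])
```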
arXiv Detail & Related papers (2023-10-18T17:07:05Z)
- Modular Hybrid Autoregressive Transducer [51.29870462504761]
Text-only adaptation of a transducer model remains challenging for end-to-end speech recognition.
We propose a modular hybrid autoregressive transducer that has structurally separated label and blank decoders.
On Google's large-scale production data, a multi-domain MHAT adapted with 100B sentences achieves relative WER reductions of up to 12.4% without LM fusion.
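A toy sketch of the structural separation, assuming a HAT-style factorization; the shapes, state handling, and heads below are illustrative only.

```python
import numpy as np

# Toy sketch of structurally separated decoders: one scores only the
# blank decision, another scores labels like an internal language model,
# which is what makes text-only adaptation possible. All details here
# are assumptions, not MHAT's architecture.

rng = np.random.default_rng(4)
VOCAB, DIM = 8, 12
W_blank = rng.normal(size=2 * DIM)        # blank decoder head
W_label = rng.normal(size=(VOCAB, DIM))   # label decoder head (LM-like)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def step(enc_t, blank_state, label_state):
    """One decoding step with factorized blank/label distributions."""
    z = np.concatenate([enc_t, blank_state])
    p_blank = 1.0 / (1.0 + np.exp(-(W_blank @ z)))
    # Text-only adaptation would update only this label path.
    p_label = softmax(W_label @ (enc_t + label_state))
    # HAT-style factorization: non-blank mass is (1 - p_blank) * p_label.
    return p_blank, (1.0 - p_blank) * p_label

p_b, p_nb = step(rng.normal(size=DIM), rng.normal(size=DIM), rng.normal(size=DIM))
print(round(float(p_b), 3), p_nb.round(3))
```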
arXiv Detail & Related papers (2022-10-31T03:56:37Z)
- Integrate Lattice-Free MMI into End-to-End Speech Recognition [87.01137882072322]
In automatic speech recognition (ASR) research, discriminative criteria have achieved superior performance in DNN-HMM systems.
With this motivation, the adoption of discriminative criteria is promising to boost the performance of end-to-end (E2E) ASR systems.
Previous works have introduced the minimum Bayesian risk (MBR, one of the discriminative criteria) into E2E ASR systems.
In this work, novel algorithms are proposed to integrate another widely used discriminative criterion, lattice-free maximum mutual information (LF-MMI), into E2E ASR systems.
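For intuition, the MMI objective contrasts the reference against a denominator over competing hypotheses; a real LF-MMI system sums that denominator over a phone-level graph, for which the n-best list below merely stands in as a small-scale assumption.

```python
import numpy as np

# Toy illustration of the MMI criterion: maximize the log-posterior of
# the reference against a denominator over competitors. The n-best list
# is an assumption standing in for LF-MMI's denominator graph.

def mmi_loss(am_scores, lm_scores, ref_idx):
    """Negative MMI objective.

    am_scores / lm_scores: per-hypothesis log p(X|W) and log P(W).
    """
    joint = np.asarray(am_scores) + np.asarray(lm_scores)
    log_denominator = np.logaddexp.reduce(joint)  # log-sum over competitors
    return -(joint[ref_idx] - log_denominator)

# Reference is hypothesis 0; a well-trained model gives it a low loss.
print(mmi_loss(am_scores=[-10.0, -14.0, -15.0],
               lm_scores=[-3.0, -2.5, -4.0], ref_idx=0))
```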
arXiv Detail & Related papers (2022-03-29T14:32:46Z)
- Improving Hybrid CTC/Attention End-to-end Speech Recognition with Pretrained Acoustic and Language Model [4.490054848527943]
We propose a pretrained Transformer (Preformer) S2S ASR architecture based on hybrid CTC/attention E2E models.
To the best of our knowledge, this is the first work to utilize both a pretrained AM and a pretrained LM in an S2S ASR system.
arXiv Detail & Related papers (2021-12-14T09:38:31Z)
- A Mixture of Expert Based Deep Neural Network for Improved ASR [4.993304210475779]
MixNet is a novel deep learning architecture for acoustic modeling in the context of Automatic Speech Recognition (ASR).
In natural speech, overlap in distribution across different acoustic classes is inevitable, which leads to inter-class mis-classification.
Experiments conducted on a large-vocabulary ASR task show that the proposed architecture provides 13.6% and 10.0% relative reductions in word error rates.
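A minimal mixture-of-experts sketch in this spirit; the number of experts and the layer sizes are arbitrary assumptions, not MixNet's configuration.

```python
import numpy as np

# Minimal mixture-of-experts acoustic model sketch: a gating network
# weights the posteriors of several expert models per input frame.

rng = np.random.default_rng(5)
DIM, CLASSES, EXPERTS = 40, 10, 3
W_gate = rng.normal(size=(EXPERTS, DIM))
W_experts = rng.normal(size=(EXPERTS, CLASSES, DIM))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_posterior(feat):
    """Combine expert posteriors with input-dependent gating weights."""
    gates = softmax(W_gate @ feat)                 # (EXPERTS,)
    experts = softmax(W_experts @ feat, axis=-1)   # (EXPERTS, CLASSES)
    return gates @ experts                         # (CLASSES,)

post = moe_posterior(rng.normal(size=DIM))
print(post.round(3), post.sum())  # a proper distribution over classes
```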
arXiv Detail & Related papers (2021-12-02T07:26:34Z)
- On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer [40.63693071222628]
We study minimum word error rate (MWER) training of the Hybrid Autoregressive Transducer (HAT).
From experiments with around 30,000 hours of training data, we show that MWER training can improve the accuracy of HAT models.
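A small sketch of the MWER objective over an n-best list, with a mean-error baseline as a common variance-reduction choice; the details are our assumptions, not the paper's recipe.

```python
import numpy as np

# Sketch of the MWER objective: renormalize model scores over the n-best
# hypotheses, then minimize the expected number of word errors. The
# mean-error baseline is a common variance-reduction choice, assumed here.

def mwer_loss(log_scores, word_errors):
    log_scores = np.asarray(log_scores, dtype=float)
    errs = np.asarray(word_errors, dtype=float)
    probs = np.exp(log_scores - np.logaddexp.reduce(log_scores))  # renormalize
    return float(probs @ (errs - errs.mean()))

# Three hypotheses with 0, 2, and 3 word errors against the reference.
print(mwer_loss(log_scores=[-5.0, -5.5, -9.0], word_errors=[0, 2, 3]))
```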
arXiv Detail & Related papers (2020-10-23T21:16:30Z)