Emphasizing Unseen Words: New Vocabulary Acquisition for End-to-End Speech Recognition
- URL: http://arxiv.org/abs/2302.09723v2
- Date: Tue, 21 Feb 2023 09:44:33 GMT
- Title: Emphasizing Unseen Words: New Vocabulary Acquisition for End-to-End Speech Recognition
- Authors: Leyuan Qu, Cornelius Weber and Stefan Wermter
- Abstract summary: Out-Of-Vocabulary words, such as trending words and new named entities, pose problems to modern ASR systems.
We propose to generate OOV words using text-to-speech systems and to rescale losses to encourage neural networks to pay more attention to OOV words.
- Score: 21.61242091927018
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the dynamic nature of human language, automatic speech recognition
(ASR) systems need to continuously acquire new vocabulary. Out-Of-Vocabulary
(OOV) words, such as trending words and new named entities, pose problems to
modern ASR systems that require long training times to adapt their large
numbers of parameters. Unlike most previous research, which focuses on language-model post-processing, we tackle this problem at an earlier processing level and eliminate the bias in acoustic modeling so that OOV words can be recognized acoustically. We propose to generate OOV words using text-to-speech systems and to rescale losses to encourage neural networks to pay more attention to OOV words. Specifically, when fine-tuning a previously trained model on synthetic audio, we either enlarge the classification loss of utterances containing OOV words (sentence-level) or rescale the gradients used for back-propagation for OOV words (word-level).
To overcome catastrophic forgetting, we also explore combining loss rescaling with model regularization, i.e., L2 regularization and elastic weight consolidation (EWC). Compared with previous methods that simply fine-tune on synthetic audio with EWC, experimental results on the LibriSpeech benchmark reveal that our proposed loss-rescaling approach achieves a significant improvement in recall rate with only a slight decrease in word error rate. Moreover, word-level rescaling is more stable than utterance-level rescaling and leads to higher recall and precision on OOV word recognition. Furthermore, our proposed combination of loss rescaling and weight consolidation can support continual learning of an ASR system.
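The core of this loss-rescaling idea fits in a few lines of PyTorch. The sketch below is a minimal illustration, not the authors' implementation; the `oov_scale` factor and the mask construction are assumptions.

```python
import torch
import torch.nn.functional as F

def rescaled_ce_loss(logits, targets, oov_mask, oov_scale=5.0,
                     utterance_level=False):
    """Cross-entropy loss with extra weight on OOV words.

    logits:   (batch, time, vocab) network outputs
    targets:  (batch, time) target token ids
    oov_mask: (batch, time) bool, True where the target token is OOV
    """
    # Per-token cross-entropy, kept unreduced so it can be reweighted.
    per_token = F.cross_entropy(logits.transpose(1, 2), targets,
                                reduction="none")  # (batch, time)

    if utterance_level:
        # Sentence-level rescaling: enlarge the loss of every utterance
        # that contains at least one OOV word.
        has_oov = oov_mask.any(dim=1, keepdim=True).float()  # (batch, 1)
        weights = 1.0 + (oov_scale - 1.0) * has_oov
    else:
        # Word-level rescaling: enlarge only the loss terms of the OOV
        # tokens themselves.
        weights = 1.0 + (oov_scale - 1.0) * oov_mask.float()

    return (weights * per_token).mean()
```

Since scaling a token's loss term scales its back-propagated gradient by the same factor, the word-level variant corresponds to the gradient rescaling described in the abstract.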
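The EWC term used against catastrophic forgetting penalizes movement of parameters that were important for the original training data. A minimal sketch, assuming the diagonal Fisher estimates `fisher` and the pre-fine-tuning parameters `old_params` were captured beforehand; the penalty weight `lam` is an assumption, and L2 regularization is the special case where the Fisher estimate is replaced by a constant.

```python
import torch

def ewc_penalty(model, old_params, fisher, lam=1000.0):
    """EWC regularizer: lam/2 * sum_i F_i * (theta_i - theta*_i)^2.

    old_params, fisher: dicts mapping parameter names to tensors captured
    before fine-tuning (theta* and the diagonal Fisher estimate F_i).
    """
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, param in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name]
                                 * (param - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# Combined objective when fine-tuning on synthetic OOV audio:
#   total_loss = rescaled_ce_loss(...) + ewc_penalty(model, old_params, fisher)
```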
Related papers
- Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation [27.057810339120664]
We propose two techniques to improve context-aware ASR models.
On LibriSpeech, our techniques together reduce the rare-word error rate by 60% and 25% relative, compared to no biasing and shallow fusion, respectively.
On SPGISpeech and a real-world dataset ConEC, our techniques also yield good improvements over the baselines.
arXiv Detail & Related papers (2024-07-14T19:32:33Z)
- Continuously Learning New Words in Automatic Speech Recognition [56.972851337263755]
We propose a self-supervised continual learning approach to recognize new words.
We use a memory-enhanced automatic speech recognition model from previous work.
We show that with this approach, performance on the new words improves as they occur more frequently.
arXiv Detail & Related papers (2024-01-09T10:39:17Z)
- Improved Contextual Recognition in Automatic Speech Recognition Systems by Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution uses Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with deep neural network (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
Given a reasonable prompt, an LLM's generative capability can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
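As an illustration of this error-correction setting, an N-best list can be formatted into a prompt for an LLM; the `build_correction_prompt` helper and its wording are hypothetical, not the benchmark's actual interface.

```python
def build_correction_prompt(nbest):
    """Format an N-best hypothesis list as an LLM error-correction prompt.

    nbest: list of hypothesis strings, best-first.
    """
    hyps = "\n".join(f"{i + 1}. {hyp}" for i, hyp in enumerate(nbest))
    return ("Below are N-best hypotheses from a speech recognizer.\n"
            f"{hyps}\n"
            "Output the most likely correct transcription.")

prompt = build_correction_prompt([
    "the cat sad on the mat",
    "the cat sat on the mad",
    "a cat sat on the mat",
])
# The LLM's reply is taken as the corrected transcript; being generative, it
# can produce tokens that appear in no hypothesis, recovering missing words.
```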
- Surrogate Gradient Spiking Neural Networks as Encoders for Large Vocabulary Continuous Speech Recognition [91.39701446828144]
We show that spiking neural networks can be trained like standard recurrent neural networks using the surrogate gradient method.
They have shown promising results on speech command recognition tasks.
In contrast to their recurrent non-spiking counterparts, they show robustness to exploding gradient problems without the need to use gates.
arXiv Detail & Related papers (2022-12-01T12:36:26Z)
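The surrogate gradient method keeps the hard spiking threshold in the forward pass but substitutes a smooth derivative in the backward pass, which is what lets the network be trained like a standard recurrent network. A minimal PyTorch sketch; the fast-sigmoid surrogate used here is a common choice, not necessarily the paper's.

```python
import torch

class SpikeFunction(torch.autograd.Function):
    """Heaviside spike forward, surrogate gradient backward."""

    @staticmethod
    def forward(ctx, membrane_potential):
        ctx.save_for_backward(membrane_potential)
        return (membrane_potential > 0).float()  # binary spikes

    @staticmethod
    def backward(ctx, grad_output):
        (u,) = ctx.saved_tensors
        # Fast-sigmoid surrogate derivative: 1 / (1 + |u|)^2
        return grad_output / (1.0 + u.abs()) ** 2

spike = SpikeFunction.apply  # drop-in, differentiable spike nonlinearity
```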
- Context-based out-of-vocabulary word recovery for ASR systems in Indian languages [5.930734371401316]
We propose a post-processing technique to improve the performance of context-based OOV recovery.
The effectiveness of the proposed cost function is evaluated at both the word and sentence levels.
arXiv Detail & Related papers (2022-06-09T06:51:31Z)
- Short-Term Word-Learning in a Dynamically Changing Environment [63.025297637716534]
We show how to supplement an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly.
We demonstrate significant improvements in the detection rate of new words with only a minor increase in false alarms.
arXiv Detail & Related papers (2022-03-29T10:05:39Z)
- Frequency-Aware Contrastive Learning for Neural Machine Translation [24.336356651877388]
Low-frequency word prediction remains a challenge in modern neural machine translation (NMT) systems.
Inspired by the observation that low-frequency words form a more compact embedding space, we tackle this challenge from a representation learning perspective.
We propose a frequency-aware token-level contrastive learning method, in which the hidden state of each decoding step is pushed away from the counterparts of other target words.
arXiv Detail & Related papers (2021-12-29T10:10:10Z)
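A simplified sketch of a token-level contrastive term in this spirit: each decoder hidden state is pulled toward its own target embedding and pushed away from the embeddings of the other target words in the batch. The paper's frequency-aware weighting is omitted, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def token_contrastive_loss(hidden, target_emb, temperature=0.1):
    """InfoNCE-style loss over decoder states and target embeddings.

    hidden:     (n_tokens, dim) decoder hidden states
    target_emb: (n_tokens, dim) embeddings of the matching target words
    """
    h = F.normalize(hidden, dim=-1)
    e = F.normalize(target_emb, dim=-1)
    logits = h @ e.t() / temperature              # (n_tokens, n_tokens)
    labels = torch.arange(h.size(0), device=h.device)
    # Diagonal entries are positives; all other target words in the batch
    # act as negatives (duplicate targets are ignored for simplicity).
    return F.cross_entropy(logits, labels)
```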
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
By feeding the predicted discrete symbol sequence to the synthesis model, each target speech signal can be re-synthesized.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer achieved a competitive result, with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
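BPE-dropout yields a different subword segmentation of the same transcript each time it is applied, which is what makes the acoustic-unit targets dynamic during training. With the sentencepiece library this is exposed as sampling at encoding time; the model file name and parameter values below are illustrative.

```python
import sentencepiece as spm

# Assumes a BPE model was trained beforehand, e.g. with
# spm.SentencePieceTrainer.train(..., model_type="bpe").
sp = spm.SentencePieceProcessor(model_file="bpe.model")

text = "merhaba dünya"
print(sp.encode(text, out_type=str))  # deterministic segmentation

# Sampled segmentations: with a BPE model, alpha acts as the merge-dropout
# probability, so each call may return a different subword sequence.
for _ in range(3):
    print(sp.encode(text, out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```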