Personalization for BERT-based Discriminative Speech Recognition
Rescoring
- URL: http://arxiv.org/abs/2307.06832v1
- Date: Thu, 13 Jul 2023 15:54:32 GMT
- Title: Personalization for BERT-based Discriminative Speech Recognition
Rescoring
- Authors: Jari Kolehmainen, Yile Gu, Aditya Gourav, Prashanth Gurunath
Shivakumar, Ankur Gandhe, Ariya Rastrow, Ivan Bulyko
- Abstract summary: Three novel approaches that use personalized content in a neural rescoring step to improve recognition: gazetteers, prompting, and a cross-attention-based encoder-decoder model.
On a test set with personalized named entities, we show that each of these approaches improves word error rate by over 10% against a neural rescoring baseline.
- Score: 13.58828513686159
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recognition of personalized content remains a challenge in end-to-end speech
recognition. We explore three novel approaches that use personalized content in
a neural rescoring step to improve recognition: gazetteers, prompting, and a
cross-attention based encoder-decoder model. We use internal de-identified
en-US data from interactions with a virtual voice assistant supplemented with
personalized named entities to compare these approaches. On a test set with
personalized named entities, we show that each of these approaches improves
word error rate by over 10%, against a neural rescoring baseline. We also show
that on this test set, natural language prompts can improve word error rate by
7% without any training and with a marginal loss in generalization. Overall,
gazetteers were found to perform the best with a 10% improvement in word error
rate (WER), while also improving WER on a general test set by 1%.
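As a concrete illustration of the prompting idea, the following is a minimal sketch, not the paper's actual rescorer: it prepends a user's personalized entities to each first-pass N-best hypothesis as a natural-language prompt, scores the result with a BERT masked-LM pseudo-log-likelihood, and interpolates that score with the first-pass score. The `bert-base-uncased` checkpoint, the prompt wording, the interpolation weight, and the toy N-best list are all assumptions made for illustration; the paper itself uses a discriminative BERT rescorer trained on de-identified assistant data.

```python
# Sketch only: BERT pseudo-log-likelihood rescoring of an ASR N-best list,
# with personalized entities injected as a natural-language prompt.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def pseudo_log_likelihood(text: str) -> float:
    """Sum of log-probabilities of each token when it is masked in turn."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

def rescore(nbest, personal_entities, lam=0.5):
    """Return the hypothesis with the best interpolated first-pass + BERT score."""
    # Hypothetical prompt wording; the paper's actual prompts are not reproduced here.
    prompt = "relevant contact names are " + ", ".join(personal_entities) + ". "
    scored = []
    for hyp, first_pass_score in nbest:
        bert_score = pseudo_log_likelihood(prompt + hyp)
        scored.append((first_pass_score + lam * bert_score, hyp))
    return max(scored)[1]

# Toy example: the personalized name should help the rescorer prefer the second hypothesis.
nbest = [("call angela kapoor", -1.2), ("call anjali kapoor", -1.4)]
print(rescore(nbest, ["anjali kapoor", "ravi mehta"]))
```

The pseudo-log-likelihood here only stands in for the paper's discriminative rescoring score; the point of the sketch is where the personalized content enters the rescorer's input, not how the score itself is trained.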
Related papers
- Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-Labeling [21.82879779173242]
The lack of labeled data is a common challenge in speech classification tasks.
We propose a Semi-Supervised Learning (SSL) framework, introducing a novel multi-view pseudo-labeling method.
We evaluate our SSL framework on emotion recognition and dementia detection tasks.
arXiv Detail & Related papers (2024-09-25T13:51:19Z)
- InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions [5.50485371072671]
Our method improves the recognition accuracy of misrecognized target keywords by substituting intermediate CTC predictions with corrected labels.
Experiments conducted in Japanese demonstrated that our method successfully improved the F1 score for unknown words.
arXiv Detail & Related papers (2024-06-21T06:25:10Z)
- Continuously Learning New Words in Automatic Speech Recognition [56.972851337263755]
We propose a self-supervised continual learning approach to recognize new words.
We use a memory-enhanced Automatic Speech Recognition model from previous work.
We show that with this approach, we obtain increasing performance on the new words when they occur more frequently.
arXiv Detail & Related papers (2024-01-09T10:39:17Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, the generative capability of LLMs can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- SememeASR: Boosting Performance of End-to-End Speech Recognition against Domain and Long-Tailed Data Shift with Sememe Semantic Knowledge [58.979490858061745]
We introduce sememe-based semantic knowledge information to speech recognition.
Our experiments show that sememe information can improve the effectiveness of speech recognition.
In addition, our further experiments show that sememe knowledge can improve the model's recognition of long-tailed data.
arXiv Detail & Related papers (2023-09-04T08:35:05Z)
- SpellMapper: A non-autoregressive neural spellchecker for ASR customization with candidate retrieval based on n-gram mappings [76.87664008338317]
Contextual spelling correction models are an alternative to shallow fusion for improving automatic speech recognition.
We propose a novel algorithm for candidate retrieval based on misspelled n-gram mappings.
Experiments on Spoken Wikipedia show 21.4% word error rate improvement compared to a baseline ASR system.
arXiv Detail & Related papers (2023-06-04T10:00:12Z)
- A New Benchmark of Aphasia Speech Recognition and Detection Based on E-Branchformer and Multi-task Learning [29.916793641951507]
This paper presents a new benchmark for Aphasia speech recognition using state-of-the-art speech recognition techniques.
We introduce two multi-task learning methods based on the CTC/Attention architecture to perform both tasks simultaneously.
Our system achieves state-of-the-art speaker-level detection accuracy (97.3%), and a relative WER reduction of 11% for moderate Aphasia patients.
arXiv Detail & Related papers (2023-05-19T15:10:36Z)
- Short-Term Word-Learning in a Dynamically Changing Environment [63.025297637716534]
We show how to supplement an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly.
We demonstrate significant improvements in the detection rate of new words with only a minor increase in false alarms.
arXiv Detail & Related papers (2022-03-29T10:05:39Z)
- Personalization Strategies for End-to-End Speech Recognition Systems [12.993241217354322]
We show how first and second-pass rescoring strategies can be leveraged together to improve the recognition of personalized words.
We show that such an approach can improve personalized content recognition by up to 16% with minimum degradation on the general use case.
We also describe a novel second-pass de-biasing approach, used in conjunction with first-pass shallow fusion, that optimizes for oracle WER.
arXiv Detail & Related papers (2021-02-15T18:36:13Z)
- UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data [54.733889961024445]
We propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data.
We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on public CommonVoice corpus.
arXiv Detail & Related papers (2021-01-19T12:53:43Z)
- Data augmentation using prosody and false starts to recognize non-native children's speech [12.911954427107977]
This paper describes AaltoASR's speech recognition system for the INTERSPEECH 2020 shared task on Automatic Speech Recognition.
The task is to recognize non-native speech from children of various age groups given a limited amount of speech.
arXiv Detail & Related papers (2020-08-29T05:32:32Z)