Personalization Strategies for End-to-End Speech Recognition Systems
- URL: http://arxiv.org/abs/2102.07739v1
- Date: Mon, 15 Feb 2021 18:36:13 GMT
- Title: Personalization Strategies for End-to-End Speech Recognition Systems
- Authors: Aditya Gourav, Linda Liu, Ankur Gandhe, Yile Gu, Guitang Lan,
Xiangyang Huang, Shashank Kalmane, Gautam Tiwari, Denis Filimonov, Ariya
Rastrow, Andreas Stolcke, Ivan Bulyko
- Abstract summary: We show how first and second-pass rescoring strategies can be leveraged together to improve the recognition of personalized words.
We show that such an approach can improve personalized content recognition by up to 16% with minimal degradation on the general use case.
We also describe a novel second-pass de-biasing approach: used in conjunction with a first-pass shallow fusion that optimizes on oracle WER.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The recognition of personalized content, such as contact names, remains a
challenging problem for end-to-end speech recognition systems. In this work, we
demonstrate how first and second-pass rescoring strategies can be leveraged
together to improve the recognition of such words. Following previous work, we
use a shallow fusion approach to bias towards recognition of personalized
content in the first-pass decoding. We show that such an approach can improve
personalized content recognition by up to 16% with minimal degradation on the
general use case. We describe a fast and scalable algorithm that enables our
biasing models to remain at the word-level, while applying the biasing at the
subword level. This has the advantage of not requiring the biasing models to be
dependent on any subword symbol table. We also describe a novel second-pass
de-biasing approach: used in conjunction with a first-pass shallow fusion that
optimizes on oracle WER, we can achieve an additional 14% improvement on
personalized content recognition, and even improve accuracy for the general use
case by up to 2.5%.
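The first-pass shallow fusion described above boosts hypotheses containing personalized words by interpolating the ASR score with a biasing score. The sketch below is a minimal illustration of that idea, not the authors' implementation: the function names, the flat per-match bonus, and the weight of 0.5 are all assumptions, and a real system would apply the bonus incrementally at the subword level during beam search rather than over finished hypotheses.

```python
def shallow_fusion_score(asr_log_prob, biasing_bonus, lam=0.5):
    """Interpolate the ASR model score with a contextual biasing bonus."""
    return asr_log_prob + lam * biasing_bonus

def rescore_hypotheses(hypotheses, personalized_words, lam=0.5, bonus=2.0):
    """Boost beam-search hypotheses that contain personalized words.

    hypotheses: list of (text, log_prob) pairs from first-pass decoding.
    Returns the hypotheses re-ranked by the fused score.
    """
    rescored = []
    for text, log_prob in hypotheses:
        # Count matches against the user's personalized vocabulary
        # (e.g. contact names) and award a flat bonus per match.
        hits = sum(1 for w in text.split() if w in personalized_words)
        rescored.append((text, shallow_fusion_score(log_prob, hits * bonus, lam)))
    return sorted(rescored, key=lambda p: p[1], reverse=True)

# Hypothetical example: the contact name "anjali" loses to the more
# frequent word "angela" in the first pass, but wins after biasing.
contacts = {"anjali"}
hyps = [("call angela now", -3.2), ("call anjali now", -3.9)]
best = rescore_hypotheses(hyps, contacts)[0][0]
print(best)  # "call anjali now"
```

The word-level biasing list stays independent of any subword symbol table, which is the property the abstract highlights; mapping the bonus onto subword units during decoding is where the paper's fast and scalable algorithm comes in.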
Related papers
- InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions [5.50485371072671]
Our method improves the recognition accuracy of misrecognized target keywords by substituting intermediate CTC predictions with corrected labels.
Experiments conducted in Japanese demonstrated that our method successfully improved the F1 score for unknown words.
arXiv Detail & Related papers (2024-06-21T06:25:10Z) - Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance [62.15866177242207]
We show that through constructing a subject-agnostic condition, one could obtain outputs consistent with both the given subject and input text prompts.
Our approach is conceptually simple and requires only minimal code modifications, but leads to substantial quality improvements.
arXiv Detail & Related papers (2024-05-02T15:03:41Z) - Personalization of CTC-based End-to-End Speech Recognition Using
Pronunciation-Driven Subword Tokenization [7.259999144975082]
We describe our personalization solution for an end-to-end speech recognition system based on connectionist temporal classification.
We show that using this technique in combination with two established techniques, contextual biasing and wordpiece prior normalization, we are able to achieve personal named entity accuracy on par with a competitive hybrid system.
arXiv Detail & Related papers (2023-10-16T00:06:32Z) - SememeASR: Boosting Performance of End-to-End Speech Recognition against
Domain and Long-Tailed Data Shift with Sememe Semantic Knowledge [58.979490858061745]
We introduce sememe-based semantic knowledge information to speech recognition.
Our experiments show that sememe information can improve the effectiveness of speech recognition.
In addition, our further experiments show that sememe knowledge can improve the model's recognition of long-tailed data.
arXiv Detail & Related papers (2023-09-04T08:35:05Z) - Personalization for BERT-based Discriminative Speech Recognition
Rescoring [13.58828513686159]
We present three novel approaches that use personalized content in a neural rescoring step to improve recognition: gazetteers, prompting, and a cross-attention based encoder-decoder model.
On a test set with personalized named entities, we show that each of these approaches improves word error rate by over 10% relative to a neural rescoring baseline.
arXiv Detail & Related papers (2023-07-13T15:54:32Z) - Robust Acoustic and Semantic Contextual Biasing in Neural Transducers
for Speech Recognition [14.744220870243932]
We propose to use lightweight character representations to encode fine-grained pronunciation features to improve contextual biasing.
We further integrate pretrained neural language model (NLM) based encoders to encode the utterance's semantic context.
Experiments using a Conformer Transducer model on the Librispeech dataset show a 4.62% - 9.26% relative WER improvement on different biasing list sizes.
arXiv Detail & Related papers (2023-05-09T08:51:44Z) - TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation [53.974228542090046]
Contrastive Language-Image Pre-training (CLIP) has recently shown great promise in pixel-level zero-shot learning tasks.
Existing approaches utilizing CLIP's text and patch embeddings to generate semantic masks often misidentify input pixels from unseen classes.
We propose TagCLIP (Trusty-aware guided CLIP) to address this issue.
arXiv Detail & Related papers (2023-04-15T12:52:23Z) - Exploring Structured Semantic Prior for Multi Label Recognition with
Incomplete Labels [60.675714333081466]
Multi-label recognition (MLR) with incomplete labels is very challenging.
Recent works strive to explore the image-to-label correspondence in the vision-language model, i.e., CLIP, to compensate for insufficient annotations.
We advocate remedying the deficiency of label supervision for the MLR with incomplete labels by deriving a structured semantic prior.
arXiv Detail & Related papers (2023-03-23T12:39:20Z) - End-to-end contextual asr based on posterior distribution adaptation for
hybrid ctc/attention system [61.148549738631814]
End-to-end (E2E) speech recognition architectures assemble all components of traditional speech recognition system into a single model.
Although it simplifies the ASR system, it introduces a contextual ASR drawback: the E2E model performs worse on utterances containing infrequent proper nouns.
We propose to add a contextual bias attention (CBA) module to attention based encoder decoder (AED) model to improve its ability of recognizing the contextual phrases.
arXiv Detail & Related papers (2022-02-18T03:26:02Z) - Transformer-Based Approach for Joint Handwriting and Named Entity
Recognition in Historical documents [1.7491858164568674]
This work presents the first approach that adopts the transformer networks for named entity recognition in handwritten documents.
We achieve the new state-of-the-art performance in the ICDAR 2017 Information Extraction competition using the Esposalles database.
arXiv Detail & Related papers (2021-12-08T09:26:21Z) - Cross-domain Speech Recognition with Unsupervised Character-level
Distribution Matching [60.8427677151492]
We propose CMatch, a Character-level distribution matching method to perform fine-grained adaptation between each character in two domains.
Experiments on the Libri-Adapt dataset show that our proposed approach achieves 14.39% and 16.50% relative Word Error Rate (WER) reduction on both cross-device and cross-environment ASR.
arXiv Detail & Related papers (2021-04-15T14:36:54Z)
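Several of the results above (e.g. the 14.39% and 16.50% figures) are reported as relative WER reduction. As a quick reference, the sketch below computes WER via word-level edit distance and the relative reduction between a baseline and an improved system; the function names and example numbers are illustrative, not taken from any of the papers.

```python
def wer(ref_words, hyp_words):
    """Word error rate: edit distance (substitutions, insertions,
    deletions) divided by the reference length."""
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[-1][-1] / len(ref_words)

def relative_wer_reduction(baseline, improved):
    """Relative WER reduction in percent, the metric quoted above."""
    return (baseline - improved) / baseline * 100.0

# A baseline WER of 10.0% improved to 8.6% is a 14% relative reduction.
print(round(relative_wer_reduction(0.10, 0.086), 1))  # 14.0
```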
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.