Personalization of CTC-based End-to-End Speech Recognition Using
Pronunciation-Driven Subword Tokenization
- URL: http://arxiv.org/abs/2310.09988v1
- Date: Mon, 16 Oct 2023 00:06:32 GMT
- Title: Personalization of CTC-based End-to-End Speech Recognition Using
Pronunciation-Driven Subword Tokenization
- Authors: Zhihong Lei, Ernest Pusateri, Shiyi Han, Leo Liu, Mingbin Xu, Tim Ng,
Ruchir Travadi, Youyuan Zhang, Mirko Hannemann, Man-Hung Siu, Zhen Huang
- Abstract summary: We describe our personalization solution for an end-to-end speech recognition system based on connectionist temporal classification.
We show that using this technique in combination with two established techniques, contextual biasing and wordpiece prior normalization, we are able to achieve personal named entity accuracy on par with a competitive hybrid system.
- Score: 7.259999144975082
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in deep learning and automatic speech recognition have
improved the accuracy of end-to-end speech recognition systems, but recognition
of personal content such as contact names remains a challenge. In this work, we
describe our personalization solution for an end-to-end speech recognition
system based on connectionist temporal classification. Building on previous
work, we present a novel method for generating additional subword tokenizations
for personal entities from their pronunciations. We show that using this
technique in combination with two established techniques, contextual biasing
and wordpiece prior normalization, we are able to achieve personal named entity
accuracy on par with a competitive hybrid system.
Related papers
- InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions [5.50485371072671]
Our method improves the recognition accuracy of misrecognized target keywords by substituting intermediate CTC predictions with corrected labels.
Experiments conducted in Japanese demonstrated that our method successfully improved the F1 score for unknown words.
arXiv Detail & Related papers (2024-06-21T06:25:10Z) - Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo
Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z) - Towards End-to-end Unsupervised Speech Recognition [120.4915001021405]
We introduce wvu which does away with all audio-side pre-processing and improves accuracy through better architecture.
In addition, we introduce an auxiliary self-supervised objective that ties model predictions back to the input.
Experiments show that wvuimproves unsupervised recognition results across different languages while being conceptually simpler.
arXiv Detail & Related papers (2022-04-05T21:22:38Z) - Instant One-Shot Word-Learning for Context-Specific Neural
Sequence-to-Sequence Speech Recognition [62.997667081978825]
We present an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly.
In this paper we demonstrate that through this mechanism our system is able to recognize more than 85% of newly added words that it previously failed to recognize.
arXiv Detail & Related papers (2021-07-05T21:08:34Z) - Personalization Strategies for End-to-End Speech Recognition Systems [12.993241217354322]
We show how first and second-pass rescoring strategies can be leveraged together to improve the recognition of personalized words.
We show that such an approach can improve personalized content recognition by up to 16% with minimum degradation on the general use case.
We also describe a novel second-pass de-biasing approach: used in conjunction with a first-pass shallow fusion that optimize on oracle WER.
arXiv Detail & Related papers (2021-02-15T18:36:13Z) - Leveraging Acoustic and Linguistic Embeddings from Pretrained speech and
language Models for Intent Classification [81.80311855996584]
We propose a novel intent classification framework that employs acoustic features extracted from a pretrained speech recognition system and linguistic features learned from a pretrained language model.
We achieve 90.86% and 99.07% accuracy on ATIS and Fluent speech corpus, respectively.
arXiv Detail & Related papers (2021-02-15T07:20:06Z) - A Machine of Few Words -- Interactive Speaker Recognition with
Reinforcement Learning [35.36769027019856]
We present a new paradigm for automatic speaker recognition that we call Interactive Speaker Recognition (ISR)
In this paradigm, the recognition system aims to incrementally build a representation of the speakers by requesting personalized utterances.
We show that our method achieves excellent performance while using little speech signal amounts.
arXiv Detail & Related papers (2020-08-07T12:44:08Z) - Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition.
We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy.
arXiv Detail & Related papers (2020-04-13T17:16:56Z) - Techniques for Vocabulary Expansion in Hybrid Speech Recognition Systems [54.49880724137688]
The problem of out of vocabulary words (OOV) is typical for any speech recognition system.
One of the popular approach to cover OOVs is to use subword units rather then words.
In this paper we explore different existing methods of this solution on both graph construction and search method levels.
arXiv Detail & Related papers (2020-03-19T21:24:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.