Robust Acoustic and Semantic Contextual Biasing in Neural Transducers
for Speech Recognition
- URL: http://arxiv.org/abs/2305.05271v1
- Date: Tue, 9 May 2023 08:51:44 GMT
- Title: Robust Acoustic and Semantic Contextual Biasing in Neural Transducers
for Speech Recognition
- Authors: Xuandi Fu, Kanthashree Mysore Sathyendra, Ankur Gandhe, Jing Liu,
Grant P. Strimel, Ross McGowan, Athanasios Mouchtaris
- Abstract summary: We propose to use lightweight character representations to encode fine-grained pronunciation features to improve contextual biasing.
We further integrate pretrained neural language model (NLM) based encoders to encode the utterance's semantic context.
Experiments using a Conformer Transducer model on the Librispeech dataset show a 4.62% - 9.26% relative WER improvement on different biasing list sizes.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attention-based contextual biasing approaches have shown significant
improvements in the recognition of generic and/or personal rare-words in
End-to-End Automatic Speech Recognition (E2E ASR) systems like neural
transducers. These approaches employ cross-attention to bias the model towards
specific contextual entities injected as bias-phrases to the model. Prior
approaches typically relied on subword encoders for encoding the bias phrases.
However, subword tokenizations are coarse and fail to capture granular
pronunciation information which is crucial for biasing based on acoustic
similarity. In this work, we propose to use lightweight character
representations to encode fine-grained pronunciation features to improve
contextual biasing guided by acoustic similarity between the audio and the
contextual entities (termed acoustic biasing). We further integrate pretrained
neural language model (NLM) based encoders to encode the utterance's semantic
context along with contextual entities to perform biasing informed by the
utterance's semantic context (termed semantic biasing). Experiments using a
Conformer Transducer model on the Librispeech dataset show a 4.62% - 9.26%
relative WER improvement on different biasing list sizes over the baseline
contextual model when incorporating our proposed acoustic and semantic biasing
approach. On a large-scale in-house dataset, we observe 7.91% relative WER
improvement compared to our baseline model. On tail utterances, the
improvements are even more pronounced with 36.80% and 23.40% relative WER
improvements on Librispeech rare words and an in-house testset respectively.
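The core mechanism the abstract describes, cross-attention from audio-encoder frames to encoded bias phrases, can be illustrated with a minimal sketch. Everything below is a toy illustration, not the authors' implementation: the hash-based character embeddings and the single, unlearned attention head stand in for the paper's learned lightweight character encoder and trained cross-attention layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def char_encode(phrase, dim=8):
    """Toy character-level phrase encoder: mean of fixed per-character
    random vectors (hypothetical stand-in for the paper's learned
    lightweight character encoder)."""
    vecs = []
    for ch in phrase.lower():
        g = np.random.default_rng(abs(hash(ch)) % (2**32))
        vecs.append(g.standard_normal(dim))
    return np.mean(vecs, axis=0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bias_cross_attention(audio_frames, bias_phrases, dim=8):
    """Cross-attention with audio frames as queries and bias-phrase
    embeddings as keys/values; the attended context vector is added
    back to the encoder output, nudging it toward the entities."""
    K = np.stack([char_encode(p, dim) for p in bias_phrases])  # (B, d)
    scores = audio_frames @ K.T / np.sqrt(dim)                 # (T, B)
    attn = softmax(scores, axis=-1)                            # (T, B)
    context = attn @ K                                         # (T, d)
    return audio_frames + context, attn

frames = rng.standard_normal((4, 8))  # 4 audio frames of dimension 8
biased, attn = bias_cross_attention(frames, ["kanthashree", "conformer"])
print(biased.shape, attn.shape)
```

In the paper's setting the character-level encoding is what lets acoustically similar spellings attend to the right bias phrase; here the per-character averaging only gestures at that granularity.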
Related papers
- Contextual Biasing to Improve Domain-specific Custom Vocabulary Audio Transcription without Explicit Fine-Tuning of Whisper Model [0.0]
OpenAI's Whisper Automated Speech Recognition model excels in generalizing across diverse datasets and domains.
We propose a method to enhance transcription accuracy without explicit fine-tuning or altering model parameters.
arXiv Detail & Related papers (2024-10-24T01:58:11Z)
- Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation [27.057810339120664]
We propose two techniques to improve context-aware ASR models.
On LibriSpeech, our techniques together reduce the rare word error rate by 60% and 25% relative compared to no biasing and shallow fusion, respectively.
On SPGISpeech and a real-world dataset ConEC, our techniques also yield good improvements over the baselines.
arXiv Detail & Related papers (2024-07-14T19:32:33Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, LLMs can use their generative capability to correct even tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Adaptive Contextual Biasing for Transducer Based Streaming Speech Recognition [21.90433428015086]
Deep biasing methods have emerged as a promising solution for speech recognition of personalized words.
For real-world voice assistants, always biasing on such words with high prediction scores can significantly degrade the performance of recognizing common words.
We propose an adaptive contextual biasing method based on the Context-Aware Transformer (CATT), which uses the biased encoder and predictors to perform streaming prediction of contextual phrase occurrences.
arXiv Detail & Related papers (2023-06-01T15:33:30Z)
- CB-Conformer: Contextual biasing Conformer for biased word recognition [33.28780163232423]
We introduce the Contextual Biasing Module and the Self-Adaptive Language Model to vanilla Conformer.
Our proposed method brings a 15.34% character error rate reduction, a 14.13% biased word recall increase, and a 6.80% biased word F1-score increase compared with the base Conformer.
arXiv Detail & Related papers (2023-04-19T12:26:04Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- End-to-end Contextual ASR Based on Posterior Distribution Adaptation for Hybrid CTC/Attention System [61.148549738631814]
End-to-end (E2E) speech recognition architectures assemble all components of a traditional speech recognition system into a single model.
Although this simplifies the ASR system, it introduces a drawback for contextual ASR: the E2E model performs worse on utterances containing infrequent proper nouns.
We propose to add a contextual bias attention (CBA) module to the attention-based encoder-decoder (AED) model to improve its ability to recognize contextual phrases.
arXiv Detail & Related papers (2022-02-18T03:26:02Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts power-set encoded labels.
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and Backward Transformers [49.403414751667135]
This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR).
The proposed method re-defines the speech-to-text alignment as a label-synchronous text mapping problem.
Experiments using the corpus of spontaneous Japanese (CSJ) demonstrate that the proposed method provides an accurate utterance-wise alignment.
arXiv Detail & Related papers (2021-04-21T03:05:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.