Post-decoder Biasing for End-to-End Speech Recognition of Multi-turn Medical Interview
- URL: http://arxiv.org/abs/2403.00370v1
- Date: Fri, 1 Mar 2024 08:53:52 GMT
- Title: Post-decoder Biasing for End-to-End Speech Recognition of Multi-turn Medical Interview
- Authors: Heyang Liu, Yu Wang and Yanfeng Wang
- Abstract summary: The end-to-end (E2E) approach is gradually replacing hybrid models for automatic speech recognition (ASR) tasks.
We propose a novel approach, post-decoder biasing, which constructs a transform probability matrix based on the distribution of training transcriptions.
In our experiments, for subsets of rare words appearing in the training speech between 10 and 20 times, and between 1 and 5 times, the proposed method achieves relative improvements of 9.3% and 5.1%, respectively.
- Score: 26.823126615724888
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The end-to-end (E2E) approach is gradually replacing hybrid models for automatic speech recognition (ASR) tasks. However, the optimization of E2E models lacks an intuitive method for handling decoding shifts, especially in scenarios with a large number of domain-specific rare words that carry specific, important meanings. Furthermore, the absence of knowledge-intensive speech datasets in academia has been a significant limiting factor, and commonly used speech corpora differ markedly from realistic conversation. To address these challenges, we present Medical Interview (MED-IT), a multi-turn consultation speech dataset that contains a substantial number of knowledge-intensive named entities. We also explore methods to enhance the recognition performance of rare words for E2E models. We propose a novel approach, post-decoder biasing, which constructs a transform probability matrix based on the distribution of training transcriptions. This guides the model to prioritize recognizing words in the biasing list. In our experiments, for subsets of rare words appearing in the training speech between 10 and 20 times, and between 1 and 5 times, the proposed method achieves relative improvements of 9.3% and 5.1%, respectively.
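The abstract does not spell out how the transform probability matrix is built or applied. The snippet below is a minimal, hypothetical sketch in PyTorch of one plausible reading: a row-stochastic matrix derived from training-transcription token counts re-scores the decoder's output distribution toward tokens in the biasing list. All names and the strength parameter are illustrative assumptions, not the paper's implementation.

import torch

def build_transform_matrix(token_counts: torch.Tensor,
                           biasing_ids: torch.Tensor,
                           strength: float = 0.3) -> torch.Tensor:
    """Row-stochastic (V x V) matrix that moves a fraction of every token's
    probability mass onto biasing-list tokens, weighted by how often those
    tokens appear in the training transcriptions (an assumed construction)."""
    vocab_size = token_counts.numel()
    weights = token_counts[biasing_ids].float()
    weights = weights / weights.sum()            # distribution over the biasing list
    transform = (1.0 - strength) * torch.eye(vocab_size)
    transform[:, biasing_ids] += strength * weights
    return transform                             # each row still sums to 1

def post_decoder_bias(posteriors: torch.Tensor,
                      transform: torch.Tensor) -> torch.Tensor:
    """Re-score decoder posteriors: (batch, time, V) @ (V, V)."""
    return posteriors @ transform

Because the transform is row-stochastic, the re-scored outputs remain valid distributions; a strength-like knob would control the trade-off between rare-word recall and regressions on common words.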
Related papers
- An Effective Context-Balanced Adaptation Approach for Long-Tailed Speech Recognition [10.234673954430221]
We study how altering the context list to contain words with different frequency distributions affects model performance.
A series of experiments conducted on the AISHELL-1 benchmark dataset suggests that using all vocabulary words from the training corpus as the context list and pairing them with our balanced objective yields the best performance.
arXiv Detail & Related papers (2024-09-10T12:52:36Z)
- Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition [1.0690007351232649]
We propose a layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require any prior knowledge about the target accent.
Experiment results demonstrate that our proposed methods outperform the baseline, with relative reductions of 22.1% and 17.2% in character error rate (CER) across multi-accent test datasets.
arXiv Detail & Related papers (2024-07-03T11:35:52Z)
- Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at tens of thousands of samples per second, contain considerable redundancy.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
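As a concrete illustration of the de-duplication step mentioned in this entry, the snippet below collapses runs of identical discrete unit IDs; subword modeling (e.g., BPE trained over the collapsed sequences) would then compress them further. The unit values are invented for the example.

from itertools import groupby

def deduplicate(units: list[int]) -> list[int]:
    """Collapse consecutive repeats of the same discrete speech unit."""
    return [unit for unit, _ in groupby(units)]

# Self-supervised units are often repeated across adjacent frames:
frames = [17, 17, 17, 4, 4, 96, 96, 96, 96, 4]
print(deduplicate(frames))  # [17, 4, 96, 4]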
- Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation [22.38340990398735]
We propose a novel data augmentation method that applies a text-based speech editing model.
The experimental results on code-switching and NER tasks show that our proposed method significantly outperforms audio-splicing and neural-TTS-based data augmentation systems.
arXiv Detail & Related papers (2023-06-14T15:50:13Z)
- Robust Acoustic and Semantic Contextual Biasing in Neural Transducers for Speech Recognition [14.744220870243932]
We propose to use lightweight character representations to encode fine-grained pronunciation features to improve contextual biasing.
We further integrate pretrained neural language model (NLM) based encoders to encode the utterance's semantic context.
Experiments using a Conformer Transducer model on the Librispeech dataset show a 4.62% - 9.26% relative WER improvement on different biasing list sizes.
arXiv Detail & Related papers (2023-05-09T08:51:44Z)
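This entry only names the idea of lightweight character representations for biasing phrases. The following is a hypothetical PyTorch sketch of such an encoder (the vocabulary size, dimensions, and GRU choice are all assumptions), producing one embedding per phrase that a biasing mechanism could attend over.

import torch
import torch.nn as nn

class CharPhraseEncoder(nn.Module):
    """Encode each biasing phrase from its characters into a single vector,
    a lightweight proxy for fine-grained pronunciation features."""
    def __init__(self, n_chars: int = 128, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(n_chars, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (n_phrases, max_chars) -> (n_phrases, d_model)
        _, hidden = self.rnn(self.embed(char_ids))
        return hidden[-1]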
- Introducing Semantics into Speech Encoders [91.37001512418111]
We propose an unsupervised way of incorporating semantic information from large language models into self-supervised speech encoders without labeled audio transcriptions.
Our approach achieves similar performance as supervised methods trained on over 100 hours of labeled audio transcripts.
arXiv Detail & Related papers (2022-11-15T18:44:28Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- End-to-end contextual ASR based on posterior distribution adaptation for hybrid CTC/attention system [61.148549738631814]
End-to-end (E2E) speech recognition architectures assemble all components of a traditional speech recognition system into a single model.
Although this simplifies the ASR system, it introduces a contextual ASR drawback: the E2E model performs worse on utterances containing infrequent proper nouns.
We propose adding a contextual bias attention (CBA) module to the attention-based encoder-decoder (AED) model to improve its ability to recognize contextual phrases.
arXiv Detail & Related papers (2022-02-18T03:26:02Z)
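A contextual bias attention module of this kind is essentially cross-attention from decoder states to encoded biasing phrases. The sketch below shows that general shape in PyTorch; the head count, residual fusion, and the source of phrase embeddings (e.g., a character encoder like the one sketched earlier) are assumptions, not the paper's exact design.

import torch
import torch.nn as nn

class ContextualBiasAttention(nn.Module):
    """Cross-attention from decoder states to biasing-phrase embeddings,
    fused back into the decoder stream with a residual connection."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, decoder_states: torch.Tensor,
                phrase_embeddings: torch.Tensor) -> torch.Tensor:
        # decoder_states: (batch, time, d); phrase_embeddings: (batch, n_phrases, d)
        context, _ = self.attn(decoder_states, phrase_embeddings, phrase_embeddings)
        return decoder_states + context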
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high out-of-vocabulary (OOV) rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer achieves a competitive result, with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
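BPE-dropout makes subword segmentation stochastic, so the same word maps to varying acoustic unit sequences across training epochs. Below is a minimal sketch using the sentencepiece library's sampling API; the model file name and alpha value are placeholders, and this is an illustration of the technique rather than the paper's pipeline.

import sentencepiece as spm

# Assumes a subword model already trained on the ASR transcripts;
# the file name is a placeholder.
sp = spm.SentencePieceProcessor(model_file="units.model")

def sample_segmentation(text: str, alpha: float = 0.1) -> list[str]:
    """Draw one stochastic segmentation (BPE-dropout style): repeated
    calls can split the same text into different unit sequences."""
    return sp.encode(text, out_type=str,
                     enable_sampling=True, alpha=alpha, nbest_size=-1)

for _ in range(3):
    print(sample_segmentation("merhaba dünya"))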
- Streaming end-to-end multi-talker speech recognition [34.76106500736099]
We propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition.
Our model employs the Recurrent Neural Network Transducer (RNN-T) as the backbone that can meet various latency constraints.
Based on experiments on the publicly available LibriSpeechMix dataset, we show that heuristic error assignment training (HEAT) achieves better accuracy than permutation invariant training (PIT).
arXiv Detail & Related papers (2020-11-26T06:28:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.