Robust Prediction of Punctuation and Truecasing for Medical ASR
- URL: http://arxiv.org/abs/2007.02025v2
- Date: Sat, 11 Jul 2020 17:53:01 GMT
- Title: Robust Prediction of Punctuation and Truecasing for Medical ASR
- Authors: Monica Sunkara, Srikanth Ronanki, Kalpit Dixit, Sravan Bodapati,
Katrin Kirchhoff
- Abstract summary: This paper proposes a conditional joint modeling framework for prediction of punctuation and truecasing.
We also present techniques for domain and task specific adaptation by fine-tuning masked language models with medical domain data.
- Score: 18.08508027663331
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic speech recognition (ASR) systems in the medical domain that focus
on transcribing clinical dictations and doctor-patient conversations often pose
many challenges due to the complexity of the domain. ASR output typically
undergoes automatic punctuation to enable users to speak naturally, without
having to vocalise awkward and explicit punctuation commands, such as "period",
"add comma" or "exclamation point", while truecasing enhances user readability
and improves the performance of downstream NLP tasks. This paper proposes a
conditional joint modeling framework for prediction of punctuation and
truecasing using pretrained masked language models such as BERT, BioBERT and
RoBERTa. We also present techniques for domain and task specific adaptation by
fine-tuning masked language models with medical domain data. Finally, we
improve the robustness of the model against common errors made in ASR by
performing data augmentation. Experiments performed on dictation and
conversational style corpora show that our proposed model achieves ~5% absolute
improvement on ground truth text and ~10% improvement on ASR outputs over
baseline models under F1 metric.
Related papers
- Enhancing AAC Software for Dysarthric Speakers in e-Health Settings: An Evaluation Using TORGO [0.13108652488669734]
Individuals with cerebral palsy (CP) and amyotrophic lateral sclerosis (ALS) frequently face challenges with articulation, leading to dysarthria and resulting in atypical speech patterns.
In healthcare settings, coomunication breakdowns reduce the quality of care.
We found that state-of-the-art (SOTA) automatic speech recognition (ASR) technology like Whisper and Wav2vec2.0 marginalizes atypical speakers largely due to the lack of training data.
arXiv Detail & Related papers (2024-11-01T19:11:54Z) - It's Never Too Late: Fusing Acoustic Information into Large Language
Models for Automatic Speech Recognition [70.77292069313154]
Large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output.
In this work, we aim to overcome such a limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF)
arXiv Detail & Related papers (2024-02-08T07:21:45Z) - LibriSpeech-PC: Benchmark for Evaluation of Punctuation and
Capitalization Capabilities of end-to-end ASR Models [58.790604613878216]
We introduce a LibriSpeech-PC benchmark designed to assess the punctuation and capitalization prediction capabilities of end-to-end ASR models.
The benchmark includes a LibriSpeech-PC dataset with restored punctuation and capitalization, a novel evaluation metric called Punctuation Error Rate (PER) that focuses on punctuation marks, and initial baseline models.
arXiv Detail & Related papers (2023-10-04T16:23:37Z) - HyPoradise: An Open Baseline for Generative Speech Recognition with
Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
LLMs with reasonable prompt and its generative capability can even correct those tokens that are missing in N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z) - Streaming Punctuation: A Novel Punctuation Technique Leveraging
Bidirectional Context for Continuous Speech Recognition [0.8670827427401333]
We propose a streaming approach for punctuation or re-punctuation of ASR output using dynamic decoding windows.
The new system tackles over-segmentation issues, improving segmentation F0.5-score by 13.9%.
arXiv Detail & Related papers (2023-01-10T07:07:20Z) - Using External Off-Policy Speech-To-Text Mappings in Contextual
End-To-End Automated Speech Recognition [19.489794740679024]
We investigate the potential of leveraging external knowledge, particularly through off-policy key-value stores generated with text-to-speech methods.
In our approach, audio embeddings captured from text-to-speech, along with semantic text embeddings, are used to bias ASR.
Experiments on LibiriSpeech and in-house voice assistant/search datasets show that the proposed approach can reduce domain adaptation time by up to 1K GPU-hours.
arXiv Detail & Related papers (2023-01-06T22:32:50Z) - Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR)
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z) - Generating diverse and natural text-to-speech samples using a quantized
fine-grained VAE and auto-regressive prosody prior [53.69310441063162]
This paper proposes a sequential prior in a discrete latent space which can generate more naturally sounding samples.
We evaluate the approach using listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes.
arXiv Detail & Related papers (2020-02-06T12:35:50Z) - Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU)
We show that the error rates of off the shelf ASR and following LU systems can be reduced significantly by 14% relative with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.