Building an ASR Error Robust Spoken Virtual Patient System in a Highly
Class-Imbalanced Scenario Without Speech Data
- URL: http://arxiv.org/abs/2204.05183v1
- Date: Mon, 11 Apr 2022 15:13:33 GMT
- Title: Building an ASR Error Robust Spoken Virtual Patient System in a Highly
Class-Imbalanced Scenario Without Speech Data
- Authors: Vishal Sunder, Prashant Serai, Eric Fosler-Lussier
- Abstract summary: A Virtual Patient (VP) is a powerful tool for training medical students to take patient histories.
Our method shows significant improvements over strong baselines on the VP intent classification task at various word error rate settings.
- Score: 20.357369521268417
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A Virtual Patient (VP) is a powerful tool for training medical students to
take patient histories, where responding to a diverse set of spoken questions
is essential to simulate natural conversations with a student. The performance
of such a Spoken Language Understanding system (SLU) can be adversely affected
by both the presence of Automatic Speech Recognition (ASR) errors in the test
data and a high degree of class imbalance in the SLU training data. While these
two issues have been addressed separately in prior work, we develop a novel
two-step training methodology that tackles both these issues effectively in a
single dialog agent. As it is difficult to collect spoken data from users
without a functioning SLU system, our method does not rely on spoken data for
training; instead, we use an ASR error predictor to "speechify" the text data.
Our method shows significant improvements over strong baselines on the VP
intent classification task at various word error rate settings.
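The "speechify" step is the core trick: clean training text is replaced by plausible ASR hypotheses so the intent classifier never sees unrealistically clean input. The paper learns a phonetically informed ASR error predictor; the sketch below substitutes a much cruder stand-in that injects random word-level substitutions, deletions, and insertions at a target word error rate (the function name, confusion table, and corruption scheme are illustrative, not the authors' code):

```python
import random

def speechify(text, target_wer=0.2, confusions=None, rng=None):
    """Corrupt clean text so it resembles an ASR hypothesis at roughly
    target_wer. NOTE: the paper learns a phonetically informed error
    predictor; this uniform-random corruption is only a stand-in."""
    rng = rng or random.Random(0)
    confusions = confusions or {}          # e.g. {"pain": ["pane", "payne"]}
    out = []
    for word in text.split():
        if rng.random() < target_wer:
            op = rng.choice(["sub", "del", "ins"])
            if op == "sub":
                # Prefer a phonetically similar confusion when one is known.
                out.append(rng.choice(confusions.get(word, [word])))
            elif op == "ins":
                out.extend([word, word])   # naive insertion: duplicate the word
            # "del": append nothing, dropping the word
        else:
            out.append(word)
    return " ".join(out)

# Augment each clean utterance with noisy copies before training the
# intent classifier, e.g. at several simulated word error rates.
print(speechify("does the pain radiate to your left arm", target_wer=0.3))
```

Training on such corrupted copies at several simulated WERs is one way to approximate the robustness the abstract reports across word error rate settings.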
Related papers
- Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation [0.0]
A critical barrier to progress is the scarcity of large, annotated disfluent speech datasets.
We present an inclusive ASR design approach, leveraging self-supervised learning on standard speech followed by targeted fine-tuning and data augmentation.
Results show that fine-tuning wav2vec 2.0 with even a relatively small, labeled dataset, alongside data augmentation, can significantly reduce word error rates for disfluent speech.
arXiv Detail & Related papers (2024-06-14T16:56:40Z)
- Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks [68.79880423713597]
We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis.
Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts.
arXiv Detail & Related papers (2024-01-05T17:58:10Z)
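A minimal sketch of the confusion-network prompting idea from the entry above: linearize each lattice slot's word alternatives and posteriors into the prompt so the LLM sees ASR uncertainty rather than only the 1-best string (the slot format and prompt wording are my assumptions, not the paper's):

```python
# A word confusion network (WCN) compresses an ASR lattice into a sequence
# of slots, each holding competing words with posterior probabilities.
wcn = [
    [("i", 0.9), ("eye", 0.1)],
    [("have", 0.7), ("half", 0.3)],
    [("chest", 0.95), ("chess", 0.05)],
    [("pain", 1.0)],
]

def linearize_wcn(wcn, keep=2):
    """Render each slot as 'word(prob)|word(prob)' so alternatives survive."""
    slots = []
    for slot in wcn:
        top = sorted(slot, key=lambda x: -x[1])[:keep]
        slots.append("|".join(f"{w}({p:.2f})" for w, p in top))
    return " ".join(slots)

prompt = (
    "The ASR transcript below lists word alternatives with probabilities.\n"
    f"Transcript: {linearize_wcn(wcn)}\n"
    "Intent:"
)
print(prompt)  # fed to an LLM for in-context intent classification
```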
- ML-LMCL: Mutual Learning and Large-Margin Contrastive Learning for Improving ASR Robustness in Spoken Language Understanding [55.39105863825107]
We propose Mutual Learning and Large-Margin Contrastive Learning (ML-LMCL) to improve automatic speech recognition (ASR) robustness.
In fine-tuning, we apply mutual learning and train two SLU models on the manual transcripts and the ASR transcripts, respectively.
Experiments on three datasets show that ML-LMCL outperforms existing models and achieves new state-of-the-art performance.
arXiv Detail & Related papers (2023-11-19T16:53:35Z)
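The mutual-learning step in the entry above can be sketched as two SLU models, one per transcript view, each fitting the gold labels while being pulled toward the other's predictions; the paper's large-margin contrastive term is omitted for brevity (a rough PyTorch sketch, not the authors' implementation):

```python
import torch.nn.functional as F

def mutual_learning_step(model_clean, model_asr, clean_batch, asr_batch,
                         labels, alpha=0.5):
    """One fine-tuning step: each SLU model is trained on its own view
    (manual vs. ASR transcripts) with cross-entropy, plus a symmetric KL
    term that lets the two models teach each other."""
    logits_c = model_clean(clean_batch)   # [B, num_intents]
    logits_a = model_asr(asr_batch)       # [B, num_intents]

    ce = F.cross_entropy(logits_c, labels) + F.cross_entropy(logits_a, labels)

    # Symmetric KL between the two predictive distributions; each model's
    # target side is detached so it acts as a fixed teacher for the other.
    log_p_c = F.log_softmax(logits_c, dim=-1)
    log_p_a = F.log_softmax(logits_a, dim=-1)
    kl = (F.kl_div(log_p_c, log_p_a.exp().detach(), reduction="batchmean")
          + F.kl_div(log_p_a, log_p_c.exp().detach(), reduction="batchmean"))

    return ce + alpha * kl  # backprop through both models
```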
- The Far Side of Failure: Investigating the Impact of Speech Recognition Errors on Subsequent Dementia Classification [8.032686410648274]
Linguistic anomalies detectable in spontaneous speech have shown promise for various clinical applications including screening for dementia and other forms of cognitive impairment.
The impressive performance of self-supervised learning (SSL) automatic speech recognition (ASR) models on curated speech data does not carry over to challenging speech samples from clinical settings.
One of our key findings is that, paradoxically, ASR systems with relatively high error rates can produce transcripts that result in better downstream classification accuracy than classification based on verbatim transcripts.
arXiv Detail & Related papers (2022-11-11T17:06:45Z)
- Speech-text based multi-modal training with bidirectional attention for improved speech recognition [26.47071418582507]
We propose a novel bidirectional attention mechanism (BiAM) to jointly learn the ASR encoder (bottom layers) and a text encoder through multi-modal learning.
BiAM facilitates feature sampling rate exchange, allowing the quality of features transformed from one modality to be measured in the space of the other.
Experimental results on the Librispeech corpus show it achieves up to 6.15% word error rate reduction (WERR) with paired data alone, and 9.23% WERR when additional unpaired text data is employed.
arXiv Detail & Related papers (2022-11-01T08:25:11Z)
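One plausible reading of the entry above is a pair of cross-attention blocks, one per direction, so each modality can be re-rendered at the other's sequence length; that is how I interpret the "sampling rate exchange", and the sketch below reflects that reading rather than the paper's exact BiAM:

```python
import torch
import torch.nn as nn

class BiAttention(nn.Module):
    """Two cross-attention directions: text queries pool speech frames down
    to token length, and speech queries spread text information out to
    frame length. Dimensions and head counts are arbitrary choices here."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.speech_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_to_speech = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, speech_feats, text_feats):
        # speech_feats: [B, T_frames, d]; text_feats: [B, T_tokens, d]
        text_like, _ = self.speech_to_text(text_feats, speech_feats, speech_feats)
        speech_like, _ = self.text_to_speech(speech_feats, text_feats, text_feats)
        return speech_like, text_like

biam = BiAttention()
s, t = torch.randn(2, 100, 256), torch.randn(2, 12, 256)
speech_like, text_like = biam(s, t)
print(text_like.shape, speech_like.shape)  # [2, 12, 256] and [2, 100, 256]
```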
- An Approach to Improve Robustness of NLP Systems against ASR Errors [39.57253455717825]
Speech-enabled systems typically first convert audio to text through an automatic speech recognition model and then feed the text to downstream natural language processing modules.
The errors of the ASR system can seriously downgrade the performance of the NLP modules.
Previous work has shown it is effective to employ data augmentation methods to solve this problem by injecting ASR noise during the training process.
arXiv Detail & Related papers (2021-03-25T05:15:43Z)
- Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)
- Data Augmentation for Training Dialog Models Robust to Speech Recognition Errors [5.53506103787497]
Speech-based virtual assistants, such as Amazon Alexa, Google Assistant, and Apple Siri, typically convert users' audio signals to text data through automatic speech recognition (ASR).
The ASR output is error-prone; however, the downstream dialog models are often trained on error-free text data, making them sensitive to ASR errors at inference time.
We leverage an ASR error simulator to inject noise into the error-free text data, and subsequently train the dialog models with the augmented data.
arXiv Detail & Related papers (2020-06-10T03:18:15Z)
- Data Augmentation for Spoken Language Understanding via Pretrained Language Models [113.56329266325902]
Training of spoken language understanding (SLU) models often faces the problem of data scarcity.
We put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances.
arXiv Detail & Related papers (2020-04-29T04:07:12Z)
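The entry above fine-tunes a pretrained LM on labeled utterances before generating; the minimal sketch below skips the fine-tuning and only shows label-conditioned sampling with an off-the-shelf GPT-2, so treat the prompt format and output quality as assumptions:

```python
from transformers import pipeline

# Sample new utterance variants for a scarce intent class. The paper
# fine-tunes the LM on (intent, utterance) pairs first; off-the-shelf
# GPT-2 with a few-shot style prompt is only a rough approximation.
generator = pipeline("text-generation", model="gpt2")

prompt = ("intent: ask_pain_location | utterance: where exactly does it hurt\n"
          "intent: ask_pain_location | utterance:")
samples = generator(prompt, max_new_tokens=20, num_return_sequences=3,
                    do_sample=True, temperature=0.9)
for s in samples:
    # Keep only the first generated line as a candidate training utterance.
    print(s["generated_text"][len(prompt):].split("\n")[0].strip())
```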
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
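A rough guess at the joint objective in the entry above: a shared encoder over the ASR output feeds both a correction head and a language-understanding head, trained with a weighted loss sum (the heads, pooling, and weighting below are illustrative, not the paper's architecture):

```python
import torch.nn.functional as F

def joint_correction_lu_loss(encoder, correction_head, intent_head,
                             asr_tokens, gold_tokens, intent_labels, lam=1.0):
    """Multi-task step: correct the ASR transcript token-by-token and
    classify the utterance intent from the same contextual encoding."""
    h = encoder(asr_tokens)                        # [B, T, d] shared states
    corr_logits = correction_head(h)               # [B, T, vocab]
    intent_logits = intent_head(h.mean(dim=1))     # [B, num_intents]

    # Token-level correction loss against the reference transcript.
    loss_corr = F.cross_entropy(corr_logits.transpose(1, 2), gold_tokens)
    # Utterance-level language-understanding loss.
    loss_lu = F.cross_entropy(intent_logits, intent_labels)
    return loss_corr + lam * loss_lu
```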