Joint Contextual Modeling for ASR Correction and Language Understanding
- URL: http://arxiv.org/abs/2002.00750v1
- Date: Tue, 28 Jan 2020 22:09:25 GMT
- Title: Joint Contextual Modeling for ASR Correction and Language Understanding
- Authors: Yue Weng, Sai Sumanth Miryala, Chandra Khatri, Runze Wang, Huaixiu
Zheng, Piero Molino, Mahdi Namazifar, Alexandros Papangelis, Hugh Williams,
Franziska Bell, Gokhan Tur
- Abstract summary: We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU). We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained on small amounts of in-domain data.
- Score: 60.230013453699975
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The quality of automatic speech recognition (ASR) is critical to Dialogue
Systems as ASR errors propagate to and directly impact downstream tasks such as
language understanding (LU). In this paper, we propose multi-task neural
approaches to perform contextual language correction on ASR outputs jointly
with LU to improve the performance of both tasks simultaneously. To measure the
effectiveness of this approach, we used a public benchmark, the 2nd Dialogue
State Tracking Challenge (DSTC2) corpus. As a baseline approach, we trained
task-specific Statistical Language Models (SLM) and fine-tuned a state-of-the-art
Generative Pre-training (GPT) language model to re-rank the n-best ASR
hypotheses, followed by a model to identify the dialogue act and slots. i) We
further trained ranker models using GPT and hierarchical CNN-RNN models with
discriminative losses to detect the best output among the n-best hypotheses. We
extended these ranker models to first select the best ASR output and then
identify the dialogue act and slots in an end-to-end fashion. ii) We also
proposed a novel joint ASR error correction and LU model, a word confusion
pointer network (WCN-Ptr) with multi-head self-attention on top, which consumes
the word confusions populated from the n-best hypotheses. We show that the error
rates of off-the-shelf ASR and downstream LU systems can be reduced
significantly, by 14% relative, with joint models trained on small amounts of
in-domain data.
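
To make the re-ranking baseline concrete, the snippet below is a minimal sketch
that scores n-best ASR hypotheses with a pre-trained GPT-style language model
and keeps the most fluent one. It uses an off-the-shelf Hugging Face GPT-2 as a
stand-in for the fine-tuned GPT model described above; the model choice, the
per-token scoring, and the example hypotheses are illustrative assumptions, not
the authors' exact setup.

    # Minimal n-best re-ranking sketch: the hypothesis with the lowest average
    # negative log-likelihood under the language model wins. GPT-2 stands in
    # for the fine-tuned GPT model; any causal LM could be swapped in.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def lm_score(text):
        """Mean negative log-likelihood per token (lower = more fluent)."""
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            return model(ids, labels=ids).loss.item()

    def rerank(nbest):
        """Return the hypothesis the language model considers most fluent."""
        return min(nbest, key=lm_score)

    # Illustrative DSTC2-style n-best list (made up for this example).
    nbest = [
        "i want a cheap restaurant in the north part of town",
        "i won a cheap restaurant in the north part of town",
        "i want a chief restaurant in the north part of town",
    ]
    print(rerank(nbest))

The joint models in the paper go further and also predict the dialogue act and
slots from the selected hypothesis; this snippet covers only the
hypothesis-selection step.
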
Related papers
- Crossmodal ASR Error Correction with Discrete Speech Units [16.58209270191005]
We propose a post-ASR processing approach for ASR Error Correction (AEC).
We explore pre-training and fine-tuning strategies and uncover an ASR domain discrepancy phenomenon.
We propose incorporating discrete speech units to align with and enhance the word embeddings, improving AEC quality.
arXiv Detail & Related papers (2024-05-26T19:58:38Z)
- Generative error correction for code-switching speech recognition using large language models [49.06203730433107]
Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence.
We propose to leverage large language models (LLMs) and lists of hypotheses generated by an ASR system to address the CS problem (a generic sketch of this prompting pattern appears after the list below).
arXiv Detail & Related papers (2023-10-17T14:49:48Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, an LLM's generative capability can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Towards Contextual Spelling Correction for Customization of End-to-end Speech Recognition Systems [27.483603895258437]
We introduce a novel approach to do contextual biasing by adding a contextual spelling correction model on top of the end-to-end ASR system.
We propose filtering algorithms to handle large-size context lists, and performance balancing mechanisms to control the biasing degree of the model.
Experiments show that the proposed method achieves up to 51% relative word error rate (WER) reduction over the baseline ASR system and outperforms traditional biasing methods.
arXiv Detail & Related papers (2022-03-02T06:00:48Z)
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
arXiv Detail & Related papers (2021-11-16T03:00:29Z)
- Cross-sentence Neural Language Models for Conversational Speech Recognition [17.317583079824423]
We propose an effective cross-sentence neural LM approach that reranks the ASR N-best hypotheses of an upcoming sentence.
We also explore extracting task-specific global topical information from the cross-sentence history.
arXiv Detail & Related papers (2021-06-13T05:30:16Z)
- An Approach to Improve Robustness of NLP Systems against ASR Errors [39.57253455717825]
Speech-enabled systems typically first convert audio to text through an automatic speech recognition model and then feed the text to downstream natural language processing modules.
The errors of the ASR system can seriously degrade the performance of the NLP modules.
Previous work has shown that data augmentation methods, which inject ASR noise during the training process, are effective at mitigating this problem (a minimal noise-injection sketch appears after the list below).
arXiv Detail & Related papers (2021-03-25T05:15:43Z)
- Fine-tuning of Pre-trained End-to-end Speech Recognition with Generative Adversarial Networks [10.723935272906461]
Adversarial training of end-to-end (E2E) ASR systems using generative adversarial networks (GAN) has recently been explored.
We introduce a novel framework for fine-tuning a pre-trained ASR model using the GAN objective.
Our proposed approach outperforms baselines and conventional GAN-based adversarial models.
arXiv Detail & Related papers (2021-03-10T17:40:48Z)
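
As a loose illustration of the generative-correction pattern shared by the two
LLM-based entries above (code-switching error correction and HyPoradise), the
sketch below builds a prompt from an ASR n-best list and delegates generation to
whatever LLM client is available. The prompt wording is an assumption, and the
helper complete() is a hypothetical callable rather than any specific library's
API.

    # Hypothetical sketch: ask an instruction-following LLM to produce a
    # corrected transcript from the ASR n-best list. Not tied to any API.
    def build_correction_prompt(nbest):
        hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
        return (
            "Below are the N-best hypotheses from a speech recogniser for one utterance.\n"
            f"{hyps}\n"
            "Output the most likely correct transcription, fixing recognition "
            "errors even if the correct words appear in none of the hypotheses."
        )

    def correct_with_llm(nbest, complete):
        """complete is any callable mapping a prompt string to the model's reply."""
        return complete(build_correction_prompt(nbest)).strip()
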
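For the robustness-to-ASR-errors entry, here is a minimal sketch of the
training-time noise-injection idea: corrupt clean training sentences with
plausible ASR confusions so that downstream NLP models see noisy input during
training. The confusion table and noise rate are made-up illustrations, not the
paper's actual error model.

    # Illustrative ASR-noise injection for data augmentation. The confusion
    # pairs are hypothetical; in practice they would be harvested from real
    # ASR errors on held-out audio.
    import random

    CONFUSIONS = {
        "want": ["won", "what"],
        "cheap": ["chief", "sheep"],
        "north": ["fourth"],
    }

    def inject_asr_noise(sentence, noise_rate=0.15, rng=random):
        """Randomly swap words for plausible ASR confusions."""
        words = []
        for word in sentence.split():
            if word in CONFUSIONS and rng.random() < noise_rate:
                words.append(rng.choice(CONFUSIONS[word]))
            else:
                words.append(word)
        return " ".join(words)

    print(inject_asr_noise("i want a cheap restaurant in the north part of town"))
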