Building Robust Spoken Language Understanding by Cross Attention between
Phoneme Sequence and ASR Hypothesis
- URL: http://arxiv.org/abs/2203.12067v1
- Date: Tue, 22 Mar 2022 21:59:29 GMT
- Title: Building Robust Spoken Language Understanding by Cross Attention between
Phoneme Sequence and ASR Hypothesis
- Authors: Zexun Wang, Yuquan Le, Yi Zhu, Yuming Zhao, Mingchao Feng, Meng Chen,
Xiaodong He
- Abstract summary: This paper proposes a novel model with Cross Attention for SLU (denoted as CASLU).
The cross attention block is devised to capture the fine-grained interactions between phoneme and word embeddings, so that the joint representations capture the phonetic and semantic features of the input simultaneously.
Extensive experiments are conducted on three datasets, showing the effectiveness and competitiveness of our approach.
- Score: 15.159439853075645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Building Spoken Language Understanding (SLU) robust to Automatic Speech
Recognition (ASR) errors is an essential issue for various voice-enabled
virtual assistants. Since most ASR errors stem from phonetic confusion
between similar-sounding expressions, leveraging the phoneme sequence of the
speech can intuitively complement the ASR hypothesis and enhance the
robustness of SLU. This paper proposes a novel model with Cross Attention for
SLU (denoted as CASLU). The cross attention block is devised to capture the
fine-grained interactions between phoneme and word embeddings, so that the
joint representations capture the phonetic and semantic features of the input
simultaneously and overcome ASR errors in downstream natural language
understanding (NLU) tasks. Extensive experiments are conducted on three
datasets, showing the effectiveness and competitiveness of our approach.
Additionally, we validate the universality of CASLU and demonstrate its
complementarity when combined with other robust SLU techniques.
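To make the mechanism concrete, here is a minimal numpy sketch of cross attention from ASR-hypothesis words to phonemes. It illustrates the general technique, not the paper's implementation: the learned query/key/value projections, multi-head structure, and downstream NLU classifier are omitted, and every name and shape here is an assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(word_emb, phoneme_emb):
    """Attend from word tokens (queries) to phoneme tokens (keys/values).

    word_emb:    (T_w, d) embeddings of the ASR hypothesis words
    phoneme_emb: (T_p, d) embeddings of the utterance's phoneme sequence
    Returns a (T_w, 2d) joint representation per word: its semantic
    embedding concatenated with a phoneme-aware (phonetic) summary.
    """
    d = word_emb.shape[-1]
    scores = word_emb @ phoneme_emb.T / np.sqrt(d)  # (T_w, T_p) similarities
    weights = softmax(scores, axis=-1)              # attention over phonemes
    attended = weights @ phoneme_emb                # (T_w, d) phonetic view
    return np.concatenate([word_emb, attended], axis=-1)

# Toy usage: 4 hypothesis words, 9 phonemes, 8-dim embeddings.
rng = np.random.default_rng(0)
joint = cross_attention(rng.normal(size=(4, 8)), rng.normal(size=(9, 8)))
print(joint.shape)  # (4, 16)
```

Because the weights are computed per word-phoneme pair, a misrecognized word can still pull in phonetic evidence of the sounds actually uttered, which is the intuition behind the robustness gain.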
Related papers
- Towards ASR Robust Spoken Language Understanding Through In-Context
Learning With Word Confusion Networks [68.79880423713597]
We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis.
Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts.
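As a toy illustration of the idea (the paper's actual prompt format is not given here, so this layout is an assumption), a word confusion network can be linearized into text so the LLM sees the alternatives and their posteriors rather than only the 1-best hypothesis:

```python
def linearize_wcn(wcn):
    """Render a word confusion network as text: one bracketed bin per
    time slot, listing alternatives with their posterior probabilities."""
    return " ".join(
        "[" + " | ".join(f"{w} ({p:.2f})" for w, p in bin_) + "]"
        for bin_ in wcn
    )

# Hypothetical WCN for "play some jazz" with phonetically confusable words.
wcn = [
    [("play", 0.80), ("pray", 0.20)],
    [("some", 0.60), ("sum", 0.40)],
    [("jazz", 0.95), ("jabs", 0.05)],
]
prompt = ("ASR output as a confusion network (alternatives with "
          "probabilities):\n" + linearize_wcn(wcn)
          + "\nWhat is the user's intent?")
print(prompt)
```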
arXiv Detail & Related papers (2024-01-05T17:58:10Z)
- ML-LMCL: Mutual Learning and Large-Margin Contrastive Learning for
Improving ASR Robustness in Spoken Language Understanding [55.39105863825107]
We propose Mutual Learning and Large-Margin Contrastive Learning (ML-LMCL) to improve automatic speech recognition (ASR) robustness.
In fine-tuning, we apply mutual learning and train two SLU models on the manual transcripts and the ASR transcripts, respectively.
Experiments on three datasets show that ML-LMCL outperforms existing models and achieves new state-of-the-art performance.
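A minimal sketch of the mutual-learning term, under the assumption that it is a symmetric KL agreement penalty between the two models' intent distributions (the paper's exact loss and weighting may differ):

```python
import numpy as np

def kl(p, q, eps=1e-9):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Hypothetical intent posteriors for one utterance: one model reads the
# manual transcript, the other reads the ASR transcript.
p_manual = np.array([0.70, 0.20, 0.10])
p_asr    = np.array([0.50, 0.35, 0.15])

# Each model's task loss would be augmented with this agreement term,
# pulling the ASR-side model toward the clean-transcript model and
# vice versa.
agreement = 0.5 * (kl(p_manual, p_asr) + kl(p_asr, p_manual))
print(f"agreement penalty: {agreement:.4f}")
```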
arXiv Detail & Related papers (2023-11-19T16:53:35Z)
- Multimodal Audio-textual Architecture for Robust Spoken Language
Understanding [18.702076738332867]
A multimodal language understanding (MLU) module is proposed to mitigate SLU performance degradation caused by errors in the ASR transcript.
Our model is evaluated on five tasks from three SLU datasets and robustness is tested using ASR transcripts from three ASR engines.
Results show that the proposed approach effectively mitigates the ASR error propagation problem, surpassing pre-trained language model (PLM) baselines across all datasets for the academic ASR engine.
arXiv Detail & Related papers (2023-06-12T01:55:53Z)
- Bridging Speech and Textual Pre-trained Models with Unsupervised ASR [70.61449720963235]
This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models.
We show that unsupervised automatic speech recognition (ASR) can improve the representations from speech self-supervised models.
Notably, on spoken question answering, we reach the state-of-the-art result over the challenging NMSQA benchmark.
arXiv Detail & Related papers (2022-11-06T04:50:37Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Pre-training for Spoken Language Understanding with Joint Textual and
Phonetic Representation Learning [4.327558819000435]
We propose a novel joint textual-phonetic pre-training approach for learning spoken language representations.
Experimental results on spoken language understanding benchmarks, Fluent Speech Commands and SNIPS, show that the proposed approach significantly outperforms strong baseline models.
arXiv Detail & Related papers (2021-04-21T05:19:13Z)
- Contextualized Attention-based Knowledge Transfer for Spoken
Conversational Question Answering [63.72278693825945]
Spoken conversational question answering (SCQA) requires machines to model complex dialogue flow.
We propose CADNet, a novel contextualized attention-based distillation approach.
We conduct extensive experiments on the Spoken-CoQA dataset and demonstrate that our approach achieves remarkable performance.
arXiv Detail & Related papers (2020-10-21T15:17:18Z)
- Speech To Semantics: Improve ASR and NLU Jointly via All-Neural
Interfaces [17.030832205343195]
We consider the problem of spoken language understanding (SLU) of extracting natural language intents from speech directed at voice assistants.
An end-to-end joint SLU model can be built to a required specification, opening up the opportunity to deploy it in hardware-constrained scenarios.
We show that the jointly trained model shows improvements to ASR incorporating semantic information from NLU and also improves NLU by exposing it to ASR confusion encoded in the hidden layer.
arXiv Detail & Related papers (2020-08-14T02:43:57Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
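The abstract does not spell out the joint objective, but multi-task training of this kind is commonly a weighted sum of the two losses; a hedged sketch with an assumed interpolation weight:

```python
def joint_loss(correction_loss, lu_loss, alpha=0.5):
    """Weighted multi-task objective over ASR correction and LU;
    alpha is a hypothetical interpolation weight, not the paper's value."""
    return alpha * correction_loss + (1.0 - alpha) * lu_loss

# Toy values for one batch: correction (e.g. token-level cross-entropy)
# and LU (e.g. intent classification) losses.
print(joint_loss(1.25, 0.40))  # 0.825
```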