Towards ASR Robust Spoken Language Understanding Through In-Context
Learning With Word Confusion Networks
- URL: http://arxiv.org/abs/2401.02921v1
- Date: Fri, 5 Jan 2024 17:58:10 GMT
- Title: Towards ASR Robust Spoken Language Understanding Through In-Context
Learning With Word Confusion Networks
- Authors: Kevin Everson, Yile Gu, Huck Yang, Prashanth Gurunath Shivakumar,
Guan-Ting Lin, Jari Kolehmainen, Ivan Bulyko, Ankur Gandhe, Shalini Ghosh,
Wael Hamza, Hung-yi Lee, Ariya Rastrow, Andreas Stolcke
- Abstract summary: We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis.
Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts.
- Score: 68.79880423713597
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In the realm of spoken language understanding (SLU), numerous natural
language understanding (NLU) methodologies have been adapted by supplying large
language models (LLMs) with transcribed speech instead of conventional written
text. In real-world scenarios, prior to input into an LLM, an automated speech
recognition (ASR) system generates an output transcript hypothesis, where
inherent errors can degrade subsequent SLU tasks. Here we introduce a method
that utilizes the ASR system's lattice output instead of relying solely on the
top hypothesis, aiming to encapsulate speech ambiguities and enhance SLU
outcomes. Our in-context learning experiments, covering spoken question
answering and intent classification, underline the LLM's resilience to noisy
speech transcripts with the help of word confusion networks from lattices,
bridging the SLU performance gap between using the top ASR hypothesis and an
oracle upper bound. Additionally, we delve into the LLM's robustness to varying
ASR performance conditions and scrutinize the aspects of in-context learning
which prove the most influential.
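The abstract describes feeding word confusion networks (WCNs) distilled from ASR lattices into an LLM via in-context prompting. As a rough illustration of that idea, the Python sketch below serializes a toy WCN (a sequence of slots, each holding word alternatives with posterior probabilities) into a text segment and assembles a few-shot prompt around it. The slot notation, top-k truncation, and prompt wording are illustrative assumptions, not the paper's exact format.

```python
# Hypothetical sketch: serializing a word confusion network (WCN) into a
# text prompt for in-context learning. The format below is an assumption
# for illustration, not the paper's exact prompting scheme.

from typing import List, Tuple

# A WCN is a sequence of "slots"; each slot holds competing word hypotheses
# from the ASR lattice together with their posterior probabilities.
WCN = List[List[Tuple[str, float]]]

def wcn_to_prompt_segment(wcn: WCN, top_k: int = 3) -> str:
    """Serialize each slot as 'word1(p1)|word2(p2)|...', keeping the top_k arcs."""
    slots = []
    for slot in wcn:
        best = sorted(slot, key=lambda wp: wp[1], reverse=True)[:top_k]
        slots.append("|".join(f"{w}({p:.2f})" for w, p in best))
    return " ".join(slots)

def build_icl_prompt(examples: List[Tuple[WCN, str]], query: WCN) -> str:
    """Assemble a few-shot prompt from (WCN, label) demonstrations plus the query."""
    blocks = ["Each input lists ASR word alternatives with confidences. Predict the intent."]
    for wcn, label in examples:
        blocks.append(f"Input: {wcn_to_prompt_segment(wcn)}\nIntent: {label}")
    blocks.append(f"Input: {wcn_to_prompt_segment(query)}\nIntent:")
    return "\n\n".join(blocks)

if __name__ == "__main__":
    demo: WCN = [[("play", 0.9), ("pray", 0.1)],
                 [("some", 0.6), ("sum", 0.4)],
                 [("jazz", 0.8), ("jars", 0.2)]]
    print(build_icl_prompt([(demo, "PlayMusic")], demo))
```

In this sketch the LLM sees competing word hypotheses and their confidences directly, rather than only the 1-best transcript, which is the intuition behind using WCNs to convey speech ambiguity.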
Related papers
- Self-Powered LLM Modality Expansion for Large Speech-Text Models [62.27700381806554]
Large language models (LLMs) exhibit remarkable performance across diverse tasks.
This study aims to refine the use of speech datasets for LSM training by addressing the limitations of vanilla instruction tuning.
We introduce a self-powered LSM that leverages augmented automatic speech recognition data generated by the model itself for more effective instruction tuning.
arXiv Detail & Related papers (2024-10-04T04:34:24Z)
- Large Language Models are Efficient Learners of Noise-Robust Speech Recognition [65.95847272465124]
Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR).
In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER.
Experiments on various latest LLMs demonstrate our approach achieves a new breakthrough with up to 53.9% correction improvement in terms of word error rate.
arXiv Detail & Related papers (2024-01-19T01:29:27Z)
- ML-LMCL: Mutual Learning and Large-Margin Contrastive Learning for Improving ASR Robustness in Spoken Language Understanding [55.39105863825107]
We propose Mutual Learning and Large-Margin Contrastive Learning (ML-LMCL) to improve automatic speech recognition (ASR) robustness.
In fine-tuning, we apply mutual learning and train two SLU models on the manual transcripts and the ASR transcripts, respectively.
Experiments on three datasets show that ML-LMCL outperforms existing models and achieves new state-of-the-art performance.
arXiv Detail & Related papers (2023-11-19T16:53:35Z)
- Leveraging Large Language Models for Exploiting ASR Uncertainty [16.740712975166407]
Large language models must either rely on off-the-shelf automatic speech recognition systems for transcription, or be equipped with an in-built speech modality.
We tackle the speech-intent classification task, where a high word error rate can limit the LLM's ability to understand the spoken intent.
We propose prompting the LLM with an n-best list of ASR hypotheses instead of only the error-prone 1-best hypothesis.
arXiv Detail & Related papers (2023-09-09T17:02:33Z)
- Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study [0.0]
This paper explores the integration of Large Language Models (LLMs) into Automatic Speech Recognition (ASR) systems.
Our primary focus is to investigate the potential of using an LLM's in-context learning capabilities to enhance the performance of ASR systems.
arXiv Detail & Related papers (2023-07-13T02:31:55Z)
- Multimodal Audio-textual Architecture for Robust Spoken Language Understanding [18.702076738332867]
A multimodal language understanding (MLU) module is proposed to mitigate SLU performance degradation caused by errors in the ASR transcript.
Our model is evaluated on five tasks from three SLU datasets and robustness is tested using ASR transcripts from three ASR engines.
Results show that the proposed approach effectively mitigates the ASR error propagation problem, surpassing pre-trained language model (PLM) performance across all datasets for the academic ASR engine.
arXiv Detail & Related papers (2023-06-12T01:55:53Z)
- Improving Textless Spoken Language Understanding with Discrete Units as Intermediate Target [58.59044226658916]
Spoken Language Understanding (SLU) is a task that aims to extract semantic information from spoken utterances.
We propose to use discrete units as intermediate guidance to improve textless SLU performance.
arXiv Detail & Related papers (2023-05-29T14:00:24Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Pre-training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning [4.327558819000435]
We propose a novel joint textual-phonetic pre-training approach for learning spoken language representations.
Experimental results on spoken language understanding benchmarks, Fluent Speech Commands and SNIPS, show that the proposed approach significantly outperforms strong baseline models.
arXiv Detail & Related papers (2021-04-21T05:19:13Z)
- Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation [15.225080891662675]
Speech comprehension can benefit from inference with massive pre-trained language models.
We experimentally verify our hypothesis that the knowledge could be shared from the top layer of the LM to a fully speech-based module.
arXiv Detail & Related papers (2020-05-17T10:50:15Z)