Improving Spoken Language Understanding By Exploiting ASR N-best Hypotheses
- URL: http://arxiv.org/abs/2001.05284v1
- Date: Sat, 11 Jan 2020 05:48:52 GMT
- Title: Improving Spoken Language Understanding By Exploiting ASR N-best Hypotheses
- Authors: Mingda Li, Weitong Ruan, Xinyue Liu, Luca Soldaini, Wael Hamza, Chengwei Su
- Abstract summary: The natural language understanding (NLU) module takes the interpretations of an utterance produced by the automatic speech recognition (ASR) module as its input.
The ASR module might misrecognize some utterances, so the first-best interpretation can be erroneous and noisy.
We introduce a series of simple yet efficient models for improving the understanding of the semantics of input utterances.
- Score: 22.332683746361294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In a modern spoken language understanding (SLU) system, the natural language understanding (NLU) module takes the interpretations of an utterance produced by the automatic speech recognition (ASR) module as its input. The NLU module usually uses only the first-best interpretation of a given utterance in downstream tasks such as domain and intent classification. However, the ASR module might misrecognize some utterances, and the first-best interpretation can then be erroneous and noisy. Relying solely on the first-best interpretation can therefore make the performance of downstream tasks suboptimal. To address this issue, we introduce a series of simple yet efficient models that improve the understanding of the semantics of input utterances by collectively exploiting the n-best interpretations produced by the ASR module.
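As a rough illustration of this idea, here is a minimal PyTorch sketch that encodes each hypothesis in an n-best list with a shared encoder and average-pools the hypothesis embeddings before intent classification. The architecture, dimensions, and pooling choice are illustrative assumptions, not necessarily the paper's exact models.

```python
import torch
import torch.nn as nn

class NBestPoolingClassifier(nn.Module):
    """Encode each ASR hypothesis with a shared encoder, then average-pool
    the hypothesis embeddings before intent classification (toy sketch)."""

    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128, num_intents=10):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_intents)

    def forward(self, token_ids):
        # token_ids: (batch, n_best, seq_len), one padded token row per hypothesis
        b, n, t = token_ids.shape
        flat = token_ids.reshape(b * n, t)
        _, h = self.encoder(self.embedding(flat))   # h: (1, b*n, hidden_dim)
        hyp_vecs = h.squeeze(0).reshape(b, n, -1)   # (batch, n_best, hidden_dim)
        pooled = hyp_vecs.mean(dim=1)               # average over the n-best list
        return self.classifier(pooled)              # (batch, num_intents)

# Toy usage: a batch of 2 utterances, 5 hypotheses each, 12 tokens per hypothesis.
model = NBestPoolingClassifier()
logits = model(torch.randint(0, 1000, (2, 5, 12)))
print(logits.shape)  # torch.Size([2, 10])
```

Average pooling is just one way to combine the hypotheses; replacing the mean with a max over the n-best axis gives a max-pooling variant of the same sketch.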
Related papers
- DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment [82.86363991170546]
We propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities.
Our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks.
These findings highlight the potential to reshape instruction-following SLMs by incorporating descriptive, rich speech captions.
arXiv Detail & Related papers (2024-06-27T03:52:35Z)
- Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks [68.79880423713597]
We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis.
Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts.
arXiv Detail & Related papers (2024-01-05T17:58:10Z)
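To make the idea above concrete, here is a minimal, hypothetical sketch of serializing an n-best list into an in-context-learning prompt for intent classification. The prompt wording, the separator, and the build_icl_prompt helper are assumptions for illustration; the paper itself works with richer lattice/word-confusion-network output rather than this plain format.

```python
def build_icl_prompt(examples, nbest):
    """examples: list of (nbest_list, intent) demonstrations;
    nbest: the n-best list for the query utterance."""
    lines = ["Classify the speaker's intent given several ASR hypotheses."]
    for hyps, intent in examples:
        lines.append("Hypotheses: " + " | ".join(hyps))
        lines.append(f"Intent: {intent}")
    lines.append("Hypotheses: " + " | ".join(nbest))
    lines.append("Intent:")
    return "\n".join(lines)

# Toy usage: one demonstration, then a noisy query n-best list.
print(build_icl_prompt(
    [(["play jazz music", "play just music"], "PlayMusic")],
    ["turn of the lights", "turn off the lights", "turn of delights"],
))
```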
- Instruction-Following Speech Recognition [21.591086644665197]
We introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions.
Remarkably, our model, trained from scratch on Librispeech, interprets and executes simple instructions without requiring Large Language Models or pre-trained speech modules.
arXiv Detail & Related papers (2023-09-18T14:59:10Z)
- Bridging Speech and Textual Pre-trained Models with Unsupervised ASR [70.61449720963235]
This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models.
We show that unsupervised automatic speech recognition (ASR) can improve the representations from speech self-supervised models.
Notably, on spoken question answering, we achieve the state-of-the-art result on the challenging NMSQA benchmark.
arXiv Detail & Related papers (2022-11-06T04:50:37Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Pre-training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning [4.327558819000435]
We propose a novel joint textual-phonetic pre-training approach for learning spoken language representations.
Experimental results on spoken language understanding benchmarks, Fluent Speech Commands and SNIPS, show that the proposed approach significantly outperforms strong baseline models.
arXiv Detail & Related papers (2021-04-21T05:19:13Z)
- Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Speech To Semantics: Improve ASR and NLU Jointly via All-Neural Interfaces [17.030832205343195]
We consider the spoken language understanding (SLU) problem of extracting natural language intents from speech directed at voice assistants.
An end-to-end joint SLU model can be built to a required specification, opening up the opportunity to deploy in hardware-constrained scenarios.
We show that the jointly trained model improves ASR by incorporating semantic information from NLU, and also improves NLU by exposing it to the ASR confusion encoded in the hidden layer.
arXiv Detail & Related papers (2020-08-14T02:43:57Z)
- Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation [15.225080891662675]
Speech comprehension can benefit from inference with massive pre-trained language models.
We experimentally verify our hypothesis that knowledge can be shared from the top layer of the LM to a fully speech-based module.
arXiv Detail & Related papers (2020-05-17T10:50:15Z)
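As a sketch of what such cross-modal distillation can look like, the following PyTorch snippet trains a small speech encoder to match a frozen text LM's top-layer utterance embedding under an MSE loss. The SpeechStudent module, its dimensions, and the loss choice are illustrative assumptions, not the paper's actual method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechStudent(nn.Module):
    """Maps acoustic features to a vector in the text LM's embedding space."""

    def __init__(self, feat_dim=80, hidden_dim=256, lm_dim=768):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, lm_dim)  # project into the LM's space

    def forward(self, feats):               # feats: (batch, frames, feat_dim)
        _, h = self.rnn(feats)              # h: (1, batch, hidden_dim)
        return self.proj(h.squeeze(0))      # (batch, lm_dim)

def distillation_loss(student_vec, teacher_vec):
    # teacher_vec: the frozen LM's top-layer utterance embedding of the transcript
    return F.mse_loss(student_vec, teacher_vec.detach())

# Toy usage with random tensors standing in for features and LM embeddings.
student = SpeechStudent()
feats = torch.randn(4, 200, 80)        # 4 utterances, 200 frames of 80-dim features
teacher_vec = torch.randn(4, 768)      # would come from the frozen text LM
loss = distillation_loss(student(feats), teacher_vec)
loss.backward()
```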