Speech To Semantics: Improve ASR and NLU Jointly via All-Neural
Interfaces
- URL: http://arxiv.org/abs/2008.06173v1
- Date: Fri, 14 Aug 2020 02:43:57 GMT
- Title: Speech To Semantics: Improve ASR and NLU Jointly via All-Neural
Interfaces
- Authors: Milind Rao, Anirudh Raju, Pranav Dheram, Bach Bui, Ariya Rastrow
- Abstract summary: We consider the spoken language understanding (SLU) problem of extracting natural language intents from speech directed at voice assistants.
An end-to-end joint SLU model can be built to a required specification, opening up the opportunity to deploy in hardware-constrained scenarios.
We show that the jointly trained model improves ASR by incorporating semantic information from NLU, and improves NLU by exposing it to ASR confusion encoded in the hidden layer.
- Score: 17.030832205343195
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the spoken language understanding (SLU) problem of
extracting natural language intents and associated slot arguments or named
entities from speech that is primarily directed at voice assistants. Such a
system subsumes both automatic speech recognition (ASR) and natural language
understanding (NLU). An end-to-end joint SLU model can be built to a required
specification, opening up the opportunity to deploy in hardware-constrained
scenarios such as on-device, enabling voice assistants to work offline in a
privacy-preserving manner while also reducing server costs.
We first present models that extract utterance intent directly from speech
without intermediate text output. We then present a compositional model, which
generates the transcript using the Listen, Attend and Spell (LAS) ASR system
and then extracts the interpretation using a neural NLU model. Finally, we
contrast these methods with a jointly trained end-to-end SLU model, consisting
of ASR and NLU subsystems connected by a neural-network-based interface instead
of text, that produces transcripts as well as the NLU interpretation. We show
that the jointly trained model improves ASR by incorporating semantic
information from NLU, and improves NLU by exposing it to ASR confusion encoded
in the hidden layer.
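The neural-interface idea in the abstract can be sketched minimally: instead of passing only the ASR system's discrete 1-best transcript to the NLU model (the compositional path), the ASR hidden states are fed to NLU directly, so recognition uncertainty survives the handoff. The toy NumPy sketch below contrasts the two paths; all weights and module names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["play", "stop", "music"]
INTENTS = ["PlayMusic", "StopMusic"]

# Toy "ASR": maps audio frames to per-step hidden states and token posteriors.
W_asr = rng.normal(size=(4, 8))           # audio features -> hidden
W_tok = rng.normal(size=(8, len(VOCAB)))  # hidden -> token logits

def asr(audio):  # audio: (T, 4)
    hidden = np.tanh(audio @ W_asr)       # (T, 8)
    logits = hidden @ W_tok
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    return hidden, probs

# Compositional path: NLU sees only the discrete 1-best transcript.
def nlu_from_text(tokens, W_txt):
    onehots = np.stack([np.eye(len(VOCAB))[VOCAB.index(t)] for t in tokens])
    return onehots.mean(0) @ W_txt        # intent logits

# Joint path: NLU consumes ASR hidden states; token confusion is preserved.
def nlu_from_hidden(hidden, W_hid):
    return hidden.mean(0) @ W_hid         # intent logits

audio = rng.normal(size=(5, 4))
hidden, probs = asr(audio)
transcript = [VOCAB[i] for i in probs.argmax(-1)]

W_txt = rng.normal(size=(len(VOCAB), len(INTENTS)))
W_hid = rng.normal(size=(8, len(INTENTS)))

text_logits = nlu_from_text(transcript, W_txt)
joint_logits = nlu_from_hidden(hidden, W_hid)
print("1-best transcript:", transcript)
print("intent (compositional):", INTENTS[int(text_logits.argmax())])
print("intent (joint, hidden-state interface):", INTENTS[int(joint_logits.argmax())])
```

Because both subsystems are differentiable end to end in the joint path, an NLU loss can backpropagate through the interface into the ASR encoder, which is what lets the two tasks improve each other.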
Related papers
- Towards ASR Robust Spoken Language Understanding Through In-Context
Learning With Word Confusion Networks [68.79880423713597]
We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis.
Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts.
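The lattice idea above can be sketched concretely: instead of prompting an LLM with only the 1-best transcript, each slot of a word confusion network contributes its scored alternatives to the prompt. The helper below is a hypothetical illustration of that formatting step, not the paper's code; the prompt wording and posterior values are made up.

```python
# Sketch: format a word confusion network (WCN) into an LLM prompt.
# Each slot holds (word, posterior) alternatives from the ASR lattice.
wcn = [
    [("play", 0.6), ("pray", 0.3), ("prey", 0.1)],
    [("some", 0.9), ("sum", 0.1)],
    [("music", 0.95), ("muzak", 0.05)],
]

def one_best(wcn):
    # Baseline: keep only the top hypothesis per slot.
    return " ".join(max(slot, key=lambda wp: wp[1])[0] for slot in wcn)

def wcn_prompt(wcn):
    # Richer input: expose all alternatives and their posteriors.
    slots = " | ".join(
        " / ".join(f"{w} ({p:.2f})" for w, p in slot) for slot in wcn
    )
    return f"ASR alternatives per position: {slots}\nWhat is the user's intent?"

print(one_best(wcn))   # -> "play some music"
print(wcn_prompt(wcn))
```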
arXiv Detail & Related papers (2024-01-05T17:58:10Z)
- Multimodal Audio-textual Architecture for Robust Spoken Language
Understanding [18.702076738332867]
A multimodal language understanding (MLU) module is proposed to mitigate SLU performance degradation caused by errors in the ASR transcript.
Our model is evaluated on five tasks from three SLU datasets and robustness is tested using ASR transcripts from three ASR engines.
Results show that the proposed approach effectively mitigates the ASR error propagation problem, surpassing the PLM models' performance across all datasets for the academic ASR engine.
arXiv Detail & Related papers (2023-06-12T01:55:53Z)
- Improving Textless Spoken Language Understanding with Discrete Units as
Intermediate Target [58.59044226658916]
Spoken Language Understanding (SLU) is a task that aims to extract semantic information from spoken utterances.
We propose to use discrete units as intermediate guidance to improve textless SLU performance.
arXiv Detail & Related papers (2023-05-29T14:00:24Z)
- Bridging Speech and Textual Pre-trained Models with Unsupervised ASR [70.61449720963235]
This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models.
We show that unsupervised automatic speech recognition (ASR) can improve the representations from speech self-supervised models.
Notably, on spoken question answering, we reach the state-of-the-art result over the challenging NMSQA benchmark.
arXiv Detail & Related papers (2022-11-06T04:50:37Z)
- STOP: A dataset for Spoken Task Oriented Semantic Parsing [66.14615249745448]
End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model.
We release the Spoken Task-Oriented semantic Parsing (STOP) dataset, the largest and most complex SLU dataset to be publicly available.
In addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems.
arXiv Detail & Related papers (2022-06-29T00:36:34Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- End-to-End Spoken Language Understanding using RNN-Transducer ASR [14.267028645397266]
We propose an end-to-end trained spoken language understanding (SLU) system that extracts transcripts, intents and slots from an input speech utterance.
It consists of a streaming recurrent neural network transducer (RNNT) based automatic speech recognition (ASR) model connected to a neural natural language understanding (NLU) model through a neural interface.
arXiv Detail & Related papers (2021-06-30T09:20:32Z)
- Do as I mean, not as I say: Sequence Loss Training for Spoken Language
Understanding [22.652754839140744]
Spoken language understanding (SLU) systems extract transcriptions, as well as semantics of intent or named entities from speech.
We propose non-differentiable sequence losses based on SLU metrics as a proxy for semantic error and use the REINFORCE trick to train ASR and SLU models with this loss.
We show that custom sequence loss training achieves state-of-the-art results on open SLU datasets and leads to a 6% relative improvement in both ASR and NLU performance metrics.
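The REINFORCE trick referenced above can be sketched for a toy one-step policy: sample a hypothesis from the model's distribution, score it with a non-differentiable metric, and scale the log-probability gradient by that reward. The action set, reward, and learning rate below are illustrative stand-ins, not the paper's actual SLU metric or training setup.

```python
import numpy as np

rng = np.random.default_rng(1)
ACTIONS = ["PlayMusic", "StopMusic", "SetTimer"]
TARGET = "PlayMusic"
theta = np.zeros(len(ACTIONS))  # logits of a one-step "SLU policy"

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def slu_reward(hypothesis):
    # Non-differentiable sequence-metric stand-in: 1 if the semantics match.
    return 1.0 if hypothesis == TARGET else 0.0

for _ in range(500):
    probs = softmax(theta)
    a = rng.choice(len(ACTIONS), p=probs)   # sample a hypothesis
    r = slu_reward(ACTIONS[a])              # score with the metric
    grad_logp = -probs                      # d log p(a) / d theta
    grad_logp[a] += 1.0
    theta += 0.5 * r * grad_logp            # REINFORCE update

print("learned distribution:", dict(zip(ACTIONS, softmax(theta).round(3))))
```

The gradient of the expected reward never needs the metric itself to be differentiable; only the log-probability of the sampled hypothesis is differentiated, which is what makes sequence losses such as word or semantic error rate usable as training signals.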
arXiv Detail & Related papers (2021-02-12T20:09:08Z)
- Semi-Supervised Spoken Language Understanding via Self-Supervised Speech
and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.