End-to-End Spoken Language Understanding using RNN-Transducer ASR
- URL: http://arxiv.org/abs/2106.15919v1
- Date: Wed, 30 Jun 2021 09:20:32 GMT
- Title: End-to-End Spoken Language Understanding using RNN-Transducer ASR
- Authors: Anirudh Raju, Gautam Tiwari, Milind Rao, Pranav Dheram, Bryan
Anderson, Zhe Zhang, Bach Bui, Ariya Rastrow
- Abstract summary: We propose an end-to-end trained spoken language understanding (SLU) system that extracts transcripts, intents and slots from an input speech utterance.
It consists of a streaming recurrent neural network transducer (RNNT) based automatic speech recognition (ASR) model connected to a neural natural language understanding (NLU) model through a neural interface.
- Score: 14.267028645397266
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose an end-to-end trained spoken language understanding (SLU) system
that extracts transcripts, intents and slots from an input speech utterance. It
consists of a streaming recurrent neural network transducer (RNNT) based
automatic speech recognition (ASR) model connected to a neural natural language
understanding (NLU) model through a neural interface. This interface allows for
end-to-end training using multi-task RNNT and NLU losses. Additionally, we
introduce semantic sequence loss training for the joint RNNT-NLU system that
allows direct optimization of non-differentiable SLU metrics. This end-to-end
SLU model paradigm can leverage state-of-the-art advancements and pretrained
models in both ASR and NLU research communities, outperforming recently
proposed direct speech-to-semantics models, and conventional pipelined ASR and
NLU systems. We show that this method improves both ASR and NLU metrics on both
public SLU datasets and large proprietary datasets.
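Below is a minimal PyTorch-style sketch of the architecture described in the abstract: a streaming encoder feeds an ASR head and, through a neural interface, an NLU branch, and the two losses are combined for multi-task training. This is not the authors' code; module sizes and names are illustrative assumptions, and a CTC loss stands in for the RNNT loss purely to keep the example short.
```python
# Minimal sketch of a jointly trained ASR -> neural interface -> NLU stack.
# All shapes, names, and the CTC stand-in for the RNNT loss are assumptions.
import torch
import torch.nn as nn

class JointSLU(nn.Module):
    def __init__(self, n_mels=80, hidden=256, vocab=1000, n_intents=20):
        super().__init__()
        # Streaming acoustic encoder (stand-in for the RNNT transcription network).
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
        self.asr_head = nn.Linear(hidden, vocab)          # frame-level token logits
        # Neural interface: projects ASR hidden states into the NLU input space,
        # so gradients from the NLU loss flow back into the ASR encoder.
        self.interface = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh())
        self.nlu_encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.intent_head = nn.Linear(hidden, n_intents)

    def forward(self, feats):
        enc, _ = self.encoder(feats)                      # (B, T, H)
        asr_logits = self.asr_head(enc)                   # fed to the ASR loss
        nlu_in = self.interface(enc)                      # neural interface output
        _, (h, _) = self.nlu_encoder(nlu_in)
        return asr_logits, self.intent_head(h[-1])        # (B, T, vocab), (B, n_intents)

# Multi-task objective: weighted sum of the ASR loss and the NLU loss.
model = JointSLU()
feats = torch.randn(4, 120, 80)                           # batch of log-mel features
tokens = torch.randint(1, 1000, (4, 12))                  # transcript token targets
intents = torch.randint(0, 20, (4,))                      # intent targets
asr_logits, intent_logits = model(feats)
log_probs = asr_logits.log_softmax(-1).transpose(0, 1)    # (T, B, V) layout for CTC
asr_loss = nn.CTCLoss()(log_probs, tokens,
                        torch.full((4,), 120), torch.full((4,), 12))
loss = asr_loss + 0.5 * nn.CrossEntropyLoss()(intent_logits, intents)
loss.backward()
```
The semantic sequence loss mentioned in the abstract is sketched separately under the "Do as I mean, not as I say" entry below.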
Related papers
- Towards ASR Robust Spoken Language Understanding Through In-Context
Learning With Word Confusion Networks [68.79880423713597]
We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis.
Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts.
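As a rough illustration of the idea, the sketch below serializes a small word confusion network (word alternatives with posteriors) into a prompt for in-context intent classification, rather than passing only the 1-best transcript; the prompt format, example words, and intent labels are hypothetical.
```python
# Hypothetical example: turning a word confusion network into an LLM prompt.
wcn = [                                   # one slot per position: (word, posterior)
    [("play", 0.6), ("pray", 0.4)],
    [("some", 0.9), ("sum", 0.1)],
    [("jazz", 0.7), ("chess", 0.3)],
]

def wcn_to_prompt(wcn, intents):
    """Serialize competing words with confidences instead of the top hypothesis."""
    slots = " ".join("/".join(f"{w}({p:.1f})" for w, p in slot) for slot in wcn)
    return ("The ASR output is given as word alternatives with confidences.\n"
            f"Utterance: {slots}\n"
            f"Choose the intent from {intents}:")

print(wcn_to_prompt(wcn, ["PlayMusic", "PlayGame"]))
```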
arXiv Detail & Related papers (2024-01-05T17:58:10Z)
- Token-level Sequence Labeling for Spoken Language Understanding using
Compositional End-to-End Models [94.30953696090758]
We build compositional end-to-end spoken language understanding systems.
By relying on intermediate decoders trained for ASR, our end-to-end systems transform the input modality from speech to token-level representations.
Our models outperform both cascaded and direct end-to-end models on a labeling task of named entity recognition.
arXiv Detail & Related papers (2022-10-27T19:33:18Z)
- Meta Auxiliary Learning for Low-resource Spoken Language Understanding [11.002938634213734]
Spoken language understanding (SLU) treats automatic speech recognition (ASR) and natural language understanding (NLU) as a unified task.
We exploit a joint ASR and NLU training method based on meta auxiliary learning to improve performance on low-resource SLU tasks.
arXiv Detail & Related papers (2022-06-26T03:12:33Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Multi-task RNN-T with Semantic Decoder for Streamable Spoken Language
Understanding [16.381644007368763]
End-to-end Spoken Language Understanding (E2E SLU) has attracted increasing interest due to its advantages of joint optimization and low latency.
We propose a streamable multi-task semantic transducer model to address these considerations.
Our proposed architecture predicts ASR and NLU labels auto-regressively and uses a semantic decoder to ingest both previously predicted word-pieces and slot tags.
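A minimal sketch of one autoregressive step of such a semantic decoder, consuming the previously predicted word-piece and slot tag and emitting logits for the next pair; vocabulary sizes, embedding sizes, and names are assumptions, not the paper's configuration.
```python
import torch
import torch.nn as nn

class SemanticDecoderStep(nn.Module):
    """Illustrative decoder step ingesting the previous word-piece and slot tag."""
    def __init__(self, vocab=1000, n_slots=64, emb=128, hidden=256):
        super().__init__()
        self.wp_emb = nn.Embedding(vocab, emb)       # previous word-piece
        self.slot_emb = nn.Embedding(n_slots, emb)   # previous slot tag
        self.rnn = nn.LSTMCell(2 * emb, hidden)
        self.wp_out = nn.Linear(hidden, vocab)       # next word-piece logits
        self.slot_out = nn.Linear(hidden, n_slots)   # next slot-tag logits

    def forward(self, prev_wp, prev_slot, state):
        x = torch.cat([self.wp_emb(prev_wp), self.slot_emb(prev_slot)], dim=-1)
        h, c = self.rnn(x, state)
        return self.wp_out(h), self.slot_out(h), (h, c)

# One autoregressive step for a batch of two hypotheses.
step = SemanticDecoderStep()
state = (torch.zeros(2, 256), torch.zeros(2, 256))
wp_logits, slot_logits, state = step(torch.tensor([5, 7]), torch.tensor([1, 3]), state)
```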
arXiv Detail & Related papers (2022-04-01T16:38:56Z)
- Speech recognition for air traffic control via feature learning and
end-to-end training [8.755785876395363]
We propose a new automatic speech recognition (ASR) system based on feature learning and an end-to-end training procedure for air traffic control (ATC) systems.
The proposed model integrates a feature learning block, a recurrent neural network (RNN), and a connectionist temporal classification (CTC) loss.
Thanks to the ability to learn representations from raw waveforms, the proposed model can be optimized in a complete end-to-end manner.
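A minimal sketch of the pipeline described above: a learnable feature block over the raw waveform, an RNN, and a CTC objective so the whole stack can be optimized end to end; layer sizes and strides are illustrative assumptions rather than the paper's exact model.
```python
import torch
import torch.nn as nn

class WaveformCTC(nn.Module):
    def __init__(self, vocab=30, hidden=256):
        super().__init__()
        # Feature learning block: strided 1-D convolutions over the raw waveform.
        self.features = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=400, stride=160), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, stride=2), nn.ReLU(),
        )
        self.rnn = nn.LSTM(64, hidden, num_layers=2, batch_first=True,
                           bidirectional=True)
        self.out = nn.Linear(2 * hidden, vocab)

    def forward(self, wav):                       # wav: (B, samples)
        f = self.features(wav.unsqueeze(1))       # (B, 64, T)
        h, _ = self.rnn(f.transpose(1, 2))        # (B, T, 2H)
        return self.out(h).log_softmax(-1)        # per-frame token log-probs

model = WaveformCTC()
wav = torch.randn(2, 16000)                       # one second of 16 kHz audio each
log_probs = model(wav).transpose(0, 1)            # (T, B, V) layout for nn.CTCLoss
targets = torch.randint(1, 30, (2, 8))            # character targets (0 = blank)
loss = nn.CTCLoss()(log_probs, targets,
                    torch.full((2,), log_probs.size(0)), torch.full((2,), 8))
loss.backward()
```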
arXiv Detail & Related papers (2021-11-04T06:38:21Z)
- RNN Transducer Models For Spoken Language Understanding [49.07149742835825]
We show how RNN-T SLU models can be developed starting from pre-trained automatic speech recognition systems.
In settings where real audio data is not available, artificially synthesized speech is used to successfully adapt various SLU models.
arXiv Detail & Related papers (2021-04-08T15:35:22Z)
- Do as I mean, not as I say: Sequence Loss Training for Spoken Language
Understanding [22.652754839140744]
Spoken language understanding (SLU) systems extract transcriptions, as well as semantics of intent or named entities from speech.
We propose non-differentiable sequence losses based on SLU metrics as a proxy for semantic error and use the REINFORCE trick to train ASR and SLU models with these losses.
We show that custom sequence loss training achieves state-of-the-art results on open SLU datasets and yields a 6% relative improvement in both ASR and NLU performance metrics.
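A minimal sketch of the training idea, assuming a simplified single-sample REINFORCE estimator: hypotheses are sampled from the model, scored with a non-differentiable SLU metric, and the resulting reward weights the sampled sequence log-probability. Names and shapes are illustrative, not the paper's implementation.
```python
import torch

def reinforce_sequence_loss(log_probs, sampled_ids, rewards):
    """log_probs: (B, T, V) token log-probabilities; sampled_ids: (B, T) tokens
    sampled from the model; rewards: (B,) non-differentiable metric per hypothesis
    (e.g. 1 - semantic error rate), treated as a constant by autograd."""
    token_logp = log_probs.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)  # (B, T)
    seq_logp = token_logp.sum(dim=1)                                          # (B,)
    advantage = (rewards - rewards.mean()).detach()   # mean baseline for variance reduction
    return -(advantage * seq_logp).mean()             # negative expected reward

# Illustrative usage with random tensors standing in for model outputs.
B, T, V = 4, 10, 50
log_probs = torch.randn(B, T, V, requires_grad=True).log_softmax(-1)
sampled = torch.randint(0, V, (B, T))
rewards = torch.rand(B)
reinforce_sequence_loss(log_probs, sampled, rewards).backward()
```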
arXiv Detail & Related papers (2021-02-12T20:09:08Z)
- Speech To Semantics: Improve ASR and NLU Jointly via All-Neural
Interfaces [17.030832205343195]
We consider the problem of spoken language understanding (SLU): extracting natural language intents from speech directed at voice assistants.
An end-to-end joint SLU model can be built to a required specification, opening up the opportunity to deploy in hardware-constrained scenarios.
We show that the jointly trained model shows improvements to ASR incorporating semantic information from NLU and also improves NLU by exposing it to ASR confusion encoded in the hidden layer.
arXiv Detail & Related papers (2020-08-14T02:43:57Z)
- Progressive Tandem Learning for Pattern Recognition with Deep Spiking
Neural Networks [80.15411508088522]
Spiking neural networks (SNNs) have shown advantages over traditional artificial neural networks (ANNs) for low latency and high computational efficiency.
We propose a novel ANN-to-SNN conversion and layer-wise learning framework for rapid and efficient pattern recognition.
arXiv Detail & Related papers (2020-07-02T15:38:44Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers indexed on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.