Deliberation Model for On-Device Spoken Language Understanding
- URL: http://arxiv.org/abs/2204.01893v1
- Date: Mon, 4 Apr 2022 23:48:01 GMT
- Title: Deliberation Model for On-Device Spoken Language Understanding
- Authors: Duc Le, Akshat Shrivastava, Paden Tomasello, Suyoun Kim, Aleksandr
Livshits, Ozlem Kalinli, Michael L. Seltzer
- Abstract summary: We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural-speech to synthetic-speech training.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel deliberation-based approach to end-to-end (E2E) spoken
language understanding (SLU), where a streaming automatic speech recognition
(ASR) model produces the first-pass hypothesis and a second-pass natural
language understanding (NLU) component generates the semantic parse by
conditioning on both ASR's text and audio embeddings. By formulating E2E SLU as
a generalized decoder, our system is able to support complex compositional
semantic structures. Furthermore, the sharing of parameters between ASR and NLU
makes the system especially suitable for resource-constrained (on-device)
environments; our proposed approach consistently outperforms strong pipeline
NLU baselines by 0.82% to 1.34% across various operating points on the spoken
version of the TOPv2 dataset. We demonstrate that the fusion of text and audio
features, coupled with the system's ability to rewrite the first-pass
hypothesis, makes our approach more robust to ASR errors. Finally, we show that
our approach can significantly reduce the degradation when moving from natural
speech to synthetic speech training, but more work is required to make
text-to-speech (TTS) a viable solution for scaling up E2E SLU.
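The architecture described above lends itself to a compact sketch: a first-pass encoder consumes audio, the first-pass ASR hypothesis is embedded as text, and a second-pass decoder cross-attends to both streams while emitting the semantic parse. Below is a minimal PyTorch sketch of this two-pass idea; all module choices, names, and dimensions are illustrative assumptions, not the paper's actual implementation.
```python
import torch
import torch.nn as nn

class DeliberationSLU(nn.Module):
    """Two-pass sketch: streaming ASR encoder + deliberation NLU decoder."""

    def __init__(self, vocab_size=1000, parse_vocab_size=128, d_model=256):
        super().__init__()
        # First pass: a streaming-friendly audio encoder
        # (a unidirectional LSTM stands in for the real ASR encoder).
        self.audio_encoder = nn.LSTM(80, d_model, batch_first=True)
        # Embeddings for the first-pass ASR hypothesis tokens.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Second pass: a decoder that cross-attends to the fused
        # audio + text memory and emits semantic-parse tokens
        # (intents, slot labels, and brackets for compositional parses).
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.parse_embed = nn.Embedding(parse_vocab_size, d_model)
        self.out = nn.Linear(d_model, parse_vocab_size)

    def forward(self, audio_feats, asr_tokens, parse_tokens):
        # audio_feats:  (B, T_audio, 80) e.g. log-mel features
        # asr_tokens:   (B, T_text)      first-pass hypothesis token ids
        # parse_tokens: (B, T_parse)     shifted target parse token ids
        audio_emb, _ = self.audio_encoder(audio_feats)
        text_emb = self.text_embed(asr_tokens)
        # Fuse by concatenating along time, so cross-attention can consult
        # either modality; attending back to the audio is what lets the
        # second pass rewrite first-pass ASR errors.
        memory = torch.cat([audio_emb, text_emb], dim=1)
        tgt = self.parse_embed(parse_tokens)
        t = parse_tokens.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        dec = self.decoder(tgt, memory, tgt_mask=causal)
        return self.out(dec)  # (B, T_parse, parse_vocab_size)

model = DeliberationSLU()
logits = model(torch.randn(2, 50, 80),
               torch.randint(0, 1000, (2, 12)),
               torch.randint(0, 128, (2, 9)))
print(logits.shape)  # torch.Size([2, 9, 128])
```
Concatenating the two memories along the time axis is one simple fusion choice; it leaves the decoder free to consult the audio directly when the first-pass text is wrong, which is the mechanism the abstract credits for robustness to ASR errors.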
Related papers
- Towards ASR Robust Spoken Language Understanding Through In-Context
Learning With Word Confusion Networks [68.79880423713597]
We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis.
Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts.
arXiv Detail & Related papers (2024-01-05T17:58:10Z)
- Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding [18.616202196061966]
End-to-end (E2E) spoken language understanding (SLU) systems that generate a semantic parse from speech have recently become more promising.
This approach uses a single model that leverages audio and text representations from pre-trained automatic speech recognition (ASR) models.
We propose a novel E2E SLU system that enhances robustness to ASR errors by fusing audio and text representations based on the estimated modality confidence of ASR hypotheses.
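As a rough illustration of that fusion idea, the sketch below gates between pooled audio and text embeddings with a learned confidence score; the module name, gating network, and dimensions are assumptions for illustration, not the cited paper's architecture.
```python
import torch
import torch.nn as nn

class ConfidenceGatedFusion(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        # Predicts how much to trust the (possibly erroneous) ASR text.
        self.confidence = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, audio_vec, text_vec):
        # audio_vec, text_vec: (B, d_model) pooled utterance embeddings.
        c = self.confidence(torch.cat([audio_vec, text_vec], dim=-1))
        # Low confidence in the ASR hypothesis shifts weight to audio.
        return c * text_vec + (1 - c) * audio_vec

fusion = ConfidenceGatedFusion()
fused = fusion(torch.randn(2, 256), torch.randn(2, 256))
print(fused.shape)  # torch.Size([2, 256])
```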
arXiv Detail & Related papers (2023-07-22T17:47:31Z)
- End-to-End Spoken Language Understanding: Performance analyses of a voice command task in a low resource setting [0.3867363075280543]
We present a study identifying the signal features and other linguistic properties used by an E2E model to perform the Spoken Language Understanding task.
The study is carried out in the application domain of a smart home that has to handle non-English (here French) voice commands.
arXiv Detail & Related papers (2022-07-17T13:51:56Z)
- Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
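A minimal sketch of such multi-hypothesis fusion follows, assuming each N-best hypothesis has already been encoded as a vector; the class name and sizes are illustrative, not the cited model's architecture.
```python
import torch
import torch.nn as nn

class MultiHypothesisAttention(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4,
                                          batch_first=True)

    def forward(self, query, hyp_encodings):
        # query:         (B, 1, d) summarizer/decoder state
        # hyp_encodings: (B, N, d) one vector per N-best hypothesis
        # Attention lets the model down-weight hypotheses that look
        # erroneous instead of committing to the 1-best transcript.
        fused, weights = self.attn(query, hyp_encodings, hyp_encodings)
        return fused, weights

m = MultiHypothesisAttention()
out, w = m(torch.randn(2, 1, 256), torch.randn(2, 5, 256))
print(out.shape, w.shape)  # torch.Size([2, 1, 256]) torch.Size([2, 1, 5])
```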
arXiv Detail & Related papers (2021-11-16T03:00:29Z)
- End-to-End Spoken Language Understanding for Generalized Voice Assistants [15.241812584273886]
We present our approach to developing an E2E model for generalized speech recognition in commercial voice assistants (VAs).
We propose a fully differentiable, transformer-based, hierarchical system that can be pretrained at both the ASR and NLU levels.
This is then fine-tuned on both transcription and semantic classification losses to handle a diverse set of intent and argument combinations.
arXiv Detail & Related papers (2021-06-16T17:56:47Z)
- Pre-training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning [4.327558819000435]
We propose a novel joint textual-phonetic pre-training approach for learning spoken language representations.
Experimental results on spoken language understanding benchmarks, Fluent Speech Commands and SNIPS, show that the proposed approach significantly outperforms strong baseline models.
arXiv Detail & Related papers (2021-04-21T05:19:13Z)
- RNN Transducer Models For Spoken Language Understanding [49.07149742835825]
We show how RNN-T SLU models can be developed starting from pre-trained automatic speech recognition systems.
In settings where real audio data is not available, artificially synthesized speech is used to successfully adapt various SLU models.
arXiv Detail & Related papers (2021-04-08T15:35:22Z)
- Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)
- Speech To Semantics: Improve ASR and NLU Jointly via All-Neural Interfaces [17.030832205343195]
We consider the spoken language understanding (SLU) problem of extracting natural language intents from speech directed at voice assistants.
An end-to-end joint SLU model can be built to a required specification, opening up the opportunity to deploy in hardware-constrained scenarios.
We show that the jointly trained model improves ASR by incorporating semantic information from NLU, and also improves NLU by exposing it to ASR confusion encoded in the hidden layer.
arXiv Detail & Related papers (2020-08-14T02:43:57Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained on small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)