End-to-End Spoken Language Understanding for Generalized Voice
Assistants
- URL: http://arxiv.org/abs/2106.09009v1
- Date: Wed, 16 Jun 2021 17:56:47 GMT
- Title: End-to-End Spoken Language Understanding for Generalized Voice
Assistants
- Authors: Michael Saxon, Samridhi Choudhary, Joseph P. McKenna, Athanasios
Mouchtaris
- Abstract summary: We present our approach to developing an E2E model for generalized spoken language understanding (SLU) in commercial voice assistants (VAs).
We propose a fully differentiable, transformer-based, hierarchical system that can be pretrained at both the ASR and NLU levels.
This is then fine-tuned on both transcription and semantic classification losses to handle a diverse set of intent and argument combinations.
- Score: 15.241812584273886
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end (E2E) spoken language understanding (SLU) systems predict
utterance semantics directly from speech using a single model. Previous work in
this area has focused on targeted tasks in fixed domains, where the output
semantic structure is assumed a priori and the input speech is of limited
complexity. In this work we present our approach to developing an E2E model for
generalized SLU in commercial voice assistants (VAs). We propose a fully
differentiable, transformer-based, hierarchical system that can be pretrained
at both the ASR and NLU levels. This is then fine-tuned on both transcription
and semantic classification losses to handle a diverse set of intent and
argument combinations. This leads to an SLU system that achieves significant
improvements over baselines on a complex internal generalized VA dataset with a
43% improvement in accuracy, while still meeting the 99% accuracy benchmark on
the popular Fluent Speech Commands dataset. We further evaluate our model on a
hard test set, exclusively containing slot arguments unseen in training, and
demonstrate a nearly 20% improvement, showing the efficacy of our approach in
truly demanding VA scenarios.
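
No reference implementation accompanies this listing, but the training recipe the abstract describes (ASR-level and NLU-level pretraining, then joint fine-tuning on transcription and semantic classification losses) can be sketched. Below is a minimal PyTorch sketch assuming a CTC transcription loss and cross-entropy intent/slot losses; all module names, dimensions, and loss weights are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class HierarchicalSLU(nn.Module):
    """Hypothetical sketch of a hierarchical E2E SLU model: an ASR-level
    encoder feeds an NLU-level encoder; each stage can be pretrained
    separately, then the whole stack is fine-tuned end to end."""

    def __init__(self, n_mels=80, d_model=256, vocab_size=1000,
                 n_intents=64, n_slots=128):
        super().__init__()
        self.subsample = nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2)
        asr_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.asr_encoder = nn.TransformerEncoder(asr_layer, num_layers=6)
        self.ctc_head = nn.Linear(d_model, vocab_size)     # transcription loss
        nlu_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.nlu_encoder = nn.TransformerEncoder(nlu_layer, num_layers=4)
        self.intent_head = nn.Linear(d_model, n_intents)   # utterance-level intent
        self.slot_head = nn.Linear(d_model, n_slots)       # per-frame slot tags

    def forward(self, feats):                  # feats: (batch, time, n_mels)
        x = self.subsample(feats.transpose(1, 2)).transpose(1, 2)
        h_asr = self.asr_encoder(x)            # differentiable interface to NLU
        h_nlu = self.nlu_encoder(h_asr)
        return (self.ctc_head(h_asr).log_softmax(-1),
                self.intent_head(h_nlu.mean(dim=1)),
                self.slot_head(h_nlu))

# Joint fine-tuning objective: transcription + semantic classification.
# The mixing weights and per-frame slot supervision are assumptions.
def joint_loss(ctc_logp, intent_logits, slot_logits,
               tokens, token_lens, enc_lens, intent, slots,
               w_asr=0.5, w_intent=0.3, w_slot=0.2):
    # enc_lens: frame counts after subsampling; slots: (batch, frames)
    ctc = nn.functional.ctc_loss(ctc_logp.transpose(0, 1), tokens,
                                 enc_lens, token_lens)
    return (w_asr * ctc
            + w_intent * nn.functional.cross_entropy(intent_logits, intent)
            + w_slot * nn.functional.cross_entropy(
                  slot_logits.flatten(0, 1), slots.flatten()))
```

Because the NLU stage consumes the ASR encoder's hidden states rather than discrete transcripts, gradients from the semantic losses flow back through the ASR stage, which is what makes such a hierarchy fully differentiable.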
Related papers
- Modality Confidence Aware Training for Robust End-to-End Spoken Language
Understanding [18.616202196061966]
End-to-end (E2E) spoken language understanding (SLU) systems that generate a semantic parse from speech have recently become more promising.
This approach uses a single model that draws on audio and text representations from pre-trained automatic speech recognition (ASR) models.
We propose a novel E2E SLU system that enhances robustness to ASR errors by fusing audio and text representations based on the estimated modality confidence of ASR hypotheses.
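
The abstract does not specify the fusion mechanism; one minimal sketch of confidence-weighted fusion, with all module names and shapes assumed, might look like this:

```python
import torch
import torch.nn as nn

class ConfidenceGatedFusion(nn.Module):
    """Hypothetical sketch: fuse audio and ASR-text representations,
    weighting the text branch by an estimated ASR confidence so the
    model leans on audio when the hypothesis is likely wrong."""

    def __init__(self, d_model=256):
        super().__init__()
        self.confidence = nn.Sequential(nn.Linear(2 * d_model, 1), nn.Sigmoid())
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, h_audio, h_text):        # both: (batch, d_model)
        c = self.confidence(torch.cat([h_audio, h_text], dim=-1))
        fused = torch.cat([h_audio, c * h_text], dim=-1)
        return self.proj(fused), c             # fused rep + confidence estimate
```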
arXiv Detail & Related papers (2023-07-22T17:47:31Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a pre-trained wav2vec model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of differentiable architecture search (DARTS) named light-DARTS.
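
As an illustration of the first step only (the light-DARTS search itself is out of scope here), extracting a high-level wav2vec 2.0 representation via the Hugging Face transformers API might look as follows; the checkpoint choice is an assumption, not necessarily the paper's:

```python
import torch
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

# Checkpoint chosen for illustration; the paper's exact model may differ.
name = "facebook/wav2vec2-base-960h"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
model = Wav2Vec2Model.from_pretrained(name).eval()

waveform = torch.randn(16000)                  # 1 s of 16 kHz audio (dummy)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    reps = model(**inputs).last_hidden_state   # (1, frames, 768) features
```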
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
- End-to-End Spoken Language Understanding: Performance analyses of a voice command task in a low resource setting [0.3867363075280543]
We present a study identifying the signal features and other linguistic properties used by an E2E model to perform the Spoken Language Understanding task.
The study is carried out in the application domain of a smart home that has to handle non-English (here French) voice commands.
arXiv Detail & Related papers (2022-07-17T13:51:56Z)
- STOP: A dataset for Spoken Task Oriented Semantic Parsing [66.14615249745448]
End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model.
We release the Spoken Task-Oriented semantic Parsing (STOP) dataset, the largest and most complex SLU dataset to be publicly available.
In addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems.
arXiv Detail & Related papers (2022-06-29T00:36:34Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
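
The abstract leaves the architecture details out; a minimal sketch of the general two-pass deliberation idea (first-pass hypothesis embeddings attending over the audio encoding before semantic prediction), with all modules assumed, could be:

```python
import torch
import torch.nn as nn

class DeliberationSLU(nn.Module):
    """Hypothetical two-pass sketch: a deliberation layer lets first-pass
    hypothesis embeddings attend over the audio encoding, so the semantic
    prediction is conditioned on both modalities rather than audio alone."""

    def __init__(self, d_model=256, n_intents=64):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4,
                                                batch_first=True)
        self.intent_head = nn.Linear(d_model, n_intents)

    def forward(self, hyp_emb, audio_enc):
        # hyp_emb: (batch, hyp_len, d); audio_enc: (batch, frames, d)
        delib, _ = self.cross_attn(hyp_emb, audio_enc, audio_enc)
        return self.intent_head(delib.mean(dim=1))
```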
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
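
BPE-dropout itself is straightforward to reproduce: SentencePiece supports it through its sampling interface for BPE models, so each training epoch can see a different subword segmentation of the same transcript. A small sketch, with the model path and dropout rate as placeholders:

```python
import sentencepiece as spm

# Assumes a BPE SentencePiece model trained on the ASR transcripts;
# the path and alpha value here are illustrative, not from the paper.
sp = spm.SentencePieceProcessor(model_file="bpe.model")

def augmented_units(text, alpha=0.1):
    # With a BPE model, enable_sampling applies BPE-dropout: merges are
    # randomly skipped, so each call can yield a different segmentation.
    return sp.encode(text, out_type=str, enable_sampling=True,
                     alpha=alpha, nbest_size=-1)

print(augmented_units("turn on the kitchen lights"))   # varies per call
```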
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- Towards Semi-Supervised Semantics Understanding from Speech [15.672850567147854]
We propose a framework to learn semantics directly from speech, with semi-supervision from transcribed speech, to address these challenges.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT, and fine-tuned on a limited amount of target SLU corpus.
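
A minimal sketch of such a combination (a pretrained ASR encoder plus BERT, with a small head fine-tuned on the limited SLU corpus) is below; the ASR encoder interface, dimensions, and pooling are assumptions:

```python
import torch
import torch.nn as nn
from transformers import BertModel

class AsrBertSLU(nn.Module):
    """Hypothetical sketch: a pretrained E2E ASR encoder supplies acoustic
    features, a pretrained BERT encodes the decoded hypothesis, and a
    small head is fine-tuned on the limited target SLU corpus."""

    def __init__(self, asr_encoder, d_audio=256, n_intents=64):
        super().__init__()
        self.asr_encoder = asr_encoder                    # any pretrained E2E ASR
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(d_audio + self.bert.config.hidden_size, n_intents)

    def forward(self, feats, hyp_ids, hyp_mask):
        h_audio = self.asr_encoder(feats).mean(dim=1)     # (batch, d_audio)
        h_text = self.bert(input_ids=hyp_ids,
                           attention_mask=hyp_mask).pooler_output
        return self.head(torch.cat([h_audio, h_text], dim=-1))
```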
arXiv Detail & Related papers (2020-11-11T01:48:09Z)
- Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
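
A minimal sketch of the joint multi-task setup (one shared encoder over the ASR hypothesis, with a correction head and an LU head trained together) could look like this; every module and dimension here is assumed:

```python
import torch
import torch.nn as nn

class JointCorrectionLU(nn.Module):
    """Hypothetical sketch: one encoder over the ASR hypothesis feeds both
    a token-level correction head and an utterance-level LU head, so the
    two tasks share representations and are trained with a summed loss."""

    def __init__(self, vocab_size=30000, d_model=256, n_intents=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.correct_head = nn.Linear(d_model, vocab_size)  # corrected tokens
        self.intent_head = nn.Linear(d_model, n_intents)    # LU prediction

    def forward(self, hyp_ids):                 # hyp_ids: (batch, seq_len)
        h = self.encoder(self.embed(hyp_ids))
        return self.correct_head(h), self.intent_head(h.mean(dim=1))
```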
arXiv Detail & Related papers (2020-01-28T22:09:25Z)