Two-Pass Low Latency End-to-End Spoken Language Understanding
- URL: http://arxiv.org/abs/2207.06670v1
- Date: Thu, 14 Jul 2022 05:50:16 GMT
- Title: Two-Pass Low Latency End-to-End Spoken Language Understanding
- Authors: Siddhant Arora, Siddharth Dalmia, Xuankai Chang, Brian Yan, Alan
Black, Shinji Watanabe
- Abstract summary: We incorporated language models pre-trained on unlabeled text data inside E2E-SLU frameworks to build strong semantic representations.
We developed a 2-pass SLU system that makes a low-latency prediction in the first pass using acoustic information from the first few seconds of audio.
Our code and models are publicly available as part of the ESPnet-SLU toolkit.
- Score: 36.81762807197944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end (E2E) models are becoming increasingly popular for spoken language
understanding (SLU) systems and are beginning to achieve performance
competitive with pipeline-based approaches. However, recent work has shown
that these models struggle to generalize to new phrasings of the same intent,
indicating that they do not understand the semantic content of the given
utterance. In this work, we incorporated language models pre-trained on
unlabeled text data inside E2E-SLU frameworks to build strong semantic
representations. Incorporating both semantic and acoustic information can
increase the inference time, leading to high latency when deployed for
applications like voice assistants. We developed a 2-pass SLU system that
makes a low-latency prediction in the first pass using acoustic information
from the first few seconds of audio, and makes a higher-quality prediction in
the second pass by combining semantic and acoustic representations. We take
inspiration from prior work on 2-pass end-to-end speech recognition systems
that attend to both the audio and the first-pass hypothesis using a
deliberation network. The proposed
2-pass SLU system outperforms the acoustic-based SLU model on the Fluent Speech
Commands Challenge Set and SLURP dataset and reduces latency, thus improving
user experience. Our code and models are publicly available as part of the
ESPnet-SLU toolkit.
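For intuition, here is a minimal, hypothetical sketch of the two-pass flow the abstract describes: a fast first pass predicts an intent from only the opening seconds of audio, and a second pass refines it by attending over both the cached acoustic encodings and the first-pass hypothesis, deliberation-style. All module and function names here (TwoPassSLU, first_pass, second_pass, the stand-in embedding for the pre-trained LM) are assumptions for illustration, not the ESPnet-SLU API.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a two-pass SLU model; module names are
# illustrative and do not reflect the actual ESPnet-SLU code.
class TwoPassSLU(nn.Module):
    def __init__(self, n_intents: int, d: int = 256):
        super().__init__()
        # First pass: acoustic-only intent classifier (low latency).
        self.acoustic_encoder = nn.GRU(80, d, batch_first=True)
        self.first_pass_head = nn.Linear(d, n_intents)
        # In the paper a pre-trained LM embeds the first-pass
        # hypothesis; a plain embedding stands in for it here.
        self.hyp_embedding = nn.Embedding(n_intents, d)
        # Deliberation: attend over acoustic states with the
        # hypothesis embedding as the query.
        self.deliberation = nn.MultiheadAttention(d, 4, batch_first=True)
        self.second_pass_head = nn.Linear(d, n_intents)

    def first_pass(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, 80) log-mel features from only the
        # first few seconds of audio.
        enc, _ = self.acoustic_encoder(feats)
        self._enc = enc                           # cache for pass 2
        return self.first_pass_head(enc[:, -1])   # early intent logits

    def second_pass(self, first_pass_logits: torch.Tensor) -> torch.Tensor:
        # Combine cached acoustic encodings with the first-pass
        # hypothesis to produce a higher-quality prediction.
        hyp = first_pass_logits.argmax(dim=-1)
        query = self.hyp_embedding(hyp).unsqueeze(1)   # (batch, 1, d)
        fused, _ = self.deliberation(query, self._enc, self._enc)
        return self.second_pass_head(fused.squeeze(1))

model = TwoPassSLU(n_intents=31)
partial = torch.randn(1, 200, 80)    # ~2 s of 10 ms frames
early = model.first_pass(partial)    # low-latency first-pass guess
final = model.second_pass(early)     # refined second-pass prediction
```

The design point is that first_pass can fire as soon as a short audio prefix is available, while second_pass reuses the cached encodings, so the refinement adds little extra compute on top of the early prediction.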
Related papers
- Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation [72.7915031238824]
Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks.
They often suffer from common issues such as semantic misalignment and poor temporal consistency.
We propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio.
arXiv Detail & Related papers (2023-05-29T10:41:28Z)
- End-to-end spoken language understanding using joint CTC loss and self-supervised, pretrained acoustic encoders [13.722028186368737]
We leverage self-supervised acoustic encoders fine-tuned with Connectionist Temporal Classification to extract textual embeddings.
Our model achieves 4% absolute improvement over the state-of-the-art (SOTA) dialogue act classification model on the DSTC2 dataset.
arXiv Detail & Related papers (2023-05-04T15:36:37Z)
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and then predicts discrete acoustic units.
We enhance model performance with subword prediction in the first-pass decoder.
We show that the proposed methods boost performance even when predicting a spectrogram in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- End-to-End Spoken Language Understanding: Performance analyses of a voice command task in a low resource setting [0.3867363075280543]
We present a study identifying the signal features and other linguistic properties used by an E2E model to perform the Spoken Language Understanding task.
The study is carried out in the application domain of a smart home that has to handle non-English (here French) voice commands.
arXiv Detail & Related papers (2022-07-17T13:51:56Z)
- STOP: A dataset for Spoken Task Oriented Semantic Parsing [66.14615249745448]
End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model.
We release the Spoken Task-Oriented semantic Parsing (STOP) dataset, the largest and most complex publicly available SLU dataset.
In addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems.
arXiv Detail & Related papers (2022-06-29T00:36:34Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Multi-task RNN-T with Semantic Decoder for Streamable Spoken Language Understanding [16.381644007368763]
End-to-end Spoken Language Understanding (E2E SLU) has attracted increasing interest due to its advantages of joint optimization and low latency.
We propose a streamable multi-task semantic transducer model to address these considerations.
Our proposed architecture predicts ASR and NLU labels auto-regressively and uses a semantic decoder to ingest both previously predicted word-pieces and slot tags (see the sketch after this list).
arXiv Detail & Related papers (2022-04-01T16:38:56Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
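As referenced in the multi-task semantic transducer entry above, here is a minimal, hypothetical sketch of a decoder step that auto-regressively ingests both the previously predicted word-piece and slot tag. The class and head names (SemanticDecoderStep, wp_head, slot_head) are assumptions for illustration, not that paper's implementation.

```python
import torch
import torch.nn as nn

# Hypothetical semantic decoder step: conditions on both previously
# predicted word-pieces (ASR) and slot tags (NLU). Illustrative only.
class SemanticDecoderStep(nn.Module):
    def __init__(self, n_wordpieces: int, n_slots: int, d: int = 256):
        super().__init__()
        self.wp_emb = nn.Embedding(n_wordpieces, d)
        self.slot_emb = nn.Embedding(n_slots, d)
        self.rnn = nn.LSTMCell(2 * d, d)
        self.wp_head = nn.Linear(d, n_wordpieces)   # next ASR label
        self.slot_head = nn.Linear(d, n_slots)      # next NLU label

    def forward(self, prev_wp, prev_slot, state):
        # Ingest both label streams from the previous step.
        x = torch.cat([self.wp_emb(prev_wp), self.slot_emb(prev_slot)], dim=-1)
        h, c = self.rnn(x, state)
        return self.wp_head(h), self.slot_head(h), (h, c)

step = SemanticDecoderStep(n_wordpieces=5000, n_slots=64)
h = torch.zeros(1, 256); c = torch.zeros(1, 256)
wp = torch.zeros(1, dtype=torch.long)     # start-of-sequence ids
slot = torch.zeros(1, dtype=torch.long)
wp_logits, slot_logits, (h, c) = step(wp, slot, (h, c))
```

Because each step consumes only the previous labels and a recurrent state, the decoder can run frame-synchronously inside a transducer, which is what makes the model streamable.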