Multi-task RNN-T with Semantic Decoder for Streamable Spoken Language Understanding
- URL: http://arxiv.org/abs/2204.00558v1
- Date: Fri, 1 Apr 2022 16:38:56 GMT
- Title: Multi-task RNN-T with Semantic Decoder for Streamable Spoken Language Understanding
- Authors: Xuandi Fu, Feng-Ju Chang, Martin Radfar, Kai Wei, Jing Liu, Grant P. Strimel, Kanthashree Mysore Sathyendra
- Abstract summary: End-to-end Spoken Language Understanding (E2E SLU) has attracted increasing interest due to its advantages of joint optimization and low latency.
We propose a streamable multi-task semantic transducer model to address these considerations.
Our proposed architecture predicts ASR and NLU labels auto-regressively and uses a semantic decoder to ingest both previously predicted word-pieces and slot tags.
- Score: 16.381644007368763
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: End-to-end Spoken Language Understanding (E2E SLU) has attracted increasing
interest due to its advantages of joint optimization and low latency when
compared to traditionally cascaded pipelines. Existing E2E SLU models usually
follow a two-stage configuration where an Automatic Speech Recognition (ASR)
network first predicts a transcript which is then passed to a Natural Language
Understanding (NLU) module through an interface to infer semantic labels, such
as intent and slot tags. This design, however, neither considers the NLU
posterior while making transcript predictions nor corrects NLU prediction
errors immediately by conditioning on the previously predicted word-pieces. In
addition, the NLU model in the two-stage system is not streamable, as it must
wait for the audio segments to complete processing, which ultimately impacts
the latency of the SLU system. In this work, we propose a streamable multi-task
semantic transducer model to address these considerations. Our proposed
architecture predicts ASR and NLU labels auto-regressively and uses a semantic
decoder to ingest both previously predicted word-pieces and slot tags while
aggregating them through a fusion network. Using an industry-scale SLU dataset
and the public Fluent Speech Commands (FSC) dataset, we show that the proposed
model outperforms the two-stage E2E SLU model on both ASR and NLU metrics.
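To make the decoder design concrete, here is a minimal PyTorch sketch of one way such a semantic decoder could ingest both token histories: the previous word-piece and slot tag are embedded, combined by a small fusion network, and fed through a recurrent state that drives separate ASR and NLU output heads. All names and sizes (SemanticDecoder, the concat-plus-linear fusion, 256-dimensional states) are illustrative assumptions, not the paper's actual configuration.

```python
# Hypothetical sketch of a joint semantic decoder for a multi-task
# transducer: fuses word-piece and slot-tag histories, then predicts
# both label streams auto-regressively. Sizes are placeholders.
import torch
import torch.nn as nn

class SemanticDecoder(nn.Module):
    def __init__(self, vocab_size=4000, num_slots=64, dim=256):
        super().__init__()
        self.wp_embed = nn.Embedding(vocab_size, dim)   # word-piece history
        self.slot_embed = nn.Embedding(num_slots, dim)  # slot-tag history
        self.fusion = nn.Linear(2 * dim, dim)           # fusion network
        self.rnn = nn.LSTM(dim, dim, batch_first=True)  # autoregressive state
        self.wp_head = nn.Linear(dim, vocab_size)       # ASR output head
        self.slot_head = nn.Linear(dim, num_slots)      # NLU output head

    def forward(self, prev_wp, prev_slot, state=None):
        # prev_wp, prev_slot: (batch, steps) ids of previously emitted tokens
        fused = torch.tanh(self.fusion(torch.cat(
            [self.wp_embed(prev_wp), self.slot_embed(prev_slot)], dim=-1)))
        out, state = self.rnn(fused, state)
        return self.wp_head(out), self.slot_head(out), state

# One decoding step over a dummy single-token history:
dec = SemanticDecoder()
wp_logits, slot_logits, st = dec(torch.zeros(1, 1, dtype=torch.long),
                                 torch.zeros(1, 1, dtype=torch.long))
```

In a full RNN-T, these decoder states would additionally be combined with acoustic encoder frames through a joint network before the output softmax; that step is omitted here for brevity.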
Related papers
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and then predicts discrete acoustic units.
We enhance model performance with subword prediction in the first-pass decoder.
We show that the proposed methods boost performance even when predicting spectrograms in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models [94.30953696090758]
We build compositional end-to-end spoken language understanding systems.
By relying on intermediate decoders trained for ASR, our end-to-end systems transform the input modality from speech to token-level representations.
Our models outperform both cascaded and direct end-to-end models on a labeling task of named entity recognition.
arXiv Detail & Related papers (2022-10-27T19:33:18Z)
- End-to-End Spoken Language Understanding: Performance analyses of a voice command task in a low resource setting [0.3867363075280543]
We present a study identifying the signal features and other linguistic properties used by an E2E model to perform the Spoken Language Understanding task.
The study is carried out in the application domain of a smart home that has to handle non-English (here French) voice commands.
arXiv Detail & Related papers (2022-07-17T13:51:56Z)
- Two-Pass Low Latency End-to-End Spoken Language Understanding [36.81762807197944]
We incorporate language models pre-trained on unlabeled text data into E2E-SLU frameworks to build strong semantic representations.
We develop a 2-pass SLU system that makes a low-latency prediction using acoustic information from the first few seconds of audio in the first pass (see the sketch after this entry).
Our code and models are publicly available as part of the ESPnet-SLU toolkit.
arXiv Detail & Related papers (2022-07-14T05:50:16Z)
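As a rough illustration of the two-pass idea in the entry above, the sketch below produces an early intent hypothesis from roughly the first second of frames, then a revised prediction once the full utterance and a pooled text-LM vector are available. Every module, name, and size here is a stand-in assumption, not the paper's system.

```python
# Hypothetical two-pass low-latency SLU flow: fast first-pass intent
# from partial audio, refined second pass with full audio plus a
# pretrained-LM feature vector. All components are placeholders.
import torch
import torch.nn as nn

class TwoPassSLU(nn.Module):
    def __init__(self, feat_dim=80, hid=256, num_intents=32):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hid, batch_first=True)
        self.first_pass = nn.Linear(hid, num_intents)        # partial audio only
        self.second_pass = nn.Linear(2 * hid, num_intents)   # full audio + LM vector

    def forward(self, feats, lm_feats):
        # feats: (1, frames, feat_dim) filterbanks; lm_feats: (1, hid) pooled LM vector
        early, _ = self.encoder(feats[:, :100])       # ~first second of frames
        early_logits = self.first_pass(early[:, -1])  # low-latency hypothesis
        full, _ = self.encoder(feats)                 # rerun on the whole utterance
        final_logits = self.second_pass(
            torch.cat([full[:, -1], lm_feats], dim=-1))
        return early_logits, final_logits

early, final = TwoPassSLU()(torch.randn(1, 300, 80), torch.zeros(1, 256))
```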
- STOP: A dataset for Spoken Task Oriented Semantic Parsing [66.14615249745448]
End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model.
We release the Spoken Task-Oriented semantic Parsing (STOP) dataset, the largest and most complex publicly available SLU dataset.
In addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems.
arXiv Detail & Related papers (2022-06-29T00:36:34Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- FANS: Fusing ASR and NLU for on-device SLU [16.1861817573118]
Spoken language understanding (SLU) systems translate voice input commands to semantics which are encoded as an intent and pairs of slot tags and values.
Most current SLU systems deploy a cascade of two neural models, where the first maps the input audio to a transcript (ASR) and the second predicts the intent and slots from the transcript (NLU).
We introduce FANS, a new end-to-end SLU model that fuses an ASR audio encoder to a multi-task NLU decoder to infer the intent, slot tags, and slot values directly from a given input audio.
arXiv Detail & Related papers (2021-10-31T03:50:19Z)
- An Effective Non-Autoregressive Model for Spoken Language Understanding [15.99246711701726]
We propose a novel non-autoregressive Spoken Language Understanding model named Layered-Refine Transformer.
With the proposed Slot Label Generation (SLG) mechanism, the non-autoregressive model can efficiently obtain dependency information during training while spending no extra time in inference.
Experiments on two public datasets indicate that our model significantly improves SLU performance (by 1.5% in overall accuracy) while substantially speeding up inference (by more than 10 times).
arXiv Detail & Related papers (2021-08-16T10:26:57Z)
- End-to-End Spoken Language Understanding using RNN-Transducer ASR [14.267028645397266]
We propose an end-to-end trained spoken language understanding (SLU) system that extracts transcripts, intents and slots from an input speech utterance.
It consists of a streaming recurrent neural network transducer (RNNT) based automatic speech recognition (ASR) model connected to a neural natural language understanding (NLU) model through a neural interface; a minimal sketch of this cascade pattern follows this entry.
arXiv Detail & Related papers (2021-06-30T09:20:32Z)
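For contrast with the joint decoder sketched earlier, here is a minimal sketch of the two-stage neural-interface cascade the entry above describes: the ASR stage's hidden states, rather than a text transcript, are passed to the NLU stage. Module names and sizes are assumptions for illustration only.

```python
# Hypothetical two-stage SLU cascade with a neural interface: the NLU
# network consumes ASR hidden states directly instead of decoded text.
import torch
import torch.nn as nn

class NeuralInterfaceSLU(nn.Module):
    def __init__(self, feat_dim=80, hid=256, num_intents=32):
        super().__init__()
        self.asr_encoder = nn.LSTM(feat_dim, hid, batch_first=True)  # ASR stage
        self.interface = nn.Linear(hid, hid)                         # neural interface
        self.nlu = nn.LSTM(hid, hid, batch_first=True)               # NLU stage
        self.intent_head = nn.Linear(hid, num_intents)

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, feat_dim) filterbank features
        enc, _ = self.asr_encoder(audio_feats)
        nlu_in = torch.relu(self.interface(enc))  # hidden states, not text
        nlu_out, _ = self.nlu(nlu_in)
        return self.intent_head(nlu_out[:, -1])   # intent from the final state

logits = NeuralInterfaceSLU()(torch.randn(1, 120, 80))
```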
- RNN Transducer Models For Spoken Language Understanding [49.07149742835825]
We show how RNN-T SLU models can be developed starting from pre-trained automatic speech recognition systems.
In settings where real audio data is not available, artificially synthesized speech is used to successfully adapt various SLU models.
arXiv Detail & Related papers (2021-04-08T15:35:22Z)
- Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.