FANS: Fusing ASR and NLU for on-device SLU
- URL: http://arxiv.org/abs/2111.00400v1
- Date: Sun, 31 Oct 2021 03:50:19 GMT
- Title: FANS: Fusing ASR and NLU for on-device SLU
- Authors: Martin Radfar, Athanasios Mouchtaris, Siegfried Kunzmann, Ariya
Rastrow
- Abstract summary: Spoken language understanding (SLU) systems translate voice input commands to semantics which are encoded as an intent and pairs of slot tags and values.
Most current SLU systems deploy a cascade of two neural models where the first one maps the input audio to a transcript (ASR) and the second predicts the intent and slots from the transcript (NLU).
We introduce FANS, a new end-to-end SLU model that fuses an ASR audio encoder to a multi-task NLU decoder to infer the intent, slot tags, and slot values directly from a given input audio.
- Score: 16.1861817573118
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spoken language understanding (SLU) systems translate voice input commands to
semantics which are encoded as an intent and pairs of slot tags and values.
Most current SLU systems deploy a cascade of two neural models where the first
one maps the input audio to a transcript (ASR) and the second predicts the
intent and slots from the transcript (NLU). In this paper, we introduce FANS, a
new end-to-end SLU model that fuses an ASR audio encoder to a multi-task NLU
decoder to infer the intent, slot tags, and slot values directly from a given
input audio, obviating the need for transcription. FANS consists of a shared
audio encoder and three decoders, two of which are seq-to-seq decoders that
predict non-null slot tags and slot values in parallel and in an
auto-regressive manner. FANS' neural encoder and decoder architectures are
flexible, which allows us to leverage different combinations of LSTM,
self-attention, and attenders. Our experiments show that, compared to
state-of-the-art end-to-end SLU models, FANS reduces ICER and IRER by 30% and
7% relative, respectively, when tested on an in-house SLU dataset, and by
0.86% and 2% absolute when tested on a public SLU dataset.
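To make the described topology concrete, here is a minimal PyTorch sketch: one shared audio encoder feeding an intent head and two auto-regressive seq-to-seq decoders that emit slot tags and slot values in parallel. All module choices, names, and sizes are illustrative assumptions, not the paper's released implementation; mean pooling stands in for the attender.

```python
import torch
import torch.nn as nn

class FANSSketch(nn.Module):
    """Hedged sketch of the FANS layout: a shared audio encoder plus three
    decoders (intent, slot tags, slot values). Dimensions are made up."""

    def __init__(self, n_mels=80, d=256, n_intents=64, n_tags=50, vocab=1000):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, d, num_layers=2, batch_first=True)
        self.intent_head = nn.Linear(d, n_intents)       # decoder 1: intent
        self.tag_embed = nn.Embedding(n_tags, d)
        self.val_embed = nn.Embedding(vocab, d)
        self.tag_dec = nn.LSTM(d, d, batch_first=True)   # decoder 2: slot tags
        self.val_dec = nn.LSTM(d, d, batch_first=True)   # decoder 3: slot values
        self.tag_out = nn.Linear(d, n_tags)
        self.val_out = nn.Linear(d, vocab)

    def forward(self, audio, prev_tags, prev_vals):
        enc, _ = self.encoder(audio)       # (B, T, d) acoustic states
        ctx = enc.mean(dim=1)              # mean pooling stands in for an attender
        intent_logits = self.intent_head(ctx)
        # The two seq-to-seq decoders run in parallel, teacher-forced on the
        # previously emitted tag/value tokens plus the audio context.
        tag_h, _ = self.tag_dec(self.tag_embed(prev_tags) + ctx.unsqueeze(1))
        val_h, _ = self.val_dec(self.val_embed(prev_vals) + ctx.unsqueeze(1))
        return intent_logits, self.tag_out(tag_h), self.val_out(val_h)

model = FANSSketch()
intent, tags, vals = model(torch.randn(2, 120, 80),          # log-mel frames
                           torch.randint(0, 50, (2, 8)),     # previous slot tags
                           torch.randint(0, 1000, (2, 8)))   # previous slot values
print(intent.shape, tags.shape, vals.shape)  # (2, 64) (2, 8, 50) (2, 8, 1000)
```

Per the abstract's note on flexibility, the LSTM blocks here could just as well be self-attention layers, and a learned attender would replace the mean pooling.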
Related papers
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding (CSE) network addresses limitations of SIMO models by aggregating cross-speaker representations.
The CSE network is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
- Integrating Pretrained ASR and LM to Perform Sequence Generation for Spoken Language Understanding [29.971414483624823]
We propose a three-pass end-to-end (E2E) SLU system that effectively integrates ASR and LM subnetworks into the SLU formulation for sequence generation tasks.
Our proposed three-pass SLU system shows improved performance over cascaded and E2E SLU models on two benchmark SLU datasets.
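As a rough, non-authoritative illustration of the three-pass idea, the sketch below uses stand-in modules (a GRU for the ASR subnetwork, an embedding table for the pretrained LM): pass 1 yields acoustic states and a transcript hypothesis, pass 2 encodes the hypothesis with the LM, and pass 3 conditions semantic generation on both.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins; a real system would use full pretrained subnetworks.
asr_enc = nn.GRU(80, 256, batch_first=True)
lm_embed = nn.Embedding(1000, 256)
fuse = nn.Linear(512, 256)
slu_head = nn.Linear(256, 200)                  # semantic-token vocabulary (assumed)

audio = torch.randn(2, 100, 80)
acoustic, _ = asr_enc(audio)                    # pass 1: acoustic states
hyp = torch.randint(0, 1000, (2, 12))           # pass 1 output: transcript hypothesis
lm_states = lm_embed(hyp)                       # pass 2: LM representation
ctx = torch.cat([acoustic.mean(1), lm_states.mean(1)], dim=-1)
logits = slu_head(torch.tanh(fuse(ctx)))        # pass 3: one semantic generation step
print(logits.shape)                             # (2, 200)
```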
arXiv Detail & Related papers (2023-07-20T16:34:40Z)
- Multimodal Audio-textual Architecture for Robust Spoken Language Understanding [18.702076738332867]
A multimodal language understanding (MLU) module is proposed to mitigate SLU performance degradation caused by errors in the ASR transcript.
Our model is evaluated on five tasks from three SLU datasets and robustness is tested using ASR transcripts from three ASR engines.
Results show that the proposed approach effectively mitigates the ASR error propagation problem, surpassing the PLM models' performance across all datasets for the academic ASR engine.
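One plausible reading of such an MLU module, sketched under assumed sizes, is late fusion of a pooled PLM embedding of the (possibly errorful) ASR transcript with a pooled acoustic embedding:

```python
import torch
import torch.nn as nn

# Assumed inputs: pooled text embedding from a PLM run on the ASR transcript,
# and a pooled acoustic embedding of the same utterance.
text_emb = torch.randn(2, 768)
audio_emb = torch.randn(2, 512)

# Late fusion: concatenate both views so the classifier can fall back on
# acoustics when the transcript contains recognition errors.
fusion = nn.Sequential(nn.Linear(768 + 512, 256), nn.ReLU(), nn.Linear(256, 5))
logits = fusion(torch.cat([text_emb, audio_emb], dim=-1))
print(logits.shape)  # (2, num_task_labels)
```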
arXiv Detail & Related papers (2023-06-12T01:55:53Z)
- End-to-end spoken language understanding using joint CTC loss and self-supervised, pretrained acoustic encoders [13.722028186368737]
We leverage self-supervised acoustic encoders fine-tuned with Connectionist Temporal Classification to extract textual embeddings.
Our model achieves 4% absolute improvement over the state-of-the-art (SOTA) dialogue act classification model on the DSTC2 dataset.
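A minimal sketch of that training signal, with a small GRU standing in for the self-supervised acoustic encoder: a CTC objective over the transcript shapes frame-level embeddings, which are then pooled for dialogue act classification. Apart from the use of CTC itself, every name and size below is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.GRU(80, 256, batch_first=True)   # stand-in for a pretrained encoder
ctc_head = nn.Linear(256, 32)                 # 31 characters + CTC blank (id 0)
act_head = nn.Linear(256, 10)                 # dialogue-act classes (assumed)
ctc = nn.CTCLoss(blank=0)

feats = torch.randn(2, 100, 80)
hidden, _ = encoder(feats)                    # frame-level "textual" embeddings

# CTC fine-tuning on character targets; log-probs must be shaped (T, B, C).
log_probs = ctc_head(hidden).log_softmax(-1).transpose(0, 1)
targets = torch.randint(1, 32, (2, 20))
loss_ctc = ctc(log_probs, targets,
               torch.full((2,), 100, dtype=torch.long),
               torch.full((2,), 20, dtype=torch.long))

# Dialogue act classification from the pooled embeddings.
loss_cls = F.cross_entropy(act_head(hidden.mean(1)), torch.tensor([3, 7]))
loss = loss_ctc + loss_cls
print(float(loss))
```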
arXiv Detail & Related papers (2023-05-04T15:36:37Z)
- Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models [94.30953696090758]
We build compositional end-to-end spoken language understanding systems.
By relying on intermediate decoders trained for ASR, our end-to-end systems transform the input modality from speech to token-level representations.
Our models outperform both cascaded and direct end-to-end models on a labeling task of named entity recognition.
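The compositional idea can be caricatured in a few lines: an intermediate ASR stage turns frames into token-level representations, and a tagger labels each token (e.g., with BIO entity tags). The downsampling and module choices below are placeholders, not the paper's design.

```python
import torch
import torch.nn as nn

speech_enc = nn.GRU(80, 256, batch_first=True)
asr_stage = nn.Linear(256, 256)      # stands in for an intermediate ASR decoder
tagger = nn.Linear(256, 9)           # e.g. 9 BIO tags for named entities

frames = torch.randn(2, 100, 80)
h, _ = speech_enc(frames)
token_reps = asr_stage(h[:, ::10])   # crude frame-to-token downsampling
tag_logits = tagger(token_reps)      # one label per token position
print(tag_logits.shape)              # (2, 10, 9)
```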
arXiv Detail & Related papers (2022-10-27T19:33:18Z)
- STOP: A dataset for Spoken Task Oriented Semantic Parsing [66.14615249745448]
End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model.
We release the Spoken Task-Oriented semantic Parsing (STOP) dataset, the largest and most complex SLU dataset to be publicly available.
In addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems.
arXiv Detail & Related papers (2022-06-29T00:36:34Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Multi-task RNN-T with Semantic Decoder for Streamable Spoken Language Understanding [16.381644007368763]
End-to-end Spoken Language Understanding (E2E SLU) has attracted increasing interest due to its advantages of joint optimization and low latency.
We propose a streamable multi-task semantic transducer model to address these considerations.
Our proposed architecture predicts ASR and NLU labels auto-regressively and uses a semantic decoder to ingest both previously predicted word-pieces and slot tags.
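The summary's key architectural point, a decoder that ingests both the previously predicted word-piece and its slot tag, can be sketched as a transducer-style prediction network; names and sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SemanticPredictionNet(nn.Module):
    """Sketch: embed the previously emitted word-piece and slot tag,
    concatenate, and run one LSTM so a single state stream drives both
    ASR and NLU label prediction. Not the paper's exact module."""

    def __init__(self, vocab=1000, n_tags=50, d=256):
        super().__init__()
        self.wp_embed = nn.Embedding(vocab, d)
        self.tag_embed = nn.Embedding(n_tags, d)
        self.rnn = nn.LSTM(2 * d, d, batch_first=True)

    def forward(self, prev_wordpieces, prev_tags):
        x = torch.cat([self.wp_embed(prev_wordpieces),
                       self.tag_embed(prev_tags)], dim=-1)
        out, _ = self.rnn(x)
        return out

net = SemanticPredictionNet()
h = net(torch.randint(0, 1000, (2, 6)), torch.randint(0, 50, (2, 6)))
print(h.shape)  # torch.Size([2, 6, 256])
```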
arXiv Detail & Related papers (2022-04-01T16:38:56Z)
- RNN Transducer Models For Spoken Language Understanding [49.07149742835825]
We show how RNN-T SLU models can be developed starting from pre-trained automatic speech recognition systems.
In settings where real audio data is not available, artificially synthesized speech is used to successfully adapt various SLU models.
arXiv Detail & Related papers (2021-04-08T15:35:22Z)
- Adaptive Feature Selection for End-to-End Speech Translation [87.07211937607102]
We propose adaptive feature selection (AFS) for encoder-decoder based E2E speech translation (ST).
We first pre-train an ASR encoder and apply AFS to dynamically estimate the importance of each encoded speech feature to ASR.
We take L0DROP as the backbone for AFS, and adapt it to sparsify speech features with respect to both temporal and feature dimensions.
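As a toy stand-in for AFS (the real method uses L0DROP's hard-concrete gates; this sketch substitutes a plain sigmoid gate with a mean-activation penalty, and only sparsifies the temporal dimension):

```python
import torch
import torch.nn as nn

class FeatureGate(nn.Module):
    """Score each encoded speech frame and scale it by a keep-probability;
    penalizing the mean gate pushes uninformative frames toward zero.
    A simplification of L0DROP, which uses hard-concrete gates."""

    def __init__(self, d=256):
        super().__init__()
        self.scorer = nn.Linear(d, 1)

    def forward(self, enc):                      # enc: (B, T, d)
        gate = torch.sigmoid(self.scorer(enc))   # per-frame keep probability
        return enc * gate, gate.mean()           # gated features, sparsity term

enc = torch.randn(2, 100, 256)                   # pretrained ASR encoder output
gated, sparsity = FeatureGate()(enc)
print(gated.shape, float(sparsity))
# Training would optimize st_loss + lambda * sparsity to prune features.
```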
arXiv Detail & Related papers (2020-10-16T17:21:00Z)
- Speech To Semantics: Improve ASR and NLU Jointly via All-Neural Interfaces [17.030832205343195]
We consider the problem of spoken language understanding (SLU): extracting natural language intents from speech directed at voice assistants.
An end-to-end joint SLU model can be built to a required specification, opening up the opportunity to deploy in hardware-constrained scenarios.
We show that the jointly trained model shows improvements to ASR incorporating semantic information from NLU and also improves NLU by exposing it to ASR confusion encoded in the hidden layer.
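One common reading of such joint training, sketched here with assumed sizes rather than the paper's actual modules, is a shared encoder feeding both an ASR head and an NLU head, with the two losses summed so each task shapes the shared hidden layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

shared = nn.GRU(80, 256, batch_first=True)   # shared hidden layer
asr_head = nn.Linear(256, 32)                # character logits (for a CTC loss)
intent_head = nn.Linear(256, 20)             # NLU intent classes (assumed)

audio = torch.randn(2, 100, 80)
h, _ = shared(audio)

# ASR branch: CTC over characters; NLU branch: intent from pooled states.
ctc = nn.CTCLoss(blank=0)
asr_loss = ctc(asr_head(h).log_softmax(-1).transpose(0, 1),
               torch.randint(1, 32, (2, 15)),
               torch.full((2,), 100, dtype=torch.long),
               torch.full((2,), 15, dtype=torch.long))
nlu_loss = F.cross_entropy(intent_head(h.mean(1)), torch.tensor([4, 11]))
loss = asr_loss + nlu_loss                   # joint objective couples the tasks
print(float(loss))
```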
arXiv Detail & Related papers (2020-08-14T02:43:57Z)