Integrating Pretrained ASR and LM to Perform Sequence Generation for
Spoken Language Understanding
- URL: http://arxiv.org/abs/2307.11005v1
- Date: Thu, 20 Jul 2023 16:34:40 GMT
- Title: Integrating Pretrained ASR and LM to Perform Sequence Generation for
Spoken Language Understanding
- Authors: Siddhant Arora, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Brian
Yan, Shinji Watanabe
- Abstract summary: We propose a three-pass end-to-end (E2E) SLU system that effectively integrates ASR and LM subnetworks into the SLU formulation for sequence generation tasks.
Our proposed three-pass SLU system shows improved performance over cascaded and E2E SLU models on two benchmark SLU datasets.
- Score: 29.971414483624823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There has been an increased interest in the integration of pretrained speech
recognition (ASR) and language models (LM) into the SLU framework. However,
prior methods often struggle with a vocabulary mismatch between the pretrained
models, and the LM cannot be directly utilized because its formulation diverges
from that of the NLU task. In this study, we propose a three-pass end-to-end (E2E) SLU system
that effectively integrates ASR and LM subnetworks into the SLU formulation for
sequence generation tasks. In the first pass, our architecture predicts ASR
transcripts using the ASR subnetwork. This is followed by the LM subnetwork,
which makes an initial SLU prediction. Finally, in the third pass, the
deliberation subnetwork conditions on representations from the ASR and LM
subnetworks to make the final prediction. Our proposed three-pass SLU system
shows improved performance over cascaded and E2E SLU models on two benchmark
SLU datasets, SLURP and SLUE, especially on acoustically challenging
utterances.
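As a rough illustration of the three-pass flow described above, here is a minimal PyTorch sketch; the module names, greedy decoding, and all dimensions are assumptions for illustration, not the authors' implementation:

```python
# Hypothetical sketch of the three-pass forward pass (not the authors' code).
import torch
import torch.nn as nn

class ThreePassSLU(nn.Module):
    def __init__(self, vocab=1000, d=256):
        super().__init__()
        # Pass 1: ASR subnetwork maps speech features to transcript tokens.
        self.asr_enc = nn.GRU(80, d, batch_first=True)
        self.asr_head = nn.Linear(d, vocab)
        # Pass 2: LM subnetwork makes an initial SLU prediction from tokens.
        self.lm_emb = nn.Embedding(vocab, d)
        self.lm_enc = nn.GRU(d, d, batch_first=True)
        # Pass 3: deliberation subnetwork attends over ASR representations.
        self.attn = nn.MultiheadAttention(d, 4, batch_first=True)
        self.out = nn.Linear(d, vocab)

    def forward(self, feats):
        asr_h, _ = self.asr_enc(feats)              # pass 1: acoustic encoding
        tokens = self.asr_head(asr_h).argmax(-1)    # greedy ASR transcript
        lm_h, _ = self.lm_enc(self.lm_emb(tokens))  # pass 2: initial SLU pass
        delib, _ = self.attn(lm_h, asr_h, asr_h)    # pass 3: deliberation
        return self.out(delib)                      # final sequence prediction

logits = ThreePassSLU()(torch.randn(2, 50, 80))  # (batch, frames, mel bins)
print(logits.shape)  # torch.Size([2, 50, 1000])
```

The point this is meant to capture is that the final pass conditions on representations from both earlier subnetworks rather than on the LM output alone.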
Related papers
- Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks [68.79880423713597]
We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis.
Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts.
arXiv Detail & Related papers (2024-01-05T17:58:10Z)
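As a toy illustration of the idea in the entry above, the sketch below serializes word-confusion-network bins into an LLM prompt instead of committing to the 1-best hypothesis; the bin format and probabilities are invented for the example:

```python
# Toy illustration (not the paper's code): serialize word-confusion-network
# bins into an LLM prompt so the model sees ASR alternatives, not just 1-best.
wcn = [  # each bin: (candidate word, posterior probability) -- invented values
    [("play", 0.9), ("pray", 0.1)],
    [("some", 0.6), ("sum", 0.4)],
    [("jazz", 0.8), ("jads", 0.2)],
]

one_best = " ".join(max(b, key=lambda wp: wp[1])[0] for b in wcn)
alts = " ".join("(" + "|".join(f"{w}:{p:.1f}" for w, p in b) + ")" for b in wcn)

prompt = (
    "ASR alternatives per position, as (word:prob|...):\n"
    f"{alts}\n"
    "Task: classify the speaker's intent.\n"
)
print(one_best)  # play some jazz
print(prompt)
```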
- ML-LMCL: Mutual Learning and Large-Margin Contrastive Learning for Improving ASR Robustness in Spoken Language Understanding [55.39105863825107]
We propose Mutual Learning and Large-Margin Contrastive Learning (ML-LMCL) to improve automatic speech recognition (ASR) robustness.
In fine-tuning, we apply mutual learning and train two SLU models on the manual transcripts and the ASR transcripts, respectively.
Experiments on three datasets show that ML-LMCL outperforms existing models and achieves new state-of-the-art performance.
arXiv Detail & Related papers (2023-11-19T16:53:35Z)
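The sketch below illustrates the two training signals named in the entry above, mutual learning between the clean-transcript and ASR-transcript models and a large-margin contrastive term; both loss forms are plausible stand-ins, not necessarily the ML-LMCL formulation:

```python
# Hedged sketch of the two training signals (stand-in loss forms).
import torch
import torch.nn.functional as F

def mutual_learning_loss(logits_clean, logits_asr):
    # Each SLU model is pushed toward the other's (detached) distribution.
    kl = lambda p, q: F.kl_div(F.log_softmax(p, -1), F.softmax(q, -1).detach(),
                               reduction="batchmean")
    return kl(logits_asr, logits_clean) + kl(logits_clean, logits_asr)

def large_margin_contrastive(z_clean, z_asr, margin=0.5):
    # Pull paired clean/ASR utterance embeddings together and push apart
    # mismatched pairs until they are separated by at least the margin.
    z_clean, z_asr = F.normalize(z_clean, dim=-1), F.normalize(z_asr, dim=-1)
    sim = z_clean @ z_asr.t()                 # cosine similarity matrix
    pos = sim.diag()                          # matched pairs on the diagonal
    neg = sim - torch.eye(len(sim)) * 1e9     # mask out the diagonal
    return F.relu(margin - pos.unsqueeze(1) + neg).mean()

z1, z2 = torch.randn(4, 128), torch.randn(4, 128)
print(mutual_learning_loss(torch.randn(4, 10), torch.randn(4, 10)).item())
print(large_margin_contrastive(z1, z2).item())
```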
- Multimodal Audio-textual Architecture for Robust Spoken Language Understanding [18.702076738332867]
A multimodal language understanding (MLU) module is proposed to mitigate SLU performance degradation caused by errors in the ASR transcript.
Our model is evaluated on five tasks from three SLU datasets and robustness is tested using ASR transcripts from three ASR engines.
Results show that the proposed approach effectively mitigates the ASR error propagation problem, surpassing pretrained language model (PLM) baselines across all datasets for the academic ASR engine.
arXiv Detail & Related papers (2023-06-12T01:55:53Z)
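A minimal sketch of the multimodal idea above, assuming a simple late-fusion module over audio and ASR-text embeddings (the name MLUModule, the fusion choice, and all dimensions are hypothetical):

```python
# Illustrative late-fusion module over audio and ASR-text embeddings.
import torch
import torch.nn as nn

class MLUModule(nn.Module):
    def __init__(self, d_audio=512, d_text=768, d=256, n_intents=60):
        super().__init__()
        self.proj_a = nn.Linear(d_audio, d)  # audio encoder output
        self.proj_t = nn.Linear(d_text, d)   # PLM embedding of ASR transcript
        self.cls = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(),
                                 nn.Linear(d, n_intents))

    def forward(self, audio_emb, text_emb):
        # Concatenating both views lets acoustic evidence compensate for
        # recognition errors in the transcript branch.
        fused = torch.cat([self.proj_a(audio_emb), self.proj_t(text_emb)], -1)
        return self.cls(fused)

print(MLUModule()(torch.randn(2, 512), torch.randn(2, 768)).shape)  # (2, 60)
```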
- Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models [94.30953696090758]
We build compositional end-to-end spoken language understanding systems.
By relying on intermediate decoders trained for ASR, our end-to-end systems transform the input modality from speech to token-level representations.
Our models outperform both cascaded and direct end-to-end models on the sequence labeling task of named entity recognition.
arXiv Detail & Related papers (2022-10-27T19:33:18Z)
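The compositional setup above might look roughly like the following sketch, where an intermediate ASR decoder converts speech into token-level representations that a tagger then labels; everything here is an assumed stand-in:

```python
# Assumed stand-in for the compositional pipeline: an intermediate ASR
# decoder turns speech into token-level representations, then a tagger
# labels each token, so the NLU stage never touches raw audio.
import torch
import torch.nn as nn

class CompositionalSLU(nn.Module):
    def __init__(self, vocab=1000, n_tags=9, d=256):
        super().__init__()
        self.speech_enc = nn.GRU(80, d, batch_first=True)
        self.asr_head = nn.Linear(d, vocab)           # intermediate ASR decoder
        self.tok_emb = nn.Embedding(vocab, d)
        self.tagger = nn.GRU(d, d, batch_first=True)  # token-level NER tagger
        self.tag_head = nn.Linear(d, n_tags)

    def forward(self, feats):
        h, _ = self.speech_enc(feats)
        tokens = self.asr_head(h).argmax(-1)   # speech -> token-level input
        g, _ = self.tagger(self.tok_emb(tokens))
        return self.tag_head(g)                # one BIO tag per token position

print(CompositionalSLU()(torch.randn(2, 40, 80)).shape)  # (2, 40, 9)
```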
- STOP: A dataset for Spoken Task Oriented Semantic Parsing [66.14615249745448]
End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model.
We release the Spoken Task-Oriented semantic Parsing (STOP) dataset, the largest and most complex publicly available SLU dataset.
In addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems.
arXiv Detail & Related papers (2022-06-29T00:36:34Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Multi-task RNN-T with Semantic Decoder for Streamable Spoken Language Understanding [16.381644007368763]
End-to-end Spoken Language Understanding (E2E SLU) has attracted increasing interest due to its advantages of joint optimization and low latency.
We propose a streamable multi-task semantic transducer model to address these considerations.
Our proposed architecture predicts ASR and NLU labels auto-regressively and uses a semantic decoder to ingest both previously predicted word-pieces and slot tags.
arXiv Detail & Related papers (2022-04-01T16:38:56Z)
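A toy decoding loop for the semantic-decoder idea above, in which each step conditions on both the previously predicted word-piece and slot tag; the recurrent cell and shapes are assumptions, not the proposed transducer:

```python
# Toy autoregressive loop for a semantic decoder that ingests both the
# previous word-piece and the previous slot tag (assumed shapes).
import torch
import torch.nn as nn

class SemanticDecoder(nn.Module):
    def __init__(self, vocab=500, n_slots=20, d=128):
        super().__init__()
        self.wp_emb = nn.Embedding(vocab, d)
        self.slot_emb = nn.Embedding(n_slots, d)
        self.rnn = nn.GRUCell(2 * d, d)
        self.wp_head = nn.Linear(d, vocab)
        self.slot_head = nn.Linear(d, n_slots)

    def forward(self, steps=5):
        h = torch.zeros(1, self.rnn.hidden_size)
        wp = torch.zeros(1, dtype=torch.long)    # previous word-piece id
        slot = torch.zeros(1, dtype=torch.long)  # previous slot-tag id
        out = []
        for _ in range(steps):  # one step per emitted symbol, streamable
            x = torch.cat([self.wp_emb(wp), self.slot_emb(slot)], -1)
            h = self.rnn(x, h)
            wp = self.wp_head(h).argmax(-1)      # next word-piece
            slot = self.slot_head(h).argmax(-1)  # next slot tag
            out.append((wp.item(), slot.item()))
        return out

print(SemanticDecoder()(steps=3))
```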
- End-to-End Spoken Language Understanding using RNN-Transducer ASR [14.267028645397266]
We propose an end-to-end trained spoken language understanding (SLU) system that extracts transcripts, intents and slots from an input speech utterance.
It consists of a streaming recurrent neural network transducer (RNNT) based automatic speech recognition (ASR) model connected to a neural natural language understanding (NLU) model through a neural interface.
arXiv Detail & Related papers (2021-06-30T09:20:32Z)
- RNN Transducer Models For Spoken Language Understanding [49.07149742835825]
We show how RNN-T SLU models can be developed starting from pre-trained automatic speech recognition systems.
In settings where real audio data is not available, artificially synthesized speech is used to successfully adapt various SLU models.
arXiv Detail & Related papers (2021-04-08T15:35:22Z)
- Do as I mean, not as I say: Sequence Loss Training for Spoken Language Understanding [22.652754839140744]
Spoken language understanding (SLU) systems extract transcriptions, as well as semantics of intent or named entities from speech.
We propose non-differentiable sequence losses based on SLU metrics as a proxy for semantic error and use the REINFORCE trick to train ASR and SLU models with this loss.
We show that custom sequence loss training achieves state-of-the-art results on open SLU datasets and yields a 6% relative improvement in both ASR and NLU performance metrics.
arXiv Detail & Related papers (2021-02-12T20:09:08Z)
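The REINFORCE recipe in the last entry can be sketched as follows, with a stand-in label-overlap reward in place of a real SLU metric and a mean baseline for variance reduction:

```python
# Minimal REINFORCE sketch for a non-differentiable sequence metric
# (the reward and model here are stand-ins, not the paper's setup).
import torch
import torch.nn.functional as F

def slu_reward(sampled, reference):
    # Stand-in semantic metric: label overlap with the reference sequence.
    return (sampled == reference).float().mean(dim=-1)

logits = torch.randn(4, 6, 10, requires_grad=True)  # (batch, steps, labels)
dist = torch.distributions.Categorical(F.softmax(logits, dim=-1))
sampled = dist.sample()                              # sampled label sequences
reference = torch.randint(0, 10, (4, 6))

log_p = dist.log_prob(sampled).sum(-1)               # sequence log-likelihood
reward = slu_reward(sampled, reference)              # non-differentiable score
baseline = reward.mean()                             # variance reduction
loss = -((reward - baseline) * log_p).mean()         # score-function gradient
loss.backward()
print(loss.item())
```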