End-to-end spoken language understanding using joint CTC loss and
self-supervised, pretrained acoustic encoders
- URL: http://arxiv.org/abs/2305.02937v2
- Date: Fri, 2 Jun 2023 13:25:06 GMT
- Title: End-to-end spoken language understanding using joint CTC loss and
self-supervised, pretrained acoustic encoders
- Authors: Jixuan Wang, Martin Radfar, Kai Wei, Clement Chung
- Abstract summary: We leverage self-supervised acoustic encoders fine-tuned with Connectionist Temporal Classification to extract textual embeddings.
Our model achieves 4% absolute improvement over the state-of-the-art (SOTA) dialogue act classification model on the DSTC2 dataset.
- Score: 13.722028186368737
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It is challenging to extract semantic meanings directly from audio signals in
spoken language understanding (SLU), due to the lack of textual information.
Popular end-to-end (E2E) SLU models utilize sequence-to-sequence automatic
speech recognition (ASR) models to extract textual embeddings as input to infer
semantics, which, however, require computationally expensive auto-regressive
decoding. In this work, we leverage self-supervised acoustic encoders
fine-tuned with Connectionist Temporal Classification (CTC) to extract textual
embeddings and use joint CTC and SLU losses for utterance-level SLU tasks.
Experiments show that our model achieves 4% absolute improvement over the
state-of-the-art (SOTA) dialogue act classification model on the DSTC2 dataset
and 1.3% absolute improvement over the SOTA SLU model on the SLURP dataset.
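A minimal sketch of the joint objective described above, assuming a wav2vec 2.0-style pretrained encoder: one linear head emits frame-level CTC logits, another classifies a pooled utterance embedding, and the two losses are interpolated. The class names, the mean pooling, and the `ctc_weight` interpolation are illustrative assumptions, not details taken from the paper.
```python
import torch.nn as nn
import torch.nn.functional as F

class JointCtcSluModel(nn.Module):
    """Pretrained acoustic encoder with a frame-level CTC head and an
    utterance-level SLU head, trained with a weighted joint loss."""

    def __init__(self, encoder, hidden_dim, vocab_size, num_labels):
        super().__init__()
        self.encoder = encoder                             # e.g. a fine-tuned wav2vec 2.0-style encoder
        self.ctc_head = nn.Linear(hidden_dim, vocab_size)  # frame-level token logits
        self.slu_head = nn.Linear(hidden_dim, num_labels)  # utterance-level logits

    def forward(self, audio):
        h = self.encoder(audio)                    # (batch, frames, hidden_dim)
        ctc_logits = self.ctc_head(h)              # (batch, frames, vocab_size)
        slu_logits = self.slu_head(h.mean(dim=1))  # mean-pool frames, then classify
        return ctc_logits, slu_logits

def joint_loss(ctc_logits, slu_logits, tokens, frame_lens, token_lens,
               labels, ctc_weight=0.3):
    # tokens: (batch, max_len) padded target token ids; frame_lens/token_lens: 1-d int tensors.
    # F.ctc_loss expects (frames, batch, vocab) log-probabilities.
    log_probs = F.log_softmax(ctc_logits, dim=-1).transpose(0, 1)
    ctc = F.ctc_loss(log_probs, tokens, frame_lens, token_lens, blank=0)
    slu = F.cross_entropy(slu_logits, labels)  # e.g. dialogue-act classification
    return ctc_weight * ctc + (1.0 - ctc_weight) * slu
```
At inference time the CTC head can be dropped entirely, so no auto-regressive decoding is needed, which is the efficiency advantage the abstract highlights.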
Related papers
- Token-level Sequence Labeling for Spoken Language Understanding using
Compositional End-to-End Models [94.30953696090758]
We build compositional end-to-end spoken language understanding systems.
By relying on intermediate decoders trained for ASR, our end-to-end systems transform the input modality from speech to token-level representations.
Our models outperform both cascaded and direct end-to-end models on the token-level labeling task of named entity recognition.
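A hedged sketch of that compositional layout, with hypothetical module names; the point is that the ASR stage's token-level outputs feed the tagger inside one differentiable model rather than across a hard transcript boundary.
```python
import torch.nn as nn

class CompositionalSLU(nn.Module):
    """Speech encoder -> intermediate ASR decoder -> token-level tagger,
    composed so gradients flow end to end (unlike a cascaded pipeline)."""

    def __init__(self, speech_encoder, asr_decoder, tagger):
        super().__init__()
        self.speech_encoder = speech_encoder  # audio frames -> acoustic states
        self.asr_decoder = asr_decoder        # acoustic states -> token-level states
        self.tagger = tagger                  # token-level states -> per-token tag logits

    def forward(self, audio):
        acoustic_states = self.speech_encoder(audio)      # (batch, frames, dim)
        token_states = self.asr_decoder(acoustic_states)  # (batch, tokens, dim)
        return self.tagger(token_states)                  # (batch, tokens, num_tags)
```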
arXiv Detail & Related papers (2022-10-27T19:33:18Z)
- Two-Pass Low Latency End-to-End Spoken Language Understanding [36.81762807197944]
We incorporated language models pre-trained on unlabeled text data inside E2E-SLU frameworks to build strong semantic representations.
We developed a 2-pass SLU system that makes a low-latency prediction using acoustic information from the first few seconds of the audio in the first pass.
Our code and models are publicly available as part of the ESPnet-SLU toolkit.
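A rough illustration of the two-pass idea under stated assumptions: `first_pass_model`, `second_pass_model`, and the two-second prefix are placeholders for this sketch, not the ESPnet-SLU API.
```python
import torch

def two_pass_predict(first_pass_model, second_pass_model, audio,
                     sample_rate=16000, first_pass_seconds=2.0):
    """Emit a fast intent hypothesis from the opening seconds of audio,
    then refine it once the full utterance has been observed."""
    prefix = audio[:, : int(first_pass_seconds * sample_rate)]
    with torch.no_grad():
        early_logits = first_pass_model(prefix)   # low-latency first-pass guess
        final_logits = second_pass_model(audio)   # full-context second pass
    return early_logits, final_logits
```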
arXiv Detail & Related papers (2022-07-14T05:50:16Z)
- STOP: A dataset for Spoken Task Oriented Semantic Parsing [66.14615249745448]
End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model.
We release the Spoken Task-Oriented semantic Parsing (STOP) dataset, the largest and most complex publicly available SLU dataset.
In addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems.
arXiv Detail & Related papers (2022-06-29T00:36:34Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- RNN Transducer Models For Spoken Language Understanding [49.07149742835825]
We show how RNN-T SLU models can be developed starting from pre-trained automatic speech recognition systems.
In settings where real audio data is not available, artificially synthesized speech is used to successfully adapt various SLU models.
arXiv Detail & Related papers (2021-04-08T15:35:22Z)
- Do as I mean, not as I say: Sequence Loss Training for Spoken Language Understanding [22.652754839140744]
Spoken language understanding (SLU) systems extract transcriptions, as well as semantics of intent or named entities from speech.
We propose non-differentiable sequence losses based on SLU metrics as a proxy for semantic error and use the REINFORCE trick to train ASR and SLU models with these losses.
We show that custom sequence loss training is the state-of-the-art on open SLU datasets and leads to 6% relative improvement in both ASR and NLU performance metrics.
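A compact sketch of the REINFORCE surrogate described above; `metric_fn` stands in for whatever SLU metric supplies the reward (e.g. one minus a semantic error rate), and the constant baseline is a simplification of the variance-reduction schemes such systems typically use.
```python
import torch

def reinforce_sequence_loss(logits, references, metric_fn, baseline=0.0):
    """REINFORCE surrogate for a non-differentiable sequence-level metric.
    logits: (batch, steps, vocab) decoder outputs; metric_fn scores a sampled
    hypothesis against its reference and returns a scalar reward."""
    dist = torch.distributions.Categorical(logits=logits)
    samples = dist.sample()                       # (batch, steps) sampled hypotheses
    log_prob = dist.log_prob(samples).sum(dim=1)  # log p(hypothesis) per utterance
    with torch.no_grad():                         # the reward itself is never differentiated
        rewards = torch.tensor(
            [metric_fn(s, r) for s, r in zip(samples, references)],
            dtype=log_prob.dtype, device=log_prob.device)
    # Policy-gradient estimator: raise log-probability of high-reward samples.
    return -((rewards - baseline) * log_prob).mean()
```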
arXiv Detail & Related papers (2021-02-12T20:09:08Z)
- Speech-language Pre-training for End-to-end Spoken Language Understanding [18.548949994603213]
We propose to unify a well-optimized E2E ASR encoder (speech) and a pre-trained language model encoder (language) into a transformer decoder.
The experimental results on two public corpora show that our approach to E2E SLU is superior to the conventional cascaded method.
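One plausible reading of that fusion, sketched with a standard transformer decoder whose memory is the concatenation of speech-encoder and language-model encoder states; the dimensions and the learned query token are assumptions made for illustration.
```python
import torch
import torch.nn as nn

class SpeechLanguageFusionDecoder(nn.Module):
    """Transformer decoder that cross-attends jointly to speech-encoder
    states and pretrained LM-encoder states to predict an intent."""

    def __init__(self, dim=256, heads=4, layers=2, num_intents=10):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # learned [CLS]-style query
        self.classifier = nn.Linear(dim, num_intents)

    def forward(self, speech_states, text_states):
        # Memory = speech encoder states ++ LM encoder states.
        memory = torch.cat([speech_states, text_states], dim=1)
        q = self.query.expand(speech_states.size(0), -1, -1)
        fused = self.decoder(q, memory)             # (batch, 1, dim)
        return self.classifier(fused.squeeze(1))    # intent logits
```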
arXiv Detail & Related papers (2021-02-11T21:55:48Z)
- Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)