RNN Transducer Models For Spoken Language Understanding
- URL: http://arxiv.org/abs/2104.03842v1
- Date: Thu, 8 Apr 2021 15:35:22 GMT
- Title: RNN Transducer Models For Spoken Language Understanding
- Authors: Samuel Thomas, Hong-Kwang J. Kuo, George Saon, Zoltán Tüske, Brian
Kingsbury, Gakuto Kurata, Zvi Kons, Ron Hoory
- Abstract summary: We show how RNN-T SLU models can be developed starting from pre-trained automatic speech recognition systems.
In settings where real audio data is not available, artificially synthesized speech is used to successfully adapt various SLU models.
- Score: 49.07149742835825
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a comprehensive study on building and adapting RNN transducer
(RNN-T) models for spoken language understanding (SLU). These end-to-end (E2E)
models are constructed in three practical settings: a case where verbatim
transcripts are available, a constrained case where the only available
annotations are SLU labels and their values, and a more restrictive case where
transcripts are available but not corresponding audio. We show how RNN-T SLU
models can be developed starting from pre-trained automatic speech recognition
(ASR) systems, followed by an SLU adaptation step. In settings where real audio
data is not available, artificially synthesized speech is used to successfully
adapt various SLU models. When evaluated on two SLU data sets, the ATIS corpus
and a customer call center data set, the proposed models closely track the
performance of other E2E models and achieve state-of-the-art results.
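
To make the architecture concrete, below is a minimal RNN-T sketch in PyTorch: a transcription (acoustic) encoder, a prediction network over the label history, and a joint network trained with the transducer loss. It assumes torchaudio >= 0.10 for rnnt_loss; all layer sizes, the vocabulary size, and the blank index are illustrative choices, not values taken from the paper.

```python
import torch
import torch.nn as nn
import torchaudio.functional as F_audio


class RNNTransducer(nn.Module):
    """Minimal RNN-T: acoustic encoder + prediction network + joint network."""

    def __init__(self, num_feats=80, vocab_size=1000, hidden=512):
        super().__init__()
        # Transcription network: encodes acoustic feature frames.
        self.encoder = nn.LSTM(num_feats, hidden, num_layers=2, batch_first=True)
        # Prediction network: encodes the output-label history (index 0 = blank).
        self.embed = nn.Embedding(vocab_size, hidden)
        self.predictor = nn.LSTM(hidden, hidden, batch_first=True)
        # Joint network: combines both streams into per-label logits.
        self.joiner = nn.Sequential(nn.Tanh(), nn.Linear(hidden, vocab_size))

    def forward(self, feats, targets):
        enc, _ = self.encoder(feats)                  # (B, T, H)
        emb = self.embed(targets)                     # (B, U, H)
        # Prepend a zero "start" step so the predictor yields U+1 frames.
        emb = torch.cat([torch.zeros_like(emb[:, :1]), emb], dim=1)
        pred, _ = self.predictor(emb)                 # (B, U+1, H)
        # Broadcast-add encoder and predictor states into the joint lattice.
        joint = enc.unsqueeze(2) + pred.unsqueeze(1)  # (B, T, U+1, H)
        return self.joiner(joint)                     # (B, T, U+1, V)


model = RNNTransducer()
feats = torch.randn(4, 200, 80)                      # dummy filterbank features
targets = torch.randint(1, 1000, (4, 30), dtype=torch.int32)  # 0 is the blank
logits = model(feats, targets)
loss = F_audio.rnnt_loss(
    logits,
    targets,
    logit_lengths=torch.full((4,), 200, dtype=torch.int32),
    target_lengths=torch.full((4,), 30, dtype=torch.int32),
    blank=0,
)
loss.backward()
```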
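The adaptation step the abstract describes can then be sketched as fine-tuning from the ASR checkpoint. One common E2E recipe, consistent with the abstract, is to extend the output vocabulary with SLU label tokens and train on audio that may be TTS-synthesized when real recordings are unavailable. The 16-token label inventory and the dummy batch below are invented for illustration, and the snippet reuses the RNNTransducer class from the sketch above.

```python
# Stand-ins for the pre-trained ASR model and the SLU model whose output
# vocabulary is extended with (here, 16 hypothetical) SLU label tokens.
asr = RNNTransducer(vocab_size=1000)
slu = RNNTransducer(vocab_size=1000 + 16)

# Copy every tensor whose shape survived the vocabulary extension; the
# enlarged embedding and joiner parameters keep their fresh initialization.
slu_state = slu.state_dict()
for name, tensor in asr.state_dict().items():
    if slu_state[name].shape == tensor.shape:
        slu_state[name] = tensor
slu.load_state_dict(slu_state)

# One illustrative fine-tuning step on a dummy batch; in the low-resource
# settings above, feats would come from TTS-synthesized speech and targets
# would interleave words with SLU label tokens.
optim = torch.optim.Adam(slu.parameters(), lr=1e-4)
feats = torch.randn(2, 150, 80)
targets = torch.randint(1, 1016, (2, 24), dtype=torch.int32)
loss = F_audio.rnnt_loss(
    slu(feats, targets),
    targets,
    logit_lengths=torch.full((2,), 150, dtype=torch.int32),
    target_lengths=torch.full((2,), 24, dtype=torch.int32),
    blank=0,
)
optim.zero_grad()
loss.backward()
optim.step()
```

Since only the extended rows start from random initialization, most of the pre-trained acoustic knowledge carries over intact, which is the property the adaptation step relies on.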
Related papers
- End-to-end spoken language understanding using joint CTC loss and self-supervised, pretrained acoustic encoders [13.722028186368737]
We leverage self-supervised acoustic encoders fine-tuned with Connectionist Temporal Classification (CTC) to extract textual embeddings.
Our model achieves 4% absolute improvement over the state-of-the-art (SOTA) dialogue act classification model on the DSTC2 dataset.
arXiv Detail & Related papers (2023-05-04T15:36:37Z)
- STOP: A dataset for Spoken Task Oriented Semantic Parsing [66.14615249745448]
End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model.
We release the Spoken Task-Oriented semantic Parsing (STOP) dataset, the largest and most complex publicly available SLU dataset.
In addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems.
arXiv Detail & Related papers (2022-06-29T00:36:34Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Speech-language Pre-training for End-to-end Spoken Language Understanding [18.548949994603213]
We propose to unify a well-optimized E2E ASR encoder (speech) and a pre-trained language model encoder (language) into a transformer decoder.
The experimental results on two public corpora show that our approach to E2E SLU is superior to the conventional cascaded method.
arXiv Detail & Related papers (2021-02-11T21:55:48Z)
- Towards Semi-Supervised Semantics Understanding from Speech [15.672850567147854]
We propose a framework that learns semantics directly from speech, with semi-supervision from transcribed speech, to address these challenges.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT, and fine-tuned on a limited amount of target SLU corpus.
arXiv Detail & Related papers (2020-11-11T01:48:09Z)
- Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)