End-to-End Spoken Language Understanding Without Full Transcripts
- URL: http://arxiv.org/abs/2009.14386v1
- Date: Wed, 30 Sep 2020 01:54:13 GMT
- Title: End-to-End Spoken Language Understanding Without Full Transcripts
- Authors: Hong-Kwang J. Kuo, Zoltán Tüske, Samuel Thomas, Yinghui Huang, Kartik Audhkhasi, Brian Kingsbury, Gakuto Kurata, Zvi Kons, Ron Hoory, and Luis Lastras
- Abstract summary: We develop end-to-end (E2E) spoken language understanding systems that directly convert speech input to semantic entities.
We create two types of such speech-to-entities models, a CTC model and an attention-based encoder-decoder model.
For our speech-to-entities experiments on the ATIS corpus, both the CTC and attention models showed an impressive ability to skip non-entity words.
- Score: 38.19173637496798
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An essential component of spoken language understanding (SLU) is slot
filling: representing the meaning of a spoken utterance using semantic entity
labels. In this paper, we develop end-to-end (E2E) spoken language
understanding systems that directly convert speech input to semantic entities
and investigate if these E2E SLU models can be trained solely on semantic
entity annotations without word-for-word transcripts. Training such models is
very useful because it can drastically reduce the cost of data collection. We
created two types of such speech-to-entities models, a CTC model and an
attention-based encoder-decoder model, by adapting models trained originally
for speech recognition. Given that our experiments involve speech input, these
systems need to recognize both the entity label and the words representing the
entity value correctly. For our speech-to-entities experiments on the ATIS
corpus, both the CTC and attention models showed an impressive ability to skip
non-entity words: there was little degradation when trained on just entities
versus full transcripts. We also explored the scenario where the entities are
in an order not necessarily related to spoken order in the utterance. With its
ability to do re-ordering, the attention model did remarkably well, achieving
only about 2% degradation in speech-to-bag-of-entities F1 score.
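To make the entity-only training targets and the bag-of-entities metric concrete, here is a minimal Python sketch under stated assumptions: the BIO-style slot tags, the (label, value) pair representation, and the function names are illustrative, not the authors' implementation.

    # Illustrative sketch only; tag format and helpers are assumptions,
    # not the paper's actual code.
    from collections import Counter

    def entity_only_target(words, slot_tags):
        """Collapse a slot-tagged transcript into entity-only annotations:
        a list of (entity label, entity value) pairs with every non-entity
        word dropped. Serializing these pairs gives a transcript-free target."""
        target, label, value = [], None, []
        for word, tag in zip(words, slot_tags):
            if tag == "O":                      # non-entity word: skipped
                if label:
                    target.append((label, " ".join(value)))
                    label, value = None, []
                continue
            prefix, name = tag.split("-", 1)    # e.g. "B-fromloc.city_name"
            if prefix == "B" or name != label:  # a new entity starts here
                if label:
                    target.append((label, " ".join(value)))
                label, value = name, [word]
            else:                               # "I-" continuation
                value.append(word)
        if label:
            target.append((label, " ".join(value)))
        return target

    def bag_of_entities_f1(hyp, ref):
        """F1 over unordered multisets of (label, value) pairs, so the score
        ignores the order in which entities are emitted."""
        if not hyp or not ref:
            return 0.0
        n_correct = sum((Counter(hyp) & Counter(ref)).values())
        if n_correct == 0:
            return 0.0
        precision, recall = n_correct / len(hyp), n_correct / len(ref)
        return 2 * precision * recall / (precision + recall)

    # Example: "i want to fly from boston to denver"
    words = "i want to fly from boston to denver".split()
    tags = ["O", "O", "O", "O", "O", "B-fromloc.city_name", "O", "B-toloc.city_name"]
    ref = entity_only_target(words, tags)
    # ref == [("fromloc.city_name", "boston"), ("toloc.city_name", "denver")]
    hyp = [("toloc.city_name", "denver"), ("fromloc.city_name", "boston")]
    print(bag_of_entities_f1(hyp, ref))  # 1.0: emission order does not matter

An order-insensitive score of this kind is what allows the attention model's re-ordering ability to be credited rather than penalized when the annotated entities do not follow spoken order.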
Related papers
- Introducing Semantics into Speech Encoders [91.37001512418111]
We propose an unsupervised way of incorporating semantic information from large language models into self-supervised speech encoders without labeled audio transcriptions.
Our approach achieves similar performance as supervised methods trained on over 100 hours of labeled audio transcripts.
arXiv Detail & Related papers (2022-11-15T18:44:28Z)
- Bootstrapping meaning through listening: Unsupervised learning of spoken sentence embeddings [4.582129557845177]
This study tackles the unsupervised learning of semantic representations for spoken utterances.
We propose WavEmbed, a sequential autoencoder that predicts hidden units from a dense representation of speech.
We also propose S-HuBERT to induce meaning through knowledge distillation.
arXiv Detail & Related papers (2022-10-23T21:16:09Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
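As a rough illustration of the pseudo-language idea (a sketch only, not the Wav2Seq recipe), discrete pseudo-tokens can be obtained by clustering frame-level speech features and collapsing consecutive repeats; the feature source, cluster count, and helper names below are assumptions.

    # Sketch: derive a "pseudo language" by quantizing speech features.
    # Feature source, cluster count, and de-duplication are assumptions here.
    import numpy as np
    from sklearn.cluster import KMeans

    def build_pseudo_tokenizer(feature_matrices, n_units=100, seed=0):
        """Fit k-means over frame-level features pooled from many utterances;
        each cluster id serves as one pseudo-token."""
        frames = np.concatenate(feature_matrices, axis=0)  # (total_frames, dim)
        return KMeans(n_clusters=n_units, random_state=seed, n_init=10).fit(frames)

    def pseudo_transcribe(tokenizer, features):
        """Map an utterance's frames to cluster ids and collapse consecutive
        repeats, yielding a compact discrete pseudo-transcript that an
        encoder-decoder can be pre-trained to predict from the audio."""
        ids = tokenizer.predict(features)
        return [int(ids[0])] + [int(c) for p, c in zip(ids, ids[1:]) if c != p]

    # Toy usage with random vectors standing in for real speech features.
    rng = np.random.default_rng(0)
    utterances = [rng.normal(size=(200, 39)) for _ in range(8)]  # e.g. 39-dim MFCCs
    tok = build_pseudo_tokenizer(utterances, n_units=16)
    print(pseudo_transcribe(tok, utterances[0])[:20])

A real pipeline would use self-supervised speech representations and typically a subword model over the unit sequences, but the sketch shows why such pseudo transcripts are compact and require no human transcription.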
- Improving End-to-End Models for Set Prediction in Spoken Language Understanding [26.781489293420055]
We propose a novel data augmentation technique along with an implicit attention based alignment method to infer the spoken order.
F1 scores significantly increased by more than 11% for RNN-T and about 2% for attention-based encoder-decoder SLU models, outperforming previously reported results.
arXiv Detail & Related papers (2022-01-28T13:23:17Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Speech-language Pre-training for End-to-end Spoken Language Understanding [18.548949994603213]
We propose to unify a well-optimized E2E ASR encoder (speech) and a pre-trained language model encoder (language) into a transformer decoder.
The experimental results on two public corpora show that our approach to E2E SLU is superior to the conventional cascaded method.
arXiv Detail & Related papers (2021-02-11T21:55:48Z)
- Towards Semi-Supervised Semantics Understanding from Speech [15.672850567147854]
We propose a framework to learn semantics directly from speech, with semi-supervision from transcribed speech, to address these challenges.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT, and fine-tuned on a limited amount of target SLU corpus.
arXiv Detail & Related papers (2020-11-11T01:48:09Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)