Speech-language Pre-training for End-to-end Spoken Language
Understanding
- URL: http://arxiv.org/abs/2102.06283v1
- Date: Thu, 11 Feb 2021 21:55:48 GMT
- Title: Speech-language Pre-training for End-to-end Spoken Language
Understanding
- Authors: Yao Qian, Ximo Bian, Yu Shi, Naoyuki Kanda, Leo Shen, Zhen Xiao and
Michael Zeng
- Abstract summary: We propose to unify a well-optimized E2E ASR encoder (speech) and a pre-trained language model encoder (language) into a transformer decoder.
The experimental results on two public corpora show that our approach to E2E SLU is superior to the conventional cascaded method.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end (E2E) spoken language understanding (SLU) can infer semantics
directly from speech signal without cascading an automatic speech recognizer
(ASR) with a natural language understanding (NLU) module. However, paired
utterance recordings and corresponding semantics may not always be available or
sufficient to train an E2E SLU model in a real production environment. In this
paper, we propose to unify a well-optimized E2E ASR encoder (speech) and a
pre-trained language model encoder (language) into a transformer decoder. The
unified speech-language pre-trained model (SLP) is continually enhanced on
limited labeled data from a target domain by using a conditional masked
language model (MLM) objective, and can thus effectively generate a sequence of
intent, slot type, and slot value for a given input speech at inference time. The
experimental results on two public corpora show that our approach to E2E SLU is
superior to the conventional cascaded method. It also outperforms the present
state-of-the-art approaches to E2E SLU with much less paired data.
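As an illustrative sketch (not the authors' implementation), the conditional MLM fine-tuning step can be pictured as randomly masking tokens in the target semantic sequence of intent, slot types, and slot values, then training the decoder to recover the masked tokens while conditioned on the speech/language encoder states (omitted here). All names and the token format below are hypothetical.

```python
import random

MASK = "<mask>"

def make_conditional_mlm_pair(semantic_tokens, mask_prob=0.3, seed=0):
    """Build one conditional-MLM training example: the decoder input is the
    masked semantic sequence, and the labels mark which original tokens must
    be predicted at the masked positions (None means no loss at that slot)."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in semantic_tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)      # predict the original token here
        else:
            inputs.append(tok)
            labels.append(None)     # unmasked positions contribute no loss
    return inputs, labels

# A toy semantic sequence: intent followed by slot-type/slot-value pairs.
seq = ["intent:PlayMusic", "slot:artist", "value:queen",
       "slot:song", "value:bohemian_rhapsody"]
masked, targets = make_conditional_mlm_pair(seq, mask_prob=0.4, seed=1)
```

In the paper's setting the masked-token predictions would additionally attend to the unified speech and language encoder outputs, which is what makes the MLM objective "conditional".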
Related papers
- One model to rule them all? Towards End-to-End Joint Speaker
Diarization and Speech Recognition [50.055765860343286]
This paper presents a novel framework for joint speaker diarization and automatic speech recognition.
The framework, named SLIDAR, can process arbitrary length inputs and can handle any number of speakers.
Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
arXiv Detail & Related papers (2023-10-02T23:03:30Z)
- Transfer Learning from Pre-trained Language Models Improves End-to-End
Speech Summarization [48.35495352015281]
End-to-end speech summarization (E2E SSum) directly summarizes input speech into easy-to-read short sentences with a single model.
Due to the high cost of collecting speech-summary pairs, an E2E SSum model tends to suffer from training data scarcity and output unnatural sentences.
We propose for the first time to integrate a pre-trained language model (LM) into the E2E SSum decoder via transfer learning.
arXiv Detail & Related papers (2023-06-07T08:23:58Z)
- End-to-End Spoken Language Understanding: Performance analyses of a
voice command task in a low resource setting [0.3867363075280543]
We present a study identifying the signal features and other linguistic properties used by an E2E model to perform the Spoken Language Understanding task.
The study is carried out in the application domain of a smart home that has to handle non-English (here French) voice commands.
arXiv Detail & Related papers (2022-07-17T13:51:56Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo
Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
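As a toy illustration of how a pseudo language might be induced (a sketch under stated assumptions, not Wav2Seq's actual pipeline), continuous speech features can be mapped to discrete pseudo-tokens by nearest-centroid quantization, with consecutive duplicates merged to shorten the pseudo transcript. The centroids and feature values below are made up for demonstration.

```python
def quantize(features, centroids):
    """Map each continuous feature vector to the index of its nearest
    centroid (squared Euclidean distance) -- a stand-in for the k-means
    quantization step that induces discrete pseudo-tokens."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centroids)), key=lambda k: dist2(f, centroids[k]))
            for f in features]

def collapse_repeats(token_ids):
    """Merge runs of identical ids, compacting the pseudo transcript."""
    out = []
    for t in token_ids:
        if not out or out[-1] != t:
            out.append(t)
    return out

# Two hypothetical centroids and five 2-D "frames" of speech features.
centroids = [(0.0, 0.0), (1.0, 1.0)]
frames = [(0.1, 0.0), (0.2, 0.1), (0.9, 1.1), (1.0, 0.8), (0.0, 0.2)]
pseudo = collapse_repeats(quantize(frames, centroids))
# quantize -> [0, 0, 1, 1, 0]; after collapsing repeats -> [0, 1, 0]
```

The resulting discrete sequence can then serve as the target of a self-supervised pseudo speech recognition task, as the abstract describes.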
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Improving End-to-End Models for Set Prediction in Spoken Language
Understanding [26.781489293420055]
We propose a novel data augmentation technique along with an implicit attention based alignment method to infer the spoken order.
F1 scores significantly increased by more than 11% for RNN-T and about 2% for attention based encoder-decoder SLU models, outperforming previously reported results.
arXiv Detail & Related papers (2022-01-28T13:23:17Z)
- RNN Transducer Models For Spoken Language Understanding [49.07149742835825]
We show how RNN-T SLU models can be developed starting from pre-trained automatic speech recognition systems.
In settings where real audio data is not available, artificially synthesized speech is used to successfully adapt various SLU models.
arXiv Detail & Related papers (2021-04-08T15:35:22Z)
- Towards Semi-Supervised Semantics Understanding from Speech [15.672850567147854]
To address these challenges, we propose a framework to learn semantics directly from speech with semi-supervision from transcribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT, and fine-tuned on a limited amount of target SLU corpus.
arXiv Detail & Related papers (2020-11-11T01:48:09Z)
- Semi-Supervised Spoken Language Understanding via Self-Supervised Speech
and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.