Improving End-to-End Models for Set Prediction in Spoken Language Understanding
- URL: http://arxiv.org/abs/2201.12105v1
- Date: Fri, 28 Jan 2022 13:23:17 GMT
- Title: Improving End-to-End Models for Set Prediction in Spoken Language Understanding
- Authors: Hong-Kwang J. Kuo, Zoltan Tuske, Samuel Thomas, Brian Kingsbury, George Saon
- Abstract summary: We propose a novel data augmentation technique along with an implicit attention-based alignment method to infer the spoken order.
F1 scores significantly increased by more than 11% for RNN-T and about 2% for attention-based encoder-decoder SLU models, outperforming previously reported results.
- Score: 26.781489293420055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of spoken language understanding (SLU) systems is to determine the
meaning of the input speech signal, unlike speech recognition, which aims to
produce verbatim transcripts. Advances in end-to-end (E2E) speech modeling have
made it possible to train solely on semantic entities, which are far cheaper to
collect than verbatim transcripts. We focus on this set prediction problem,
where entity order is unspecified. Using two classes of E2E models, RNN
transducers and attention-based encoder-decoders, we show that these models
work best when the training entity sequence is arranged in spoken order. To
improve E2E SLU models when entity spoken order is unknown, we propose a novel
data augmentation technique along with an implicit attention-based alignment
method to infer the spoken order. F1 scores significantly increased by more
than 11% for RNN-T and about 2% for attention-based encoder-decoder SLU models,
outperforming previously reported results.
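This summary does not spell out the augmentation or alignment procedures, so the sketch below illustrates only the core augmentation idea: when the spoken order of entities is unknown, several random orderings of the same entity set can be emitted as alternative training targets. The slot names, the `slot=value` serialization, and the function itself are hypothetical, not the paper's implementation.

```python
import math
import random

def permute_entity_targets(entities, n_variants=4, seed=0):
    """Serialize an unordered entity set into several candidate target
    sequences by sampling random permutations of the entities."""
    rng = random.Random(seed)
    variants = set()
    limit = min(n_variants, math.factorial(len(entities)))
    while len(variants) < limit:
        order = list(entities)
        rng.shuffle(order)
        variants.add(" ".join(f"{slot}={value}" for slot, value in order))
    return sorted(variants)

# Hypothetical ATIS-style (slot, value) entities with no known spoken order.
entities = [("fromloc.city_name", "dallas"),
            ("toloc.city_name", "boston"),
            ("depart_date.day_name", "friday")]
for target in permute_entity_targets(entities):
    print(target)
```

The paper pairs such augmentation with an implicit attention-based alignment step to recover the spoken order; that component is model-internal and not reproduced here.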
Related papers
- SpeechAlign: Aligning Speech Generation to Human Preferences [51.684183257809075]
We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models to human preferences.
We show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model.
arXiv Detail & Related papers (2024-04-08T15:21:17Z)
- Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization [48.35495352015281]
End-to-end speech summarization (E2E SSum) directly summarizes input speech into easy-to-read short sentences with a single model.
Due to the high cost of collecting speech-summary pairs, an E2E SSum model tends to suffer from training data scarcity and output unnatural sentences.
We propose, for the first time, to integrate a pre-trained language model (LM) into the E2E SSum decoder via transfer learning.
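As a loose illustration of this kind of transfer learning (the parameter names and the name-and-shape matching rule are assumptions, not the paper's procedure), a decoder can be warm-started from a pretrained LM wherever the weights line up:

```python
import numpy as np

def init_decoder_from_lm(decoder_params, lm_params):
    """Warm-start decoder weights from a pretrained LM wherever parameter
    names and shapes match; everything else keeps its random init."""
    copied = []
    for name, weight in lm_params.items():
        if name in decoder_params and decoder_params[name].shape == weight.shape:
            decoder_params[name] = weight.copy()
            copied.append(name)
    return copied

# Hypothetical parameter dicts; cross-attention stays randomly initialized.
lm = {"embed": np.ones((100, 8)), "layer0.ffn": np.ones((8, 8))}
dec = {"embed": np.zeros((100, 8)), "layer0.ffn": np.zeros((8, 8)),
       "cross_attn": np.zeros((8, 8))}
print(init_decoder_from_lm(dec, lm))  # ['embed', 'layer0.ffn']
```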
arXiv Detail & Related papers (2023-06-07T08:23:58Z)
- Bidirectional Representations for Low Resource Spoken Language Understanding [39.208462511430554]
We propose a representation model to encode speech in bidirectional rich encodings.
The approach uses a masked language modelling objective to learn the representations.
We show that the performance of the resulting encodings is better than comparable models on multiple datasets.
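A generic sketch of a masked-prediction setup on speech features follows; the 15% masking ratio, frame-level (rather than span-level) masking, and feature dimensions are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

def mask_frames(features, mask_ratio=0.15, mask_value=0.0, seed=0):
    """Randomly corrupt feature frames for a masked-prediction objective;
    the model is trained to reconstruct only the masked positions."""
    rng = np.random.default_rng(seed)
    mask = rng.random(features.shape[0]) < mask_ratio
    corrupted = features.copy()
    corrupted[mask] = mask_value
    return corrupted, mask

# 200 frames of hypothetical 80-dim log-mel features.
feats = np.random.randn(200, 80).astype(np.float32)
corrupted, mask = mask_frames(feats)
print(f"masked {mask.sum()} of {mask.size} frames")
```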
arXiv Detail & Related papers (2022-11-24T17:05:16Z)
- Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels from different source languages are concatenated.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
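A minimal sketch of concatenation-style augmentation, assuming utterances stored as waveform/label pairs at a shared sample rate (the field names are hypothetical; a real pipeline would also match loudness and possibly insert language tags):

```python
import numpy as np

def concat_utterances(utt_a, utt_b):
    """Splice two monolingual utterances into one synthetic code-switched
    training example by concatenating both the audio and the labels."""
    return {
        "audio": np.concatenate([utt_a["audio"], utt_b["audio"]]),
        "labels": utt_a["labels"] + " " + utt_b["labels"],
    }

english = {"audio": np.zeros(16000, dtype=np.float32), "labels": "good morning"}
german = {"audio": np.zeros(16000, dtype=np.float32), "labels": "guten morgen"}
mixed = concat_utterances(english, german)
print(mixed["labels"], mixed["audio"].shape)  # good morning guten morgen (32000,)
```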
arXiv Detail & Related papers (2022-10-17T12:15:57Z)
- E2S2: Encoding-Enhanced Sequence-to-Sequence Pretraining for Language Understanding and Generation [95.49128988683191]
Sequence-to-sequence (seq2seq) learning is a popular paradigm for large-scale pretraining of language models.
We propose an encoding-enhanced seq2seq pretraining strategy, namely E2S2.
E2S2 improves seq2seq models by integrating more effective self-supervised information into the encoders.
arXiv Detail & Related papers (2022-05-30T08:25:36Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
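The summary suggests discretizing speech into compact pseudo tokens; a rough sketch of one way to do that (nearest-centroid assignment against a k-means-style codebook, then collapsing repeats; the sizes and the codebook itself are made up) is:

```python
import numpy as np

def pseudo_language(features, codebook):
    """Map each feature frame to its nearest codebook entry, then collapse
    consecutive repeats into a compact discrete pseudo-token sequence."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    units = dists.argmin(axis=1)
    return [int(units[0])] + [int(u) for prev, u in zip(units, units[1:]) if u != prev]

rng = np.random.default_rng(0)
feats = rng.normal(size=(120, 39)).astype(np.float32)    # hypothetical frames
codebook = rng.normal(size=(25, 39)).astype(np.float32)  # e.g. k-means centroids
print(pseudo_language(feats, codebook))
```

These pseudo-token sequences can then serve as transcripts for a self-supervised pseudo speech recognition task.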
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems [31.18865184576272]
This work is a step towards aligning speech embeddings and BERT embeddings in a much more efficient and fine-grained manner, on a token-by-token basis.
We introduce a simple yet novel technique that uses a cross-modal attention mechanism to extract token-level contextual embeddings from a speech encoder.
Fine-tuning such a pretrained model to perform intent recognition using speech directly yields state-of-the-art performance on two widely used SLU datasets.
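A bare-bones illustration of token-level cross-modal attention follows (single head, no learned projections, random stand-in data; the actual technique adds contrastive training and real encoders):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(token_queries, speech_frames):
    """Each text-token query attends over all speech-encoder frames,
    pooling one speech-derived embedding per token."""
    scores = token_queries @ speech_frames.T / np.sqrt(token_queries.shape[-1])
    return softmax(scores) @ speech_frames

rng = np.random.default_rng(0)
bert_tokens = rng.normal(size=(7, 64))   # stand-ins for BERT token embeddings
frames = rng.normal(size=(150, 64))      # stand-ins for speech encodings
print(cross_modal_attention(bert_tokens, frames).shape)  # (7, 64)
```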
arXiv Detail & Related papers (2022-04-11T15:24:25Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
- Speech-language Pre-training for End-to-end Spoken Language Understanding [18.548949994603213]
We propose to unify a well-optimized E2E ASR encoder (speech) and a pre-trained language model encoder (language) into a transformer decoder.
The experimental results on two public corpora show that our approach to E2E SLU is superior to the conventional cascaded method.
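The precise fusion is not described in this summary; one simple way to let a single transformer decoder see both encoders (not necessarily the paper's mechanism) is to concatenate their output states into one cross-attention memory:

```python
import numpy as np

def fused_memory(speech_states, text_states):
    """Concatenate speech-encoder and LM-encoder outputs along the time
    axis so one decoder can cross-attend over both modalities."""
    assert speech_states.shape[-1] == text_states.shape[-1], "project to a shared size first"
    return np.concatenate([speech_states, text_states], axis=0)

speech = np.random.randn(120, 64)  # hypothetical acoustic states
text = np.random.randn(20, 64)     # hypothetical LM states
print(fused_memory(speech, text).shape)  # (140, 64)
```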
arXiv Detail & Related papers (2021-02-11T21:55:48Z)
- End-to-End Spoken Language Understanding Without Full Transcripts [38.19173637496798]
We develop end-to-end (E2E) spoken language understanding systems that directly convert speech input to semantic entities.
We create two types of such speech-to-entities models, a CTC model and an attention-based encoder-decoder model.
For our speech-to-entities experiments on the ATIS corpus, both the CTC and attention models showed impressive ability to skip non-entity words.
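To make concrete what skipping non-entity words means for the training targets, here is a hypothetical target-construction helper (the span format and label names are invented for illustration):

```python
def entity_only_target(transcript, entity_spans):
    """Keep only the entity words from a transcript, mirroring the
    entity-sequence targets that speech-to-entities models learn to emit."""
    words = transcript.split()
    kept = []
    for start, end, label in entity_spans:  # inclusive word indices
        kept.append(f"{label}:{' '.join(words[start:end + 1])}")
    return " ".join(kept)

utt = "i want to fly from dallas to boston on friday"
spans = [(5, 5, "fromloc"), (7, 7, "toloc"), (9, 9, "day_name")]
print(entity_only_target(utt, spans))
# fromloc:dallas toloc:boston day_name:friday
```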
arXiv Detail & Related papers (2020-09-30T01:54:13Z)