Two-stage Textual Knowledge Distillation for End-to-End Spoken Language Understanding
- URL: http://arxiv.org/abs/2010.13105v2
- Date: Thu, 10 Jun 2021 11:09:38 GMT
- Title: Two-stage Textual Knowledge Distillation for End-to-End Spoken Language Understanding
- Authors: Seongbin Kim, Gyuwan Kim, Seongjin Shin, Sangmin Lee
- Abstract summary: This work proposes a two-stage textual knowledge distillation method that matches utterance-level representations and predicted logits of two modalities during pre-training and fine-tuning.
We push the state-of-the-art on the Fluent Speech Commands, achieving 99.7% test accuracy in the full dataset setting and 99.5% in the 10% subset setting.
- Score: 18.275646344620387
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end approaches open a new way for more accurate and efficient spoken
language understanding (SLU) systems by alleviating the drawbacks of
traditional pipeline systems. Previous works exploit textual information for an
SLU model via pre-training with automatic speech recognition or fine-tuning
with knowledge distillation. To utilize textual information more effectively,
this work proposes a two-stage textual knowledge distillation method that
matches utterance-level representations and predicted logits of two modalities
during pre-training and fine-tuning, sequentially. We use vq-wav2vec BERT as a
speech encoder because it captures general and rich features. Furthermore, we
improve the performance, especially in a low-resource scenario, with data
augmentation methods by randomly masking spans of discrete audio tokens and
contextualized hidden representations. Consequently, we push the
state-of-the-art on the Fluent Speech Commands, achieving 99.7% test accuracy
in the full dataset setting and 99.5% in the 10% subset setting. Throughout the
ablation studies, we empirically verify that all used methods are crucial to
the final performance, providing the best practice for spoken language
understanding. Code is available at https://github.com/clovaai/textual-kd-slu.
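The two losses named in the abstract (utterance-level representation matching during pre-training, logit distillation during fine-tuning) and the span-masking augmentation can be summarized in a short sketch. The snippet below is an illustrative outline only; function names, loss weighting, and hyperparameters are assumptions and do not reflect the released clovaai/textual-kd-slu code.

```python
# Hedged sketch of the two-stage textual KD objectives and span masking.
# All names and defaults here are illustrative assumptions.
import torch
import torch.nn.functional as F


def stage1_representation_loss(speech_utt_repr, text_utt_repr):
    """Pre-training stage: match the speech encoder's utterance-level
    representation to the text (teacher) representation."""
    return F.mse_loss(speech_utt_repr, text_utt_repr.detach())


def stage2_logit_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Fine-tuning stage: distill the teacher's predicted logits while also
    fitting the ground-truth intent labels."""
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce


def mask_token_spans(tokens, mask_id, num_spans=2, span_len=10):
    """Augmentation: randomly mask contiguous spans of discrete audio tokens
    (the same idea applies to contextualized hidden representations)."""
    tokens = tokens.clone()
    seq_len = tokens.size(-1)
    for _ in range(num_spans):
        start = torch.randint(0, max(1, seq_len - span_len), (1,)).item()
        tokens[..., start:start + span_len] = mask_id
    return tokens
```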
Related papers
- Active Learning with Task Adaptation Pre-training for Speech Emotion Recognition [17.59356583727259]
Speech emotion recognition (SER) has garnered increasing attention due to its wide range of applications.
We propose an active learning (AL)-based fine-tuning framework for SER, called After.
Our proposed method improves accuracy by 8.45% and reduces time consumption by 79%.
arXiv Detail & Related papers (2024-05-01T04:05:29Z)
- End-to-End Speech Recognition Contextualization with Large Language Models [25.198480789044346]
We introduce a novel method for contextualizing speech recognition models by incorporating Large Language Models (LLMs).
We provide audio features, along with optional text tokens for context, to train the system to complete transcriptions in a decoder-only fashion.
Our empirical results demonstrate a significant improvement in performance, with a 6% WER reduction when additional textual context is provided.
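The decoder-only setup described here can be sketched as prepending context-token embeddings and projected audio features to the transcript and training the LLM to complete it. The class below is an assumption-laden illustration (it presumes a Hugging Face-style causal LM interface), not this paper's architecture.

```python
# Illustrative sketch: audio features + optional context tokens prefix a
# decoder-only LM that is trained to complete the transcription.
import torch
import torch.nn as nn


class ContextualizedASR(nn.Module):
    def __init__(self, llm, audio_dim, llm_dim):
        super().__init__()
        self.llm = llm                              # causal LM (HF-style interface assumed)
        self.audio_proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_feats, context_ids, transcript_ids):
        embed = self.llm.get_input_embeddings()
        audio = self.audio_proj(audio_feats)                     # (B, T_audio, llm_dim)
        prefix = torch.cat([embed(context_ids), audio], dim=1)   # context + audio prefix
        inputs = torch.cat([prefix, embed(transcript_ids)], dim=1)
        logits = self.llm(inputs_embeds=inputs).logits
        # Only transcript positions contribute to the loss; logits at position
        # i predict the token at position i + 1.
        shift = logits[:, prefix.size(1) - 1:-1, :]
        return nn.functional.cross_entropy(
            shift.reshape(-1, shift.size(-1)), transcript_ids.reshape(-1)
        )
```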
arXiv Detail & Related papers (2023-09-19T20:28:57Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own or can be applied as a low-cost second-stage pre-training step.
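One way to picture the pseudo-language induction is clustering frame-level speech features into discrete units and collapsing repeats, which then serve as targets of a "pseudo ASR" task. The helper below is a loose sketch under those assumptions, not the Wav2Seq recipe.

```python
# Hedged sketch of pseudo-token induction via clustering; in practice the
# codebook would be fit on a large corpus, not a single utterance.
import numpy as np
from sklearn.cluster import KMeans


def induce_pseudo_tokens(frame_features, n_units=500):
    """frame_features: (num_frames, feat_dim) self-supervised speech features."""
    km = KMeans(n_clusters=n_units, n_init=10).fit(frame_features)
    units = km.predict(frame_features)
    # Collapse consecutive duplicates to obtain a compact discrete "sentence"
    # that an encoder-decoder model can be trained to transcribe.
    compact = [units[0]] + [u for prev, u in zip(units, units[1:]) if u != prev]
    return np.array(compact)
```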
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Unified Speech-Text Pre-training for Speech Translation and Recognition [113.31415771943162]
We describe a method to jointly pre-train speech and text in an encoder-decoder modeling framework for speech translation and recognition.
The proposed method incorporates four self-supervised and supervised subtasks for cross modality learning.
It achieves a 1.7 to 2.3 BLEU improvement over the state of the art on the MuST-C speech translation dataset.
arXiv Detail & Related papers (2022-04-11T20:59:51Z)
- Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems [31.18865184576272]
This work takes a step toward more efficient and fine-grained cross-modal alignment by aligning speech embeddings and BERT embeddings on a token-by-token basis.
We introduce a simple yet novel technique that uses a cross-modal attention mechanism to extract token-level contextual embeddings from a speech encoder.
Fine-tuning such a pretrained model to perform intent recognition using speech directly yields state-of-the-art performance on two widely used SLU datasets.
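The token-level alignment idea can be sketched as cross-attention with BERT token embeddings as queries over speech-encoder frames, followed by a contrastive loss that pulls matched pairs together. Module and loss names below are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of token-wise cross-modal alignment with a contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenwiseAligner(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_tokens, speech_frames):
        # text_tokens: (B, N, D) BERT embeddings; speech_frames: (B, T, D).
        # Each text token queries the speech frames to produce a token-level
        # speech embedding aligned with it.
        speech_tokens, _ = self.cross_attn(text_tokens, speech_frames, speech_frames)
        return speech_tokens                                     # (B, N, D)


def tokenwise_contrastive_loss(speech_tokens, text_tokens, temperature=0.07):
    s = F.normalize(speech_tokens.flatten(0, 1), dim=-1)         # (B*N, D)
    t = F.normalize(text_tokens.flatten(0, 1), dim=-1)
    logits = s @ t.t() / temperature
    labels = torch.arange(s.size(0), device=s.device)            # matched pairs on the diagonal
    return F.cross_entropy(logits, labels)
```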
arXiv Detail & Related papers (2022-04-11T15:24:25Z)
- Attentive Contextual Carryover for Multi-Turn End-to-End Spoken Language Understanding [14.157311972146692]
We propose a contextual E2E SLU model architecture that uses a multi-head attention mechanism over encoded previous utterances and dialogue acts.
Our method reduces average word and semantic error rates by 10.8% and 12.6%, respectively.
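A compact way to picture the carryover mechanism is attention from the current utterance over encodings of previous utterances and dialogue acts, fused back into the utterance representation. The module below is a hedged sketch; names, fusion, and shapes are assumptions rather than the paper's exact design.

```python
# Hedged sketch of attentive context carryover for multi-turn SLU.
import torch
import torch.nn as nn


class ContextCarryover(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, utt_frames, prev_utt_encodings, dialog_act_embs):
        # utt_frames: (B, T, D); the memory holds prior utterances and acts.
        memory = torch.cat([prev_utt_encodings, dialog_act_embs], dim=1)   # (B, M, D)
        context, _ = self.attn(utt_frames, memory, memory)
        return self.fuse(torch.cat([utt_frames, context], dim=-1))
```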
arXiv Detail & Related papers (2021-12-13T15:49:36Z)
- Intent Classification Using Pre-Trained Embeddings For Low Resource Languages [67.40810139354028]
Building Spoken Language Understanding systems that do not rely on language-specific Automatic Speech Recognition is an important yet less explored problem in language processing.
We present a comparative study aimed at employing a pre-trained acoustic model to perform Spoken Language Understanding in low resource scenarios.
We perform experiments across three languages: English, Sinhala, and Tamil, each with a different data size to simulate high-, medium-, and low-resource scenarios.
arXiv Detail & Related papers (2021-10-18T13:06:59Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
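The target-switching idea can be made concrete at the loss level: each view's contrastive loss uses the other view's quantized representation as the positive. The code below is an illustrative stand-in for wav2vec 2.0's InfoNCE objective; shapes and names are assumptions.

```python
# Sketch of "switched" contrastive targets for original/noisy speech pairs.
import torch
import torch.nn.functional as F


def contrastive_loss(context, positives, negatives, temperature=0.1):
    # context, positives: (B, T, D); negatives: (B, T, K, D)
    pos = F.cosine_similarity(context, positives, dim=-1).unsqueeze(-1)      # (B, T, 1)
    neg = F.cosine_similarity(context.unsqueeze(2), negatives, dim=-1)       # (B, T, K)
    logits = torch.cat([pos, neg], dim=-1) / temperature
    labels = torch.zeros(logits.shape[:-1], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())


def switched_loss(ctx_clean, q_clean, ctx_noisy, q_noisy, negatives):
    # Standard targets plus switched targets: the clean context predicts the
    # noisy quantization and vice versa, encouraging noise-invariant features.
    std = contrastive_loss(ctx_clean, q_clean, negatives) + contrastive_loss(ctx_noisy, q_noisy, negatives)
    swapped = contrastive_loss(ctx_clean, q_noisy, negatives) + contrastive_loss(ctx_noisy, q_clean, negatives)
    return std + swapped
```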
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
- End-to-End Spoken Language Understanding for Generalized Voice Assistants [15.241812584273886]
We present our approach to developing an E2E model for generalized speech recognition in commercial voice assistants (VAs).
We propose a fully differentiable, transformer-based, hierarchical system that can be pretrained at both the ASR and NLU levels.
This is then fine-tuned on both transcription and semantic classification losses to handle a diverse set of intent and argument combinations.
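The joint fine-tuning objective mentioned here can be sketched as a weighted sum of a transcription loss on the ASR-level outputs and a classification loss on the NLU-level outputs. The function below is an illustrative assumption (CTC for transcription, cross-entropy for intents), not the paper's exact formulation.

```python
# Hedged sketch of a joint transcription + semantic classification loss.
import torch.nn.functional as F


def joint_finetune_loss(ctc_log_probs, transcripts, input_lens, target_lens,
                        intent_logits, intent_labels, ctc_weight=0.5):
    # ctc_log_probs: (T, B, vocab) log-probabilities from the ASR-level head.
    ctc = F.ctc_loss(ctc_log_probs, transcripts, input_lens, target_lens)
    cls = F.cross_entropy(intent_logits, intent_labels)
    return ctc_weight * ctc + (1.0 - ctc_weight) * cls
```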
arXiv Detail & Related papers (2021-06-16T17:56:47Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)