Knowledge Distillation from BERT Transformer to Speech Transformer for
Intent Classification
- URL: http://arxiv.org/abs/2108.02598v1
- Date: Thu, 5 Aug 2021 13:08:13 GMT
- Title: Knowledge Distillation from BERT Transformer to Speech Transformer for
Intent Classification
- Authors: Yidi Jiang, Bidisha Sharma, Maulik Madhavi, and Haizhou Li
- Abstract summary: We exploit a transformer distillation method specifically designed for knowledge distillation from a transformer-based language model to a transformer-based speech model.
We achieve intent classification accuracies of 99.10% and 88.79% on the Fluent speech corpus and the ATIS database, respectively.
- Score: 66.62686601948455
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: End-to-end intent classification using speech has numerous advantages
over the conventional pipeline approach of automatic speech recognition (ASR)
followed by natural language processing modules. It attempts
to predict intent from speech without using an intermediate ASR module.
However, such an end-to-end framework suffers from the scarcity of large
speech resources with high acoustic variation for spoken language
understanding. In this work, we exploit a transformer distillation method
specifically designed for knowledge distillation from a transformer-based
language model to a transformer-based speech model. In
this regard, we leverage the reliable and widely used bidirectional encoder
representations from transformers (BERT) model as a language model and transfer
the knowledge to build an acoustic model for intent classification from
speech. In particular, a multilevel transformer-based teacher-student model is
designed, and knowledge distillation is performed across attention and hidden
sub-layers of different transformer layers of the student and teacher models.
We achieve intent classification accuracies of 99.10% and 88.79% on the Fluent
speech corpus and the ATIS database, respectively. Further, the proposed method
demonstrates better performance and robustness under acoustically degraded
conditions than the baseline method.
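To make the distillation objective concrete, the following is a minimal PyTorch sketch of layer-wise distillation across attention and hidden sub-layers, in the spirit of the method described above. The function name, the projection `proj`, and the assumption that student and teacher sequences have been aligned to a common length are illustrative choices of this sketch, not the authors' implementation.

```python
import torch.nn.functional as F

def layerwise_distillation_loss(student_attn, student_hidden,
                                teacher_attn, teacher_hidden, proj):
    """MSE distillation across attention and hidden sub-layers (a sketch).

    student_attn / teacher_attn: lists of attention maps for matched layers,
        each of shape (batch, heads, seq, seq); this sketch assumes student
        and teacher sequences were aligned to a common length beforehand.
    student_hidden / teacher_hidden: lists of hidden states, of shapes
        (batch, seq, d_student) and (batch, seq, d_teacher).
    proj: torch.nn.Linear(d_student, d_teacher), an assumed projection used
        when the two models have different hidden widths.
    """
    loss = 0.0
    for s_a, t_a in zip(student_attn, teacher_attn):
        loss = loss + F.mse_loss(s_a, t_a)        # attention sub-layer transfer
    for s_h, t_h in zip(student_hidden, teacher_hidden):
        loss = loss + F.mse_loss(proj(s_h), t_h)  # hidden sub-layer transfer
    return loss
```

In training, such an auxiliary loss would typically be added to the cross-entropy intent classification loss with a weighting hyperparameter.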
Related papers
- Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement [17.645026729525462]
We propose a transformer-based end-to-end model to extract a target speaker's speech from a mixed audio signal.
Our experiments show that the use of a dual-path transformer in the separator backbone, along with the proposed training paradigm, improves the CNN baseline by 3.12 dB.
arXiv Detail & Related papers (2024-09-02T16:11:12Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments on the VoxCeleb and SITW datasets, with average reductions of 9.56% in EER and 8.24% in minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- Multi-View Self-Attention Based Transformer for Speaker Recognition [33.21173007319178]
The Transformer model is widely used for speech processing tasks such as speaker recognition.
We propose a novel multi-view self-attention mechanism for speaker Transformer.
We show that the proposed speaker Transformer network attains excellent results compared with state-of-the-art models.
arXiv Detail & Related papers (2021-10-11T07:03:23Z)
- Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, the factorized neural Transducer, which factorizes the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
- Investigation of Speaker-adaptation methods in Transformer based ASR [8.637110868126548]
This paper explores different ways of incorporating speaker information at the encoder input while training a transformer-based model to improve its speech recognition performance.
We represent speaker information as an embedding for each speaker.
We obtain improvements in the word error rate over the baseline through our approach of integrating speaker embeddings into the model.
arXiv Detail & Related papers (2020-08-07T16:09:03Z)
- Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer (a sketch of one such scheme appears after this list).
As a result, the network can better adapt to the variable distributions present in speech data.
arXiv Detail & Related papers (2020-05-20T09:53:06Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- End-to-End Whisper to Natural Speech Conversion using Modified Transformer Network [0.8399688944263843]
We introduce whisper-to-natural-speech conversion using a sequence-to-sequence approach.
We investigate different features such as Mel-frequency cepstral coefficients and smoothed spectral features.
The proposed networks are trained end-to-end using a supervised approach for feature-to-feature transformation.
arXiv Detail & Related papers (2020-04-20T14:47:46Z)
- Transformer-based language modeling and decoding for conversational speech recognition [0.0]
We focus on decoding efficiently in a weighted finite-state transducer framework.
We showcase an approach to lattice re-scoring that allows longer-range history to be captured by a transformer-based language model.
arXiv Detail & Related papers (2020-01-04T23:27:59Z)
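To illustrate the relative positional encoding idea referenced in the list above, here is a hedged sketch of one common scheme: a learned per-head bias added to the attention logits, indexed by clipped relative distance (T5-style). The cited paper's exact formulation may differ; `max_rel_dist`, the clipping, and all names are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class RelativeSelfAttention(nn.Module):
    """Self-attention with a learned per-head relative position bias (a sketch)."""

    def __init__(self, d_model, num_heads, max_rel_dist=64):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d = num_heads, d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # one learned scalar per head for each clipped relative distance
        self.rel_bias = nn.Embedding(2 * max_rel_dist + 1, num_heads)
        self.max_rel_dist = max_rel_dist

    def forward(self, x):  # x: (batch, seq, d_model)
        b, t, _ = x.shape
        q, k, v = (z.view(b, t, self.h, self.d).transpose(1, 2)
                   for z in self.qkv(x).chunk(3, dim=-1))
        logits = q @ k.transpose(-2, -1) / self.d ** 0.5  # (b, h, t, t)
        # relative distances j - i, clipped to [-max_rel_dist, max_rel_dist]
        pos = torch.arange(t, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist,
                                                  self.max_rel_dist)
        # bias shape (t, t, h) -> (h, t, t), broadcast over the batch
        logits = logits + self.rel_bias(rel + self.max_rel_dist).permute(2, 0, 1)
        attn = logits.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(out)


# Example usage on a dummy 2-utterance batch of 50 frames:
# y = RelativeSelfAttention(d_model=256, num_heads=4)(torch.randn(2, 50, 256))
```

Because the bias depends only on the distance between positions, such a layer handles the variable sequence lengths typical of speech more gracefully than absolute positional encodings do.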