Leveraging Pretrained ASR Encoders for Effective and Efficient
End-to-End Speech Intent Classification and Slot Filling
- URL: http://arxiv.org/abs/2307.07057v1
- Date: Thu, 13 Jul 2023 20:50:19 GMT
- Title: Leveraging Pretrained ASR Encoders for Effective and Efficient
End-to-End Speech Intent Classification and Slot Filling
- Authors: He Huang, Jagadeesh Balam and Boris Ginsburg
- Abstract summary: We propose to use an encoder pretrained on speech recognition (ASR) to initialize an end-to-end (E2E) Conformer-Transformer model.
Our model achieves the new state-of-the-art results on the SLURP dataset, with 90.14% intent accuracy and 82.27% SLURP-F1.
- Score: 13.515248068374625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study speech intent classification and slot filling (SICSF) by proposing
to use an encoder pretrained on speech recognition (ASR) to initialize an
end-to-end (E2E) Conformer-Transformer model, which achieves the new
state-of-the-art results on the SLURP dataset, with 90.14% intent accuracy and
82.27% SLURP-F1. We compare our model with encoders pretrained on
self-supervised learning (SSL), and show that ASR pretraining is much more
effective than SSL for SICSF. To explore parameter efficiency, we freeze the
encoder and add Adapter modules, and show that parameter efficiency is only
achievable with an ASR-pretrained encoder, while the SSL encoder needs full
finetuning to achieve comparable results. In addition, we provide an in-depth
comparison of end-to-end models versus cascading models (ASR+NLU), and show
that E2E models outperform cascading models unless an oracle ASR model is
provided. Last but not least, our model is the first E2E model that achieves
the same performance as cascading models with oracle ASR. Code, checkpoints and
configs are available.
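A minimal PyTorch sketch of the parameter-efficient recipe described in the abstract: freeze the ASR-pretrained encoder, insert small residual bottleneck adapters after each encoder layer, and train only the adapters and a Transformer decoder. The stand-in encoder, adapter design, dimensions, and semantic-token formulation below are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter (illustrative; sizes are assumptions)."""
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class AdaptedLayer(nn.Module):
    """Frozen pretrained encoder layer followed by a trainable adapter."""
    def __init__(self, layer: nn.Module, d_model: int):
        super().__init__()
        self.layer = layer
        self.adapter = Adapter(d_model)

    def forward(self, x):
        return self.adapter(self.layer(x))

d_model = 256

# Stand-in for the ASR-pretrained Conformer encoder (plain Transformer layers here).
encoder_layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True) for _ in range(4)]
)
for p in encoder_layers.parameters():            # freeze the pretrained encoder
    p.requires_grad = False

adapted_encoder = nn.Sequential(*[AdaptedLayer(l, d_model) for l in encoder_layers])

# Transformer decoder that emits the semantics (intent + slots) as a token sequence.
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=3
)
vocab = 1000                                      # assumed semantic-token vocabulary size
embed = nn.Embedding(vocab, d_model)
out_proj = nn.Linear(d_model, vocab)

speech = torch.randn(2, 120, d_model)             # (batch, frames, features) after a frontend
targets = torch.randint(0, vocab, (2, 20))        # flattened intent/slot token ids

memory = adapted_encoder(speech)
logits = out_proj(decoder(embed(targets), memory))
print(logits.shape)                               # torch.Size([2, 20, 1000])

trainable = sum(
    p.numel()
    for p in list(adapted_encoder.parameters()) + list(decoder.parameters())
    if p.requires_grad
)
print(f"trainable params (adapters + decoder only): {trainable}")
```

Because the frozen encoder contributes no gradients, only the adapters and the decoder are updated, which is the sense in which the abstract's parameter efficiency is measured.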
Related papers
- Efficient infusion of self-supervised representations in Automatic Speech Recognition [1.2972104025246092]
Self-supervised learned (SSL) models such as Wav2vec and HuBERT yield state-of-the-art results on speech-related tasks.
We propose two simple approaches that use (1) framewise addition and (2) cross-attention mechanisms to efficiently incorporate the representations from the SSL model into the ASR architecture.
Our approach results in faster training and yields significant performance gains on the Librispeech and Tedlium datasets.
arXiv Detail & Related papers (2024-04-19T05:01:12Z)
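A minimal sketch of the two fusion mechanisms named above, (1) framewise addition and (2) cross-attention, assuming the SSL features (e.g., from a frozen HuBERT model) are frame-aligned with the ASR encoder features; the projection, dimensions, and residual combination are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model = 256
proj = nn.Linear(768, d_model)                    # project SSL features (e.g. 768-dim) to ASR width
cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

asr_feats = torch.randn(2, 100, d_model)          # frames from the ASR encoder
ssl_feats = torch.randn(2, 100, 768)              # frames from a frozen SSL model

# (1) framewise addition: add the projected SSL frames to the ASR frames
fused_add = asr_feats + proj(ssl_feats)

# (2) cross-attention: ASR frames query the SSL representations
ssl_proj = proj(ssl_feats)
attended, _ = cross_attn(query=asr_feats, key=ssl_proj, value=ssl_proj)
fused_xattn = asr_feats + attended                # residual combination (an assumption)

print(fused_add.shape, fused_xattn.shape)
```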
- Complexity Matters: Rethinking the Latent Space for Generative Modeling [65.64763873078114]
In generative modeling, numerous successful approaches leverage a low-dimensional latent space, e.g., Stable Diffusion.
In this study, we aim to shed light on this under-explored topic by rethinking the latent space from the perspective of model complexity.
arXiv Detail & Related papers (2023-07-17T07:12:29Z)
- Transfer Learning from Pre-trained Language Models Improves End-to-End
Speech Summarization [48.35495352015281]
End-to-end speech summarization (E2E SSum) directly summarizes input speech into easy-to-read short sentences with a single model.
Due to the high cost of collecting speech-summary pairs, an E2E SSum model tends to suffer from training data scarcity and output unnatural sentences.
We propose for the first time to integrate a pre-trained language model (LM) into the E2E SSum decoder via transfer learning.
arXiv Detail & Related papers (2023-06-07T08:23:58Z)
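One hedged way to realize this kind of LM-to-decoder transfer is the Hugging Face transformers library's SpeechEncoderDecoderModel, which couples a pretrained speech encoder with a pretrained LM used as the decoder; the model names below are illustrative assumptions and not the paper's actual configuration.

```python
# Hedged sketch only: couples a pretrained speech encoder with a pretrained LM decoder.
# The chosen checkpoints are assumptions; cross-attention layers are newly initialized
# when the decoder-only LM is wired into the encoder-decoder model.
from transformers import SpeechEncoderDecoderModel

model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
    "facebook/wav2vec2-base-960h",   # pretrained speech encoder (assumption)
    "gpt2",                          # pretrained LM reused as the summarization decoder (assumption)
)
# The decoder starts from strong language priors rather than learning to write
# summaries from scratch, which is the point of the transfer-learning setup above.
print(sum(p.numel() for p in model.parameters()))
```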
- A Lexical-aware Non-autoregressive Transformer-based ASR Model [9.500518278458905]
We propose a lexical-aware non-autoregressive Transformer-based (LA-NAT) ASR framework, which consists of an acoustic encoder, a speech-text shared encoder, and a speech-text shared decoder.
LA-NAT aims to make the ASR model aware of lexical information, so the resulting model is expected to achieve better results by leveraging the learned linguistic knowledge.
arXiv Detail & Related papers (2023-05-18T09:50:47Z)
- LegoNet: A Fast and Exact Unlearning Architecture [59.49058450583149]
Machine unlearning aims to erase the impact of specific training samples from a trained model upon deletion requests.
We present a novel network, namely LegoNet, which adopts the framework of "fixed encoder + multiple adapters".
We show that LegoNet accomplishes fast and exact unlearning while maintaining acceptable performance, synthetically outperforming unlearning baselines.
arXiv Detail & Related papers (2022-10-28T09:53:05Z)
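A minimal sketch of the "fixed encoder + multiple adapters" framework, under the assumption of a shard-per-adapter setup where each adapter is trained on a disjoint data shard, so honoring a deletion request only requires resetting and retraining the one adapter that saw the sample; sizes and the aggregation rule are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small trainable head on top of a fixed encoder (illustrative)."""
    def __init__(self, d_in: int = 128, n_classes: int = 10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, n_classes))

    def forward(self, z):
        return self.net(z)

encoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU())   # stand-in for the fixed, pretrained encoder
for p in encoder.parameters():
    p.requires_grad = False

num_shards = 4
adapters = nn.ModuleList([Adapter() for _ in range(num_shards)])  # adapter k only ever sees shard k

def predict(x):
    z = encoder(x)
    # aggregate shard predictions (average of logits is an assumed rule)
    return torch.stack([a(z) for a in adapters]).mean(dim=0)

def unlearn(sample_shard: int):
    """Exact unlearning: reset and retrain only the adapter that saw the deleted sample."""
    adapters[sample_shard] = Adapter()
    # ...retrain adapters[sample_shard] on its shard minus the deleted sample...

x = torch.randn(2, 32)
print(predict(x).shape)   # torch.Size([2, 10])
unlearn(sample_shard=2)
```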
- Joint Encoder-Decoder Self-Supervised Pre-training for ASR [0.0]
Self-supervised learning has shown tremendous success in various speech-related downstream tasks.
In this paper, we propose a new paradigm that exploits the power of a decoder during self-supervised learning.
arXiv Detail & Related papers (2022-06-09T12:45:29Z)
- E2S2: Encoding-Enhanced Sequence-to-Sequence Pretraining for Language
Understanding and Generation [95.49128988683191]
Sequence-to-sequence (seq2seq) learning is a popular approach for large-scale pretraining of language models.
We propose an encoding-enhanced seq2seq pretraining strategy, namely E2S2.
E2S2 improves the seq2seq models via integrating more efficient self-supervised information into the encoders.
arXiv Detail & Related papers (2022-05-30T08:25:36Z)
- Consistent Training and Decoding For End-to-end Speech Recognition Using
Lattice-free MMI [67.13999010060057]
We propose a novel approach to integrate the LF-MMI criterion into E2E ASR frameworks in both training and decoding stages.
Experiments suggest that the introduction of the LF-MMI criterion consistently leads to significant performance improvements.
arXiv Detail & Related papers (2021-12-05T07:30:17Z)
- Transformer-based ASR Incorporating Time-reduction Layer and Fine-tuning
with Self-Knowledge Distillation [11.52842516726486]
We propose a Transformer-based ASR model with a time-reduction layer incorporated inside the Transformer encoder layers.
We also introduce a fine-tuning approach for pre-trained ASR models using self-knowledge distillation (S-KD) which further improves the performance of our ASR model.
With language model (LM) fusion, we achieve new state-of-the-art word error rate (WER) results for Transformer-based ASR models.
arXiv Detail & Related papers (2021-03-17T21:02:36Z)
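A time-reduction layer of this kind is commonly implemented by concatenating a fixed number of adjacent frames and projecting back to the model width, which shortens the sequence seen by later self-attention layers. The sketch below uses an assumed reduction factor and placement, not necessarily the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TimeReduction(nn.Module):
    """Concatenate `factor` adjacent frames and project back to d_model,
    reducing sequence length (and self-attention cost) by `factor`."""
    def __init__(self, d_model: int, factor: int = 2):
        super().__init__()
        self.factor = factor
        self.proj = nn.Linear(d_model * factor, d_model)

    def forward(self, x):                       # x: (batch, time, d_model)
        b, t, d = x.shape
        t = (t // self.factor) * self.factor    # drop trailing frames that do not fill a group
        x = x[:, :t, :].reshape(b, t // self.factor, d * self.factor)
        return self.proj(x)

d_model = 256
encoder_front = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
reduce = TimeReduction(d_model, factor=2)
encoder_back = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)

x = torch.randn(2, 100, d_model)
y = encoder_back(reduce(encoder_front(x)))
print(y.shape)                                  # torch.Size([2, 50, 256])
```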
- Encoding Syntactic Knowledge in Transformer Encoder for Intent Detection and Slot Filling [6.234581622120001]
We propose a novel Transformer encoder-based architecture with syntactical knowledge encoded for intent detection and slot filling.
We encode syntactic knowledge into the Transformer encoder by jointly training it to predict syntactic parse ancestors and part-of-speech of each token via multi-task learning.
arXiv Detail & Related papers (2020-12-21T21:25:11Z)
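The multi-task setup above can be pictured as a shared Transformer encoder with separate heads for intent, slot tags, and the auxiliary part-of-speech and parse-ancestor targets, trained with a weighted sum of cross-entropy losses; label counts, head shapes, and loss weights below are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, n_intents, n_slots, n_pos, max_ancestor = 256, 20, 40, 17, 64

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=4
)
intent_head = nn.Linear(d_model, n_intents)       # utterance-level intent
slot_head = nn.Linear(d_model, n_slots)           # per-token slot tags
pos_head = nn.Linear(d_model, n_pos)              # auxiliary: per-token POS tags
ancestor_head = nn.Linear(d_model, max_ancestor)  # auxiliary: per-token parse-ancestor position

tokens = torch.randn(2, 12, d_model)              # embedded token sequence (batch, len, dim)
h = encoder(tokens)

ce = nn.CrossEntropyLoss()
intent_loss = ce(intent_head(h.mean(dim=1)), torch.randint(0, n_intents, (2,)))
slot_loss = ce(slot_head(h).transpose(1, 2), torch.randint(0, n_slots, (2, 12)))
pos_loss = ce(pos_head(h).transpose(1, 2), torch.randint(0, n_pos, (2, 12)))
anc_loss = ce(ancestor_head(h).transpose(1, 2), torch.randint(0, max_ancestor, (2, 12)))

# weighted multi-task objective (weights are assumptions)
loss = intent_loss + slot_loss + 0.5 * pos_loss + 0.5 * anc_loss
loss.backward()
print(float(loss))
```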
- Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)
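A minimal sketch of the deliberation idea: the first-pass hypothesis is re-encoded with a bidirectional encoder, and the second-pass decoder attends to both the acoustic encodings and the hypothesis encodings before rescoring. Concatenating the two memories, the dimensions, and the module choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model = 256
hyp_embed = nn.Embedding(1000, d_model)           # first-pass hypothesis tokens
hyp_encoder = nn.LSTM(d_model, d_model // 2, batch_first=True, bidirectional=True)
second_pass = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
)
out_proj = nn.Linear(d_model, 1000)

acoustic = torch.randn(2, 100, d_model)           # encoder output over audio frames
hypothesis = torch.randint(0, 1000, (2, 15))      # tokens from the first (streaming) pass

hyp_ctx, _ = hyp_encoder(hyp_embed(hypothesis))   # bidirectional context over the hypothesis
memory = torch.cat([acoustic, hyp_ctx], dim=1)    # decoder attends to acoustics AND hypotheses

targets = torch.randint(0, 1000, (2, 15))         # tokens being rescored / re-decoded
logits = out_proj(second_pass(hyp_embed(targets), memory))
print(logits.shape)                               # torch.Size([2, 15, 1000])
```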
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.