Leveraging Acoustic and Linguistic Embeddings from Pretrained Speech and Language Models for Intent Classification
- URL: http://arxiv.org/abs/2102.07370v1
- Date: Mon, 15 Feb 2021 07:20:06 GMT
- Title: Leveraging Acoustic and Linguistic Embeddings from Pretrained Speech and Language Models for Intent Classification
- Authors: Bidisha Sharma, Maulik Madhavi and Haizhou Li
- Abstract summary: We propose a novel intent classification framework that employs acoustic features extracted from a pretrained speech recognition system and linguistic features learned from a pretrained language model.
We achieve 90.86% and 99.07% accuracy on the ATIS and Fluent Speech corpora, respectively.
- Score: 81.80311855996584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Intent classification is a task in spoken language understanding. An intent
classification system is usually implemented as a pipeline process, with a
speech recognition module followed by text processing that classifies the
intents. There are also studies of end-to-end systems that take acoustic
features as input and classify intents directly. Such systems do not take
advantage of relevant linguistic information and suffer from limited training
data. In this work, we propose a novel intent classification framework that
employs acoustic features extracted from a pretrained speech recognition system
and linguistic features learned from a pretrained language model. We use a
knowledge distillation technique to map the acoustic embeddings towards the
linguistic embeddings. We fuse the acoustic and linguistic embeddings through
a cross-attention approach to classify intents. With the proposed method, we
achieve 90.86% and 99.07% accuracy on the ATIS and Fluent Speech corpora,
respectively.
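The abstract names two concrete mechanisms: a knowledge distillation step that maps acoustic embeddings toward linguistic embeddings, and a cross-attention fusion of the two streams for intent classification. Below is a minimal PyTorch sketch of that pattern; the dimensions, pooling choices, loss, and module names are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistillAndFuse(nn.Module):
    """Illustrative sketch: distill acoustic embeddings toward linguistic
    embeddings, then fuse both streams with cross-attention to classify
    intents. Dimensions and defaults are assumptions for illustration."""

    def __init__(self, acoustic_dim=512, linguistic_dim=768, num_intents=31):
        super().__init__()
        # Map ASR-encoder outputs into the language-model embedding space.
        self.proj = nn.Linear(acoustic_dim, linguistic_dim)
        # Cross-attention: acoustic tokens (queries) attend to linguistic tokens.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=linguistic_dim, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(linguistic_dim, num_intents)

    def forward(self, acoustic_emb, linguistic_emb):
        # acoustic_emb:   (B, T_a, acoustic_dim), from a pretrained ASR encoder
        # linguistic_emb: (B, T_l, linguistic_dim), from a pretrained LM teacher
        mapped = self.proj(acoustic_emb)  # (B, T_a, linguistic_dim)

        # Distillation term: pull pooled acoustic embeddings toward pooled
        # linguistic embeddings; the teacher is treated as frozen (detach).
        distill_loss = F.mse_loss(
            mapped.mean(dim=1), linguistic_emb.mean(dim=1).detach())

        # Cross-attention fusion of the two embedding streams.
        fused, _ = self.cross_attn(
            query=mapped, key=linguistic_emb, value=linguistic_emb)
        logits = self.classifier(fused.mean(dim=1))  # (B, num_intents)
        return logits, distill_loss

model = DistillAndFuse()
logits, kd_loss = model(torch.randn(4, 120, 512), torch.randn(4, 20, 768))
```

At training time, a natural combination is classification cross-entropy plus a weighted distillation term, e.g. `loss = F.cross_entropy(logits, intents) + lam * kd_loss`, where the weight `lam` is a hypothetical hyperparameter; the paper's exact training objective may differ.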
Related papers
- Generalized zero-shot audio-to-intent classification [7.76114116227644]
We propose a generalized zero-shot audio-to-intent classification framework with only a few sample text sentences per intent.
We leverage a neural audio synthesizer to create audio embeddings for sample text utterances.
Our multimodal training approach improves the accuracy of zero-shot intent classification on unseen intents of SLURP by 2.75% and 18.2%.
arXiv Detail & Related papers (2023-11-04T18:55:08Z) - Improving Speaker Diarization using Semantic Information: Joint Pairwise
Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z) - Learning Speech Representation From Contrastive Token-Acoustic
Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR (a sketch of this style of contrastive alignment appears after this list).
arXiv Detail & Related papers (2023-09-01T12:35:43Z) - Deep Learning For Prominence Detection In Children's Read Speech [13.041607703862724]
We present a system that operates on segmented speech waveforms to learn features relevant to prominent word detection for children's oral fluency assessment.
The chosen CRNN (convolutional recurrent neural network) framework, incorporating both word-level features and sequence information, is found to benefit from the perceptually motivated SincNet filters.
arXiv Detail & Related papers (2021-10-27T08:51:42Z) - Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning
for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z) - Acoustics Based Intent Recognition Using Discovered Phonetic Units for
Low Resource Languages [51.0542215642794]
We propose a novel acoustics-based intent recognition system that uses discovered phonetic units for intent classification.
We present results for two language families, Indic languages and Romance languages, on two different intent recognition tasks.
arXiv Detail & Related papers (2020-11-07T00:35:31Z) - An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and
Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques; more recently, deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z) - Pretrained Semantic Speech Embeddings for End-to-End Spoken Language
Understanding via Cross-Modal Teacher-Student Learning [31.7865837105092]
We propose a novel training method that enables pretrained contextual embeddings to process acoustic features.
We extend it with an encoder of pretrained speech recognition systems in order to construct end-to-end spoken language understanding systems.
arXiv Detail & Related papers (2020-07-03T17:43:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.