Sequential End-to-End Intent and Slot Label Classification and
Localization
- URL: http://arxiv.org/abs/2106.04660v1
- Date: Tue, 8 Jun 2021 19:53:04 GMT
- Title: Sequential End-to-End Intent and Slot Label Classification and
Localization
- Authors: Yiran Cao, Nihal Potdar, Anderson R. Avila
- Abstract summary: end-to-end (e2e) spoken language understanding (SLU) solutions have recently been proposed to decrease latency.
We propose a compact e2e SLU architecture for streaming scenarios, where chunks of the speech signal are processed continuously to predict intent and slot values.
Results show our model's ability to process the incoming speech signal, reaching accuracy as high as 98.97% for CTC and 98.78% for CTL on single-label classification.
- Score: 2.1684857243537334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human-computer interaction (HCI) is significantly impacted by delayed
responses from a spoken dialogue system. Hence, end-to-end (e2e) spoken
language understanding (SLU) solutions have recently been proposed to decrease
latency. Such approaches allow for the extraction of semantic information
directly from the speech signal, thus bypassing the need for a transcript from
an automatic speech recognition (ASR) system. In this paper, we propose a
compact e2e SLU architecture for streaming scenarios, where chunks of the
speech signal are processed continuously to predict intent and slot values. Our
model is based on a 3D convolutional neural network (3D-CNN) and a
unidirectional long short-term memory (LSTM). We compare the performance of two
alignment-free losses: the connectionist temporal classification (CTC) method
and its adapted version, namely connectionist temporal localization (CTL). The
latter performs not only classification but also localization of sequential
audio events. The proposed solution is evaluated on the Fluent Speech Commands
dataset, and results show our model's ability to process the incoming speech
signal, reaching accuracy as high as 98.97% for CTC and 98.78% for CTL on
single-label classification, and as high as 95.69% for CTC and 95.28% for CTL
on two-label prediction.
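
As a rough illustration of the setup described in the abstract, the sketch below (assuming PyTorch) pairs a 3D-CNN front end with a unidirectional LSTM and trains it with the standard CTC loss over intent/slot labels; the CTL variant is the authors' adaptation and is not reproduced here. All shapes, layer sizes, and the label vocabulary are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class StreamingSLU(nn.Module):
    def __init__(self, n_labels: int, hidden: int = 128):
        super().__init__()
        # 3D convolution over (chunks, frequency bins, frames) of the input window.
        self.cnn = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        # Unidirectional LSTM keeps the model causal, as streaming requires.
        self.lstm = nn.LSTM(input_size=16 * 20 * 5, hidden_size=hidden,
                            batch_first=True)
        self.head = nn.Linear(hidden, n_labels + 1)  # +1 for the CTC blank

    def forward(self, x, state=None):
        # x: (batch, 1, n_chunks, 40 mel bins, 10 frames per chunk) -- illustrative
        b, _, c, _, _ = x.shape
        z = self.cnn(x)                               # (b, 16, c, 20, 5)
        z = z.permute(0, 2, 1, 3, 4).reshape(b, c, -1)
        out, state = self.lstm(z, state)              # carry state across windows
        return self.head(out), state

model = StreamingSLU(n_labels=31)                     # hypothetical label vocabulary
x = torch.randn(2, 1, 8, 40, 10)                      # 2 utterances, 8 chunks each
logits, state = model(x)
log_probs = logits.log_softmax(-1).transpose(0, 1)    # (chunks, batch, classes)
targets = torch.tensor([[3, 17], [5, 9]])             # e.g. [intent, slot] labels
loss = nn.CTCLoss(blank=31)(
    log_probs, targets,
    input_lengths=torch.full((2,), 8),
    target_lengths=torch.full((2,), 2),
)
loss.backward()
```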
Related papers
- End-to-End Integration of Speech Separation and Voice Activity Detection for Low-Latency Diarization of Telephone Conversations [13.020158123538138]
Speech separation guided diarization (SSGD) performs diarization by first separating the speakers and then applying voice activity detection (VAD) on each separated stream.
We consider three state-of-the-art speech separation (SSep) algorithms and study their performance in online and offline scenarios.
We show that our best model achieves 8.8% DER on CALLHOME, which outperforms the current state-of-the-art end-to-end neural diarization model.
arXiv Detail & Related papers (2023-03-21T16:33:56Z) - Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using audio and visual modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z) - Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in the offline and online setups, respectively.
arXiv Detail & Related papers (2022-11-03T20:20:47Z) - Streaming End-to-End Multilingual Speech Recognition with Joint Language
Identification [14.197869575012925]
We propose to modify the structure of the cascaded-encoder-based recurrent neural network transducer (RNN-T) model by integrating a per-frame language identifier (LID) predictor; a minimal sketch of this idea follows the entry.
RNN-T with cascaded encoders can achieve streaming ASR with low latency using first-pass decoding with no right-context, and achieve lower word error rates (WERs) using second-pass decoding with longer right-context.
Experimental results on a voice search dataset with 9 language locales show that the proposed method achieves an average of 96.2% LID prediction accuracy and the same second-pass WER.
arXiv Detail & Related papers (2022-09-13T15:10:41Z) - Real-time Speaker counting in a cocktail party scenario using
- Real-time Speaker counting in a cocktail party scenario using Attention-guided Convolutional Neural Network [60.99112031408449]
We propose a real-time, single-channel attention-guided Convolutional Neural Network (CNN) to estimate the number of active speakers in overlapping speech.
The proposed system extracts higher-level information from the speech spectral content using a CNN model.
Experiments on simulated overlapping speech from the WSJ corpus show that the attention mechanism improves performance by almost 3% absolute over conventional temporal average pooling.
arXiv Detail & Related papers (2021-10-30T19:24:57Z) - Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained increasing attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR performance in low-latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z) - Boosting Continuous Sign Language Recognition via Cross Modality
Augmentation [135.30357113518127]
Continuous sign language recognition deals with unaligned video-text pairs.
We propose a novel architecture with cross modality augmentation.
The proposed framework can be easily extended to other existing CTC-based continuous SLR architectures.
arXiv Detail & Related papers (2020-10-11T15:07:50Z) - End-to-End Neural Transformer Based Spoken Language Understanding [14.736425160859284]
Spoken language understanding (SLU) refers to the process of inferring the semantic information from audio signals.
We introduce an end-to-end neural transformer-based SLU model that can predict the variable-length domain, intent, and slots embedded in an audio signal.
Our end-to-end transformer SLU predicts the domains, intents, and slots in the Fluent Speech Commands dataset with accuracy equal to 98.1%, 99.6%, and 99.6%, respectively.
arXiv Detail & Related papers (2020-08-12T22:58:20Z) - End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice
Activity Detection [48.80449801938696]
This paper integrates a voice activity detection function with end-to-end automatic speech recognition.
We focus on connectionist temporal classification (CTC) and its extension, the CTC/attention architecture.
We use runs of CTC blank labels as a cue for detecting speech segments with simple thresholding; a sketch of this idea follows the entry.
arXiv Detail & Related papers (2020-02-03T03:36:34Z)