A Streaming End-to-End Framework For Spoken Language Understanding
- URL: http://arxiv.org/abs/2105.10042v1
- Date: Thu, 20 May 2021 21:37:05 GMT
- Title: A Streaming End-to-End Framework For Spoken Language Understanding
- Authors: Nihal Potdar, Anderson R. Avila, Chao Xing, Dong Wang, Yiran Cao, Xiao
Chen
- Abstract summary: We propose a streaming end-to-end framework that can process multiple intentions in an online and incremental way.
We evaluate our solution on the Fluent Speech Commands dataset and the intent detection accuracy is about 97 % on all multi-intent settings.
- Score: 11.58499117295424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end spoken language understanding (SLU) has recently attracted
increasing interest. Compared to the conventional tandem-based approach that
combines speech recognition and language understanding as separate modules, the
new approach extracts users' intentions directly from the speech signals,
resulting in joint optimization and low latency. Such an approach, however, is
typically designed to process one intention at a time, which leads users to
take multiple rounds to fulfill their requirements while interacting with a
dialogue system. In this paper, we propose a streaming end-to-end framework
that can process multiple intentions in an online and incremental way. The
backbone of our framework is a unidirectional RNN trained with the
connectionist temporal classification (CTC) criterion. By this design, an
intention can be identified when sufficient evidence has been accumulated, and
multiple intentions can be identified sequentially. We evaluate our solution on
the Fluent Speech Commands (FSC) dataset and the intent detection accuracy is
about 97 % on all multi-intent settings. This result is comparable to the
performance of the state-of-the-art non-streaming models, but is achieved in an
online and incremental way. We also employ our model to a keyword spotting task
using the Google Speech Commands dataset and the results are also highly
promising.
Related papers
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL)
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z) - Integrating Self-supervised Speech Model with Pseudo Word-level Targets
from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z) - DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
arXiv Detail & Related papers (2024-01-02T07:40:12Z) - End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z) - LDNet: Unified Listener Dependent Modeling in MOS Prediction for
Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z) - Contextual Biasing of Language Models for Speech Recognition in
Goal-Oriented Conversational Agents [11.193867567895353]
Goal-oriented conversational interfaces are designed to accomplish specific tasks.
We propose a new architecture that utilizes context embeddings derived from BERT on sample utterances provided during inference time.
Our experiments show a word error rate (WER) relative reduction of 7% over non-contextual utterance-level NLM rescorers on goal-oriented audio datasets.
arXiv Detail & Related papers (2021-03-18T15:38:08Z) - Streaming end-to-end multi-talker speech recognition [34.76106500736099]
We propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition.
Our model employs the Recurrent Neural Network Transducer (RNN-T) as the backbone that can meet various latency constraints.
Based on experiments on the publicly available LibriSpeechMix dataset, we show that HEAT can achieve better accuracy compared with PIT.
arXiv Detail & Related papers (2020-11-26T06:28:04Z) - Discriminative Nearest Neighbor Few-Shot Intent Detection by
Transferring Natural Language Inference [150.07326223077405]
Few-shot learning is attracting much attention to mitigate data scarcity.
We present a discriminative nearest neighbor classification with deep self-attention.
We propose to boost the discriminative ability by transferring a natural language inference (NLI) model.
arXiv Detail & Related papers (2020-10-25T00:39:32Z) - End-to-end speech-to-dialog-act recognition [38.58540444573232]
We present an end-to-end model which directly converts speech into dialog acts without the deterministic transcription process.
In the proposed model, the dialog act recognition network is conjunct with an acoustic-to-word ASR model at its latent layer.
The entire network is fine-tuned in an end-to-end manner.
arXiv Detail & Related papers (2020-04-23T18:44:27Z) - Temporarily-Aware Context Modelling using Generative Adversarial
Networks for Speech Activity Detection [43.662221486962274]
We propose a novel joint learning framework for Speech Activity Detection (SAD)
We utilise generative adversarial networks to automatically learn a loss function for joint prediction of the frame-wise speech/ non-speech classifications together with the next audio segment.
We evaluate the proposed framework on multiple public benchmarks, including NIST OpenSAT' 17, AMI Meeting and HAVIC.
arXiv Detail & Related papers (2020-04-02T02:33:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.