Related papers: A Streaming End-to-End Framework For Spoken Language Understanding

A Streaming End-to-End Framework For Spoken Language Understanding

URL: http://arxiv.org/abs/2105.10042v1
Date: Thu, 20 May 2021 21:37:05 GMT
Title: A Streaming End-to-End Framework For Spoken Language Understanding
Authors: Nihal Potdar, Anderson R. Avila, Chao Xing, Dong Wang, Yiran Cao, Xiao Chen
Abstract summary: We propose a streaming end-to-end framework that can process multiple intentions in an online and incremental way. We evaluate our solution on the Fluent Speech Commands dataset and the intent detection accuracy is about 97 % on all multi-intent settings.
Score: 11.58499117295424
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: End-to-end spoken language understanding (SLU) has recently attracted increasing interest. Compared to the conventional tandem-based approach that combines speech recognition and language understanding as separate modules, the new approach extracts users' intentions directly from the speech signals, resulting in joint optimization and low latency. Such an approach, however, is typically designed to process one intention at a time, which leads users to take multiple rounds to fulfill their requirements while interacting with a dialogue system. In this paper, we propose a streaming end-to-end framework that can process multiple intentions in an online and incremental way. The backbone of our framework is a unidirectional RNN trained with the connectionist temporal classification (CTC) criterion. By this design, an intention can be identified when sufficient evidence has been accumulated, and multiple intentions can be identified sequentially. We evaluate our solution on the Fluent Speech Commands (FSC) dataset and the intent detection accuracy is about 97 % on all multi-intent settings. This result is comparable to the performance of the state-of-the-art non-streaming models, but is achieved in an online and incremental way. We also employ our model to a keyword spotting task using the Google Speech Commands dataset and the results are also highly promising.

Related papers

Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model [76.06585781346601]
Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model.<n>The choice of speech-text jointly decoding paradigm plays a critical role in performance, efficiency, and alignment quality.
arXiv Detail & Related papers (2025-06-04T23:53:49Z)
Intent-Aware Dialogue Generation and Multi-Task Contrastive Learning for Multi-Turn Intent Classification [6.459396785817196]
Chain-of-Intent generates intent-driven conversations through self-play. MINT-CL is a framework for multi-turn intent classification using multi-task contrastive learning. We release MINT-E, a multilingual, intent-aware multi-turn e-commerce dialogue corpus.
arXiv Detail & Related papers (2024-11-21T15:59:29Z)
Improved intent classification based on context information using a windows-based approach [0.0]
The intent classification task aims at identifying what a user is attempting to achieve from an utterance. Previous works use only the current utterance to predict the intent of a given query. We propose several approaches to investigate the role of contextual information for the intent classification task.
arXiv Detail & Related papers (2024-11-09T00:56:02Z)
Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance. We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information. Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU.
arXiv Detail & Related papers (2024-09-30T06:29:58Z)
Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL) GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval. Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process. Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z)
End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned. We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem. Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction. We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
Contextual Biasing of Language Models for Speech Recognition in Goal-Oriented Conversational Agents [11.193867567895353]
Goal-oriented conversational interfaces are designed to accomplish specific tasks. We propose a new architecture that utilizes context embeddings derived from BERT on sample utterances provided during inference time. Our experiments show a word error rate (WER) relative reduction of 7% over non-contextual utterance-level NLM rescorers on goal-oriented audio datasets.
arXiv Detail & Related papers (2021-03-18T15:38:08Z)
Discriminative Nearest Neighbor Few-Shot Intent Detection by Transferring Natural Language Inference [150.07326223077405]
Few-shot learning is attracting much attention to mitigate data scarcity. We present a discriminative nearest neighbor classification with deep self-attention. We propose to boost the discriminative ability by transferring a natural language inference (NLI) model.
arXiv Detail & Related papers (2020-10-25T00:39:32Z)
Temporarily-Aware Context Modelling using Generative Adversarial Networks for Speech Activity Detection [43.662221486962274]
We propose a novel joint learning framework for Speech Activity Detection (SAD) We utilise generative adversarial networks to automatically learn a loss function for joint prediction of the frame-wise speech/ non-speech classifications together with the next audio segment. We evaluate the proposed framework on multiple public benchmarks, including NIST OpenSAT' 17, AMI Meeting and HAVIC.
arXiv Detail & Related papers (2020-04-02T02:33:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.