End-to-End Spoken Language Understanding: Performance analyses of a
voice command task in a low resource setting
- URL: http://arxiv.org/abs/2207.08179v1
- Date: Sun, 17 Jul 2022 13:51:56 GMT
- Title: End-to-End Spoken Language Understanding: Performance analyses of a
voice command task in a low resource setting
- Authors: Thierry Desot, François Portet, Michel Vacher
- Abstract summary: We present a study identifying the signal features and other linguistic properties used by an E2E model to perform the Spoken Language Understanding task.
The study is carried out in the application domain of a smart home that has to handle non-English (here French) voice commands.
- Score: 0.3867363075280543
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spoken Language Understanding (SLU) is a core task in most human-machine
interaction systems. With the emergence of smart homes, smartphones, and smart
speakers, SLU has become a key technology for the industry. In a classical SLU
approach, an Automatic Speech Recognition (ASR) module transcribes the speech
signal into a textual representation from which a Natural Language
Understanding (NLU) module extracts semantic information. Recently, End-to-End
SLU (E2E SLU) based on Deep Neural Networks has gained momentum, since it
benefits from the joint optimization of the ASR and NLU parts, hence
limiting the cascading-error effect of the pipeline architecture. However,
little is known about the actual linguistic properties used by E2E models to
predict concepts and intents from speech input. In this paper, we present a
study identifying the signal features and other linguistic properties used by
an E2E model to perform the SLU task. The study is carried out in the
application domain of a smart home that has to handle non-English (here French)
voice commands. The results show that a good E2E SLU performance does not
always require a perfect ASR capability. Furthermore, the results show the
superior capabilities of the E2E model in handling background noise and
syntactic variation compared to the pipeline model. Finally, a finer-grained
analysis suggests that the E2E model uses the pitch information of the input
signal to identify voice command concepts. The results and methodology outlined
in this paper provide a springboard for further analyses of E2E models in
speech processing.
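To make the contrast between the cascade and the E2E approach concrete, here is a minimal, hypothetical Python sketch. The class names, toy French lexicon, and placeholder logic are illustrative assumptions, not the models evaluated in the paper; the sketch only shows where ASR errors cascade into the NLU stage of a pipeline and how an E2E model maps the audio signal directly to a semantic frame.

    # Minimal, illustrative contrast between pipeline SLU and E2E SLU.
    # All names (PipelineSLU, E2ESLU, the toy lexicon) are hypothetical;
    # a real system would use trained ASR, NLU, and E2E models.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class VoiceCommand:
        intent: str            # e.g. "turn_on"
        concepts: List[str]    # e.g. ["device=light", "room=kitchen"]

    class PipelineSLU:
        """Classical cascade: ASR transcribes the audio, then NLU parses the text.
        Any ASR error propagates unchecked into the NLU stage (cascading errors)."""
        def asr(self, audio: List[float]) -> str:
            # Toy stand-in for a speech recogniser; may return a noisy transcript.
            return "allume la lumiere de la cuisine"  # "turn on the kitchen light"

        def nlu(self, transcript: str) -> VoiceCommand:
            intent = "turn_on" if "allume" in transcript else "unknown"
            concepts = []
            if "lumiere" in transcript:
                concepts.append("device=light")
            if "cuisine" in transcript:
                concepts.append("room=kitchen")
            return VoiceCommand(intent, concepts)

        def __call__(self, audio: List[float]) -> VoiceCommand:
            return self.nlu(self.asr(audio))

    class E2ESLU:
        """End-to-end model: a single network maps audio directly to the semantic
        frame, so it can exploit acoustic cues (e.g. pitch) that a transcript
        discards, and the ASR and NLU parts are optimised jointly."""
        def __call__(self, audio: List[float]) -> VoiceCommand:
            # Toy stand-in for a neural model operating on the raw signal.
            mean_energy = sum(abs(x) for x in audio) / max(len(audio), 1)
            intent = "turn_on" if mean_energy > 0.0 else "unknown"
            return VoiceCommand(intent, ["device=light", "room=kitchen"])

    if __name__ == "__main__":
        audio = [0.01, 0.03, -0.02, 0.05]  # placeholder waveform samples
        print("pipeline:", PipelineSLU()(audio))
        print("e2e     :", E2ESLU()(audio))

The point of the sketch is structural only: in the pipeline, the NLU stage never sees the signal, so a transcription error is unrecoverable downstream, while the E2E model keeps access to the acoustics end to end.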
Related papers
- Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks [68.79880423713597]
We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis.
Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts (a toy prompt-serialisation sketch appears after this list).
arXiv Detail & Related papers (2024-01-05T17:58:10Z)
- Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding [18.616202196061966]
End-to-end (E2E) spoken language understanding (SLU) systems that generate a semantic parse from speech have become more promising recently.
This approach uses a single model that leverages audio and text representations from pre-trained automatic speech recognition (ASR) models.
We propose a novel E2E SLU system that enhances robustness to ASR errors by fusing audio and text representations based on the estimated modality confidence of ASR hypotheses (a minimal confidence-weighted fusion sketch appears after this list).
arXiv Detail & Related papers (2023-07-22T17:47:31Z)
- Two-Pass Low Latency End-to-End Spoken Language Understanding [36.81762807197944]
We incorporated language models pre-trained on unlabeled text data inside E2E-SLU frameworks to build strong semantic representations.
We developed a 2-pass SLU system that makes a low-latency prediction using acoustic information from the first few seconds of audio in the first pass.
Our code and models are publicly available as part of the ESPnet-SLU toolkit.
arXiv Detail & Related papers (2022-07-14T05:50:16Z)
- Finstreder: Simple and fast Spoken Language Understanding with Finite State Transducers using modern Speech-to-Text models [69.35569554213679]
In Spoken Language Understanding (SLU), the task is to extract important information from audio commands.
This paper presents a simple method for embedding intents and entities into Finite State Transducers.
arXiv Detail & Related papers (2022-06-29T12:49:53Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- Multi-task RNN-T with Semantic Decoder for Streamable Spoken Language Understanding [16.381644007368763]
End-to-end Spoken Language Understanding (E2E SLU) has attracted increasing interest due to its advantages of joint optimization and low latency.
We propose a streamable multi-task semantic transducer model to address these considerations.
Our proposed architecture predicts ASR and NLU labels auto-regressively and uses a semantic decoder to ingest both previously predicted word-pieces and slot tags.
arXiv Detail & Related papers (2022-04-01T16:38:56Z)
- RNN Transducer Models For Spoken Language Understanding [49.07149742835825]
We show how RNN-T SLU models can be developed starting from pre-trained automatic speech recognition systems.
In settings where real audio data is not available, artificially synthesized speech is used to successfully adapt various SLU models.
arXiv Detail & Related papers (2021-04-08T15:35:22Z)
- Speech-language Pre-training for End-to-end Spoken Language Understanding [18.548949994603213]
We propose to unify a well-optimized E2E ASR encoder (speech) and a pre-trained language model encoder (language) into a transformer decoder.
The experimental results on two public corpora show that our approach to E2E SLU is superior to the conventional cascaded method.
arXiv Detail & Related papers (2021-02-11T21:55:48Z)
- Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
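As referenced in the word-confusion-network entry above, the following short Python sketch shows one way an ASR confusion network could be serialised into an LLM prompt for intent classification. The prompt format, the function name, and the toy network are assumptions for illustration, not the method described in that paper.

    # Illustrative sketch of serialising a word confusion network (WCN)
    # into an LLM prompt for intent classification. The prompt wording and
    # the toy network are hypothetical.
    from typing import List, Tuple

    # Each position holds (word, posterior) alternatives from the ASR lattice.
    ConfusionBin = List[Tuple[str, float]]

    def wcn_to_prompt(wcn: List[ConfusionBin], intents: List[str]) -> str:
        """Render each confusion bin as 'word1(p1)|word2(p2)' so the LLM sees
        competing hypotheses instead of only the 1-best transcript."""
        positions = [
            "|".join(f"{w}({p:.2f})" for w, p in sorted(bin_, key=lambda x: -x[1]))
            for bin_ in wcn
        ]
        return (
            "Utterance (ASR alternatives per word): " + " ".join(positions) + "\n"
            "Possible intents: " + ", ".join(intents) + "\n"
            "Answer with the single most likely intent."
        )

    if __name__ == "__main__":
        toy_wcn = [
            [("turn", 0.9), ("torn", 0.1)],
            [("on", 0.7), ("off", 0.3)],
            [("the", 1.0)],
            [("light", 0.6), ("lights", 0.4)],
        ]
        print(wcn_to_prompt(toy_wcn, ["turn_on", "turn_off", "set_temperature"]))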
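Likewise, for the modality-confidence entry above, this sketch illustrates the general idea of fusing audio and text representations with a weight derived from ASR confidence. The convex-combination rule and all names are illustrative assumptions, not that paper's exact formulation.

    # Hedged sketch of confidence-weighted fusion of audio and text embeddings.
    from typing import List

    def fuse(audio_emb: List[float], text_emb: List[float], asr_confidence: float) -> List[float]:
        """Convex combination of the two modality embeddings.

        When the ASR hypothesis is trusted (confidence near 1) the fused
        representation leans on the text embedding; when it is not, it
        falls back on the acoustic embedding."""
        assert len(audio_emb) == len(text_emb)
        w = min(max(asr_confidence, 0.0), 1.0)  # clamp to [0, 1]
        return [w * t + (1.0 - w) * a for a, t in zip(audio_emb, text_emb)]

    if __name__ == "__main__":
        audio_emb = [0.2, -0.1, 0.7]  # placeholder encoder outputs
        text_emb = [0.5, 0.4, -0.3]
        print(fuse(audio_emb, text_emb, asr_confidence=0.9))  # mostly text
        print(fuse(audio_emb, text_emb, asr_confidence=0.2))  # mostly audio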