End-to-End Neural Transformer Based Spoken Language Understanding
- URL: http://arxiv.org/abs/2008.10984v1
- Date: Wed, 12 Aug 2020 22:58:20 GMT
- Title: End-to-End Neural Transformer Based Spoken Language Understanding
- Authors: Martin Radfar, Athanasios Mouchtaris, and Siegfried Kunzmann
- Abstract summary: Spoken language understanding (SLU) refers to the process of inferring semantic information from audio signals.
We introduce an end-to-end neural transformer-based SLU model that can predict the variable-length domain, intent, and slots embedded in an audio signal.
Our end-to-end transformer SLU predicts the domains, intents, and slots in the Fluent Speech Commands dataset with accuracies of 98.1%, 99.6%, and 99.6%, respectively.
- Score: 14.736425160859284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spoken language understanding (SLU) refers to the process of inferring
semantic information from audio signals. While neural transformers consistently
deliver the best performance among state-of-the-art neural architectures in the
field of natural language processing (NLP), their merits in the closely related
field of spoken language understanding (SLU) have not been investigated. In this
paper, we introduce an end-to-end neural transformer-based SLU model that
predicts the variable-length domain, intent, and slot vectors embedded in an
audio signal with no intermediate token prediction architecture. This new
architecture leverages the self-attention mechanism, by which the audio signal
is transformed into various subspaces, allowing the model to extract the
semantic context implied by an utterance. Our end-to-end transformer SLU
predicts the domains, intents, and slots in the Fluent Speech Commands dataset
with accuracies of 98.1%, 99.6%, and 99.6%, respectively, and outperforms SLU
models that combine recurrent and convolutional neural networks by 1.4%, while
being 25% smaller than those architectures. Additionally, due to the
independent subspace projections in the self-attention layer, the model is
highly parallelizable, which makes it a good candidate for on-device SLU.
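The paper ships no code; the following is a minimal PyTorch sketch of the idea the abstract describes: multi-head self-attention projects the audio features into independent subspaces (one per head), and the attended representation feeds domain, intent, and slot classifiers. The module name, layer sizes, head counts, and the mean-pooled fixed-size classification heads are illustrative assumptions, not the authors' architecture (which predicts variable-length slot vectors).

```python
# Illustrative sketch only; all sizes and names below are assumptions.
import torch
import torch.nn as nn

class SelfAttentionSLU(nn.Module):
    def __init__(self, d_model=256, n_heads=4,
                 n_domains=5, n_intents=31, n_slots=60):
        super().__init__()
        # Each head attends within its own d_model // n_heads subspace;
        # the projections are independent, so heads run in parallel.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.domain_head = nn.Linear(d_model, n_domains)
        self.intent_head = nn.Linear(d_model, n_intents)
        self.slot_head = nn.Linear(d_model, n_slots)

    def forward(self, x):
        # x: (batch, frames, d_model) acoustic features already
        # projected to d_model.
        h, _ = self.attn(x, x, x)   # self-attention over audio frames
        h = self.norm(x + h)        # residual connection + layer norm
        pooled = h.mean(dim=1)      # utterance-level summary
        return (self.domain_head(pooled),
                self.intent_head(pooled),
                self.slot_head(pooled))

model = SelfAttentionSLU()
feats = torch.randn(2, 100, 256)  # two utterances, 100 frames each
domain_logits, intent_logits, slot_logits = model(feats)
```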
Related papers
- Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech Inpainting [14.402357651227003]
We investigate the use of a speech SSL model for speech inpainting, that is, reconstructing a missing portion of a speech signal from its surrounding context.
To that purpose, we combine an SSL encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, playing the role of a decoder.
arXiv Detail & Related papers (2024-05-30T14:41:39Z)
- Probabilistic Transformer: A Probabilistic Dependency Model for Contextual Word Representation [52.270712965271656]
We propose a new model of contextual word representation, not from a neural perspective, but from a purely syntactic and probabilistic perspective.
We find that the graph of our model resembles transformers, with correspondences between dependencies and self-attention.
Experiments show that our model performs competitively to transformers on small to medium sized datasets.
arXiv Detail & Related papers (2023-11-26T06:56:02Z)
- STOP: A dataset for Spoken Task Oriented Semantic Parsing [66.14615249745448]
End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model.
We release the Spoken Task-Oriented semantic Parsing (STOP) dataset, the largest and most complex SLU dataset to be publicly available.
In addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems.
arXiv Detail & Related papers (2022-06-29T00:36:34Z)
- Variable Bitrate Neural Fields [75.24672452527795]
We present a dictionary method for compressing feature grids, reducing their memory consumption by up to 100x.
We formulate the dictionary optimization as a vector-quantized auto-decoder problem which lets us learn end-to-end discrete neural representations in a space where no direct supervision is available.
arXiv Detail & Related papers (2022-06-15T17:58:34Z)
- Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU).
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
- End-to-end model for named entity recognition from speech without paired training data [12.66131972249388]
We propose an approach to build an end-to-end neural model to extract semantic information.
Our approach is based on the use of an external model trained to generate a sequence of vectorial representations from text.
Experiments on named entity recognition, carried out on the QUAERO corpus, show that this approach is very promising.
arXiv Detail & Related papers (2022-04-02T08:14:27Z)
- Multi-task RNN-T with Semantic Decoder for Streamable Spoken Language Understanding [16.381644007368763]
End-to-end Spoken Language Understanding (E2E SLU) has attracted increasing interest due to its advantages of joint optimization and low latency.
We propose a streamable multi-task semantic transducer model to address these considerations.
Our proposed architecture predicts ASR and NLU labels auto-regressively and uses a semantic decoder to ingest both previously predicted word-pieces and slot tags.
arXiv Detail & Related papers (2022-04-01T16:38:56Z)
- DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding [71.73405116189531]
We propose a neural vocoder that extracts F0 and timbre/aperiodicity encoding from the input speech that emulates those defined in conventional vocoders.
As the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.
arXiv Detail & Related papers (2021-10-13T01:39:57Z)
- End-to-End Spoken Language Understanding using RNN-Transducer ASR [14.267028645397266]
We propose an end-to-end trained spoken language understanding (SLU) system that extracts transcripts, intents and slots from an input speech utterance.
It consists of a streaming recurrent neural network transducer (RNNT) based automatic speech recognition (ASR) model connected to a neural natural language understanding (NLU) model through a neural interface.
arXiv Detail & Related papers (2021-06-30T09:20:32Z)
- End-to-End Spoken Language Understanding for Generalized Voice Assistants [15.241812584273886]
We present our approach to developing an E2E model for generalized speech recognition in commercial voice assistants (VAs).
We propose a fully differentiable, transformer-based, hierarchical system that can be pretrained at both the ASR and NLU levels.
This is then fine-tuned on both transcription and semantic classification losses to handle a diverse set of intent and argument combinations.
arXiv Detail & Related papers (2021-06-16T17:56:47Z)
- Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
arXiv Detail & Related papers (2020-05-20T09:53:06Z)
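As a hedged illustration of the relative positional encoding idea summarized above (in the style of Shaw et al., not necessarily this paper's exact implementation), the sketch below adds a learned bias indexed by the clipped offset j - i to each attention logit, so scores depend on relative distance rather than absolute position. The class name and the `max_dist` clipping horizon are hypothetical.

```python
# Illustrative sketch of relative positional attention biases.
import torch
import torch.nn as nn

class RelativeAttentionScores(nn.Module):
    def __init__(self, d_model=256, max_dist=64):
        super().__init__()
        self.max_dist = max_dist
        # One learnable bias per clipped relative offset in [-max_dist, max_dist].
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_dist + 1))
        self.scale = d_model ** -0.5

    def forward(self, q, k):
        # q, k: (batch, seq, d_model)
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale
        seq = q.size(1)
        # Relative offsets j - i, clipped so distant frames share one bias.
        offsets = torch.arange(seq)[None, :] - torch.arange(seq)[:, None]
        offsets = offsets.clamp(-self.max_dist, self.max_dist) + self.max_dist
        return scores + self.rel_bias[offsets]  # bias broadcast over batch

q = k = torch.randn(2, 50, 256)
attn_logits = RelativeAttentionScores()(q, k)  # shape: (2, 50, 50)
```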