A low latency ASR-free end to end spoken language understanding system
- URL: http://arxiv.org/abs/2011.04884v1
- Date: Tue, 10 Nov 2020 04:16:56 GMT
- Title: A low latency ASR-free end to end spoken language understanding system
- Authors: Mohamed Mhiri, Samuel Myer, Vikrant Singh Tomar
- Abstract summary: This work proposes a system that has a small enough footprint to run on small micro-controllers and embedded systems with minimal latency.
Given a streaming input speech signal, the proposed system can process it segment-by-segment without the need to have the entire stream at the moment of processing.
Experiments show that the proposed system yields state-of-the-art performance with the advantage of low latency and a much smaller model when compared to other published works on the same task.
- Score: 11.413018142161249
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, developing a speech understanding system that classifies a
waveform to structured data, such as intents and slots, without first
transcribing the speech to text has emerged as an interesting research problem.
This work proposes such a system with an additional constraint of designing a
system that has a small enough footprint to run on small micro-controllers and
embedded systems with minimal latency. Given a streaming input speech signal,
the proposed system can process it segment-by-segment without the need to have
the entire stream at the moment of processing. The proposed system is evaluated
on the publicly available Fluent Speech Commands dataset. Experiments show that
the proposed system yields state-of-the-art performance with the advantage of
low latency and a much smaller model when compared to other published works on
the same task.
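The abstract's key mechanism is segment-by-segment streaming: the classifier consumes fixed-size chunks of the waveform and keeps only a running summary, so the full utterance never has to be buffered. A minimal sketch of that idea follows; all names (StreamingSLU, SEGMENT_LEN, the random stand-in weights) are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of segment-by-segment streaming inference for an
# ASR-free SLU classifier. Weights are random stand-ins for a small
# learned encoder/classifier; nothing here is taken from the paper.
import numpy as np

SEGMENT_LEN = 1600   # e.g. 100 ms of 16 kHz audio per segment
FEATURE_DIM = 32
NUM_INTENTS = 31     # Fluent Speech Commands defines 31 intent combinations

rng = np.random.default_rng(0)

class StreamingSLU:
    """Processes audio one segment at a time, keeping a running summary
    so the entire stream never has to be held in memory."""

    def __init__(self):
        self.w_enc = rng.standard_normal((SEGMENT_LEN, FEATURE_DIM)) * 0.01
        self.w_cls = rng.standard_normal((FEATURE_DIM, NUM_INTENTS)) * 0.01
        self.state = np.zeros(FEATURE_DIM)  # running mean of segment features
        self.count = 0

    def push_segment(self, segment: np.ndarray) -> np.ndarray:
        """Consume one fixed-size segment; return current intent posteriors."""
        feat = np.tanh(segment @ self.w_enc)
        self.count += 1
        self.state += (feat - self.state) / self.count  # incremental mean
        logits = self.state @ self.w_cls
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()  # softmax over intents

slu = StreamingSLU()
stream = rng.standard_normal(SEGMENT_LEN * 5)  # fake 0.5 s waveform
for i in range(0, len(stream), SEGMENT_LEN):
    probs = slu.push_segment(stream[i:i + SEGMENT_LEN])
```

Because the state update is an incremental mean, each segment is processed in constant time and memory, which is the property that enables low-latency inference on micro-controllers.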
Related papers
- Systematic Evaluation of Online Speaker Diarization Systems Regarding their Latency [44.99833362998488]
The latency is the time span from audio input to the output of the corresponding speaker label.
The lowest latency is achieved for the DIART-pipeline with the embedding model pyannote/embedding.
The FS-EEND system shows a similarly good latency.
arXiv Detail & Related papers (2024-07-05T06:54:27Z)
- Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models [43.155061160275196]
We explore the possibility of making interactions with virtual assistants more natural by eliminating the need for a trigger phrase.
Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone.
We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder.
arXiv Detail & Related papers (2023-12-06T17:29:03Z)
- DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first streaming ST and SD solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector.
Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z)
- Efficient Speech Quality Assessment using Self-supervised Framewise Embeddings [13.12010504777376]
Speech quality assessment is essential for audio researchers, developers, speech and language pathologists, and system quality engineers.
Current state-of-the-art systems are based on framewise speech features (hand-engineered or learnable) combined with time dependency modeling.
This paper proposes an efficient system with results comparable to the best performing model in the ConferencingSpeech 2022 challenge.
arXiv Detail & Related papers (2022-11-12T11:57:08Z)
- ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition [100.30565531246165]
Speech recognition systems require dataset-specific tuning.
This tuning requirement can lead to systems failing to generalise to other datasets and domains.
We introduce the End-to-end Speech Benchmark (ESB) for evaluating the performance of a single automatic speech recognition system.
arXiv Detail & Related papers (2022-10-24T15:58:48Z)
- STOP: A dataset for Spoken Task Oriented Semantic Parsing [66.14615249745448]
End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model.
We release the Spoken Task-Oriented semantic Parsing (STOP) dataset, the largest and most complex SLU dataset to be publicly available.
In addition to the human-recorded audio, we are releasing a TTS-generated version to benchmark the performance for low-resource domain adaptation of end-to-end SLU systems.
arXiv Detail & Related papers (2022-06-29T00:36:34Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate on designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
arXiv Detail & Related papers (2022-02-17T08:26:25Z)
- Dissecting User-Perceived Latency of On-Device E2E Speech Recognition [34.645194215436966]
We show that factors affecting token emission latency and endpointing behavior significantly impact user-perceived latency (UPL).
We achieve the best trade-off between latency and word error rate when performing ASR jointly with endpointing, and using the recently proposed alignment regularization.
arXiv Detail & Related papers (2021-04-06T00:55:11Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.