Speech Understanding on Tiny Devices with A Learning Cache
- URL: http://arxiv.org/abs/2311.18188v4
- Date: Wed, 8 May 2024 17:08:52 GMT
- Title: Speech Understanding on Tiny Devices with A Learning Cache
- Authors: Afsara Benazir, Zhiming Xu, Felix Xiaozhu Lin
- Abstract summary: SpeechCache (or SC) is a speech cache for tiny devices.
We implement SC on an off-the-shelf STM32 microcontroller.
Our system resolves 45%-90% of inputs on device, reducing the average latency by up to 80% compared to offloading to popular cloud speech recognition services.
- Score: 2.7186799067647334
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper addresses spoken language understanding (SLU) on microcontroller-like embedded devices, integrating on-device execution with cloud offloading in a novel fashion. We leverage temporal locality in the speech inputs to a device and reuse recent SLU inferences accordingly. Our idea is simple: let the device match incoming inputs against cached results, and only offload inputs not matched to any cached ones to the cloud for full inference. Realization of this idea, however, is non-trivial: the device needs to compare acoustic features in a robust yet low-cost way. To this end, we present SpeechCache (or SC), a speech cache for tiny devices. It matches speech inputs at two levels of representations: first by sequences of clustered raw sound units, then as sequences of phonemes. Working in tandem, the two representations offer complementary tradeoffs between cost and efficiency. To boost accuracy even further, our cache learns to personalize: with the mismatched and then offloaded inputs, it continuously finetunes the device's feature extractors with the assistance of the cloud. We implement SC on an off-the-shelf STM32 microcontroller. The complete implementation has a small memory footprint of 2MB. Evaluated on challenging speech benchmarks, our system resolves 45%-90% of inputs on device, reducing the average latency by up to 80% compared to offloading to popular cloud speech recognition services. The benefit brought by our proposed SC is notable even in adversarial settings - noisy environments, cold cache, or one device shared by a number of users.
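The match-or-offload control flow in the abstract lends itself to a compact sketch. Below is a minimal Python illustration of the two-level cache lookup, assuming hypothetical featurizer outputs, thresholds, and offload call; it is not the paper's implementation.

```python
# Minimal sketch of SpeechCache's two-level match-or-offload flow, based only
# on the abstract. The similarity measure, thresholds, and offload_fn are
# illustrative placeholders, not the paper's implementation.
from difflib import SequenceMatcher

class SpeechCache:
    def __init__(self, unit_threshold=0.85, phoneme_threshold=0.90):
        self.entries = []  # (unit_seq, phoneme_seq, slu_result) tuples
        self.unit_threshold = unit_threshold
        self.phoneme_threshold = phoneme_threshold

    @staticmethod
    def _similarity(a, b):
        # Cheap stand-in for the paper's robust, low-cost acoustic matching.
        return SequenceMatcher(None, a, b).ratio()

    def resolve(self, unit_seq, phoneme_seq, offload_fn):
        # Level 1: match sequences of clustered raw sound units (cheapest).
        for units, _, result in self.entries:
            if self._similarity(unit_seq, units) >= self.unit_threshold:
                return result  # resolved on device, no cloud round trip
        # Level 2: fall back to the phoneme-level representation.
        for _, phonemes, result in self.entries:
            if self._similarity(phoneme_seq, phonemes) >= self.phoneme_threshold:
                return result
        # Cache miss: offload for full SLU inference, then cache the result.
        result = offload_fn(unit_seq, phoneme_seq)
        self.entries.append((unit_seq, phoneme_seq, result))
        return result
```

A real deployment would bound the cache size and, per the abstract, use the offloaded inputs to continuously finetune the on-device feature extractors with cloud assistance.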
Related papers
- Privacy-Preserving Edge Speech Understanding with Tiny Foundation Models [0.9699101045941684]
We propose a novel use of tiny foundation models: enhancing speech privacy on resource-constrained devices.
We introduce XYZ, an edge/cloud privacy preserving speech inference engine.
Our solution leads to robust speech recognition without forsaking privacy.
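The summary gives no mechanism, but one plausible edge/cloud privacy split is to ship only a lossy, discretized representation instead of raw audio. The sketch below is purely illustrative and every name in it is hypothetical.

```python
# Hypothetical sketch: the device quantizes cheap spectral features against a
# codebook and transmits only discrete unit IDs; raw audio never leaves.
import numpy as np

def on_device_encode(waveform: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    frames = waveform[: len(waveform) // 160 * 160].reshape(-1, 160)
    feats = np.abs(np.fft.rfft(frames, axis=-1))   # (n_frames, 81) magnitudes
    dists = ((feats[:, None, :] - codebook[None]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)                    # discrete unit IDs only

def cloud_recognize(unit_ids: np.ndarray, backend) -> str:
    # Hypothetical cloud call: the backend sees unit IDs, not the waveform.
    return backend(unit_ids)
```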
arXiv Detail & Related papers (2025-01-29T18:55:42Z)
- VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
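The summary does not describe the mechanism; as a hedged sketch, a monotonic alignment constraint of the kind the title names can be expressed as a decoder whose phoneme pointer only stays or advances.

```python
# Hedged sketch of monotonic phoneme-to-acoustic decoding: the pointer into
# the phoneme sequence never moves backward. step_fn is a stand-in for the
# model's per-step prediction and is not part of VALL-E R's actual API.
def decode_monotonic(phonemes, step_fn, max_steps=2000):
    """step_fn(phoneme, prev_units) -> (acoustic_unit, advance: bool)."""
    units, ptr = [], 0
    while ptr < len(phonemes) and len(units) < max_steps:
        unit, advance = step_fn(phonemes[ptr], units)
        units.append(unit)
        if advance:
            ptr += 1  # monotonic: the alignment pointer never decreases
    return units
```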
arXiv Detail & Related papers (2024-06-12T04:09:44Z)
- End-to-End Integration of Speech Separation and Voice Activity Detection for Low-Latency Diarization of Telephone Conversations
Speech separation guided diarization (SSGD) performs diarization by first separating the speakers and then applying voice activity detection (VAD) on each separated stream.
We consider three state-of-the-art speech separation (SSep) algorithms and study their performance in online and offline scenarios.
We show that our best model achieves 8.8% DER on CALLHOME, which outperforms the current state-of-the-art end-to-end neural diarization model.
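The separate-then-VAD pipeline reduces to a short sketch; separate_fn and vad_fn below are assumed interfaces standing in for the SSep and VAD models, not the paper's code.

```python
# Sketch of speech separation guided diarization (SSGD): separate the
# speakers first, then run VAD on each separated stream to get turns.
import numpy as np

def ssgd(mixture: np.ndarray, separate_fn, vad_fn):
    streams = separate_fn(mixture)       # e.g. 2 separated waveforms
    segments = []                        # (speaker, start_frame, end_frame)
    for spk, stream in enumerate(streams):
        active = vad_fn(stream)          # boolean speech mask per frame
        start = None
        for i, is_speech in enumerate(active):
            if is_speech and start is None:
                start = i
            elif not is_speech and start is not None:
                segments.append((spk, start, i))
                start = None
        if start is not None:
            segments.append((spk, start, len(active)))
    return sorted(segments, key=lambda seg: seg[1])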
arXiv Detail & Related papers (2023-03-21T16:33:56Z)
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units.
We enhance the model performance by subword prediction in the first-pass decoder.
We show that the proposed methods boost the performance even when predicting spectrogram in the second pass.
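The two-pass structure can be summarized schematically; every module below is a placeholder for the corresponding UnitY component, with assumed call signatures.

```python
# Schematic of UnitY's two-pass decoding: pass 1 emits target-language
# subwords, pass 2 predicts discrete acoustic units conditioned on both the
# encoder states and the first-pass text. Modules are assumed interfaces.
def unity_two_pass(speech, encoder, text_decoder, unit_decoder, vocoder):
    enc = encoder(speech)            # speech encoder states
    text = text_decoder(enc)         # pass 1: subword sequence
    units = unit_decoder(enc, text)  # pass 2: discrete acoustic units
    return text, vocoder(units)      # translated text + synthesized waveform
```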
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- Implicit Acoustic Echo Cancellation for Keyword Spotting and Device-Directed Speech Detection [2.7393821783237184]
In many speech-enabled human-machine interaction scenarios, user speech can overlap with the device playback audio.
We propose an implicit acoustic echo cancellation framework where a neural network is trained to exploit the additional information from a reference microphone channel.
We show a 56% reduction in false-reject rate for the DDD task during device playback conditions.
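A minimal sketch of the idea, assuming illustrative layer sizes: rather than cancelling the echo explicitly, the detector simply receives the playback reference as extra input channels and learns to discount it.

```python
# Sketch of "implicit AEC": a detector network takes the playback reference
# as an additional input and learns to ignore device playback. The
# architecture is illustrative, not the paper's.
import torch
import torch.nn as nn

class ImplicitAECDetector(nn.Module):
    def __init__(self, feat_dim=40, hidden=64):
        super().__init__()
        # 2 feature streams: microphone and reference (playback) channels.
        self.net = nn.Sequential(
            nn.Conv1d(2 * feat_dim, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(hidden, 1),          # device-directed vs. not
        )

    def forward(self, mic_feats, ref_feats):
        # Both inputs: (batch, feat_dim, time); stack along the feature axis.
        x = torch.cat([mic_feats, ref_feats], dim=1)
        return torch.sigmoid(self.net(x)).squeeze(-1)
```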
arXiv Detail & Related papers (2021-11-20T17:21:16Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
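The target-switching step can be sketched as a simplified contrastive loss; unlike real wav2vec 2.0, which samples negatives, the toy version below treats a frame's other time steps as negatives.

```python
# Simplified sketch of wav2vec-Switch's target switching: each view (clean,
# noisy) must also predict the *other* view's quantized targets, encouraging
# noise-invariant context vectors. Not the actual wav2vec 2.0 objective.
import torch
import torch.nn.functional as F

def switched_contrastive_loss(ctx_clean, ctx_noisy, q_clean, q_noisy, temp=0.1):
    """Each tensor: (batch, time, dim). Returns a scalar loss."""
    def contrast(ctx, target):
        ctx = F.normalize(ctx, dim=-1)
        target = F.normalize(target, dim=-1)
        logits = torch.einsum("btd,bsd->bts", ctx, target) / temp  # (B, T, T)
        labels = torch.arange(logits.size(1), device=ctx.device).repeat(logits.size(0))
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels)

    # Standard targets for each view, plus the switched cross-view targets.
    loss = contrast(ctx_clean, q_clean) + contrast(ctx_noisy, q_noisy)
    loss = loss + contrast(ctx_clean, q_noisy) + contrast(ctx_noisy, q_clean)
    return loss
```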
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Streaming on-device detection of device directed speech from voice and touch-based invocation [12.42440115067583]
We propose an acoustic false-trigger-mitigation (FTM) approach for on-device device-directed speech detection.
To facilitate on-device model deployment, we introduce a new streaming decision layer derived from the notion of temporal convolutional networks (TCNs).
To the best of our knowledge, this is the first approach that can detect device-directed speech from more than one invocation type in a streaming fashion.
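A hedged sketch of a TCN-style streaming decision layer, with illustrative layer sizes: causal dilated convolutions mean each frame's decision depends only on past context, which is what makes frame-by-frame streaming possible.

```python
# Sketch in the spirit of the TCN-derived decision layer above: left-padded
# dilated 1-D convolutions never look at future frames, so the per-frame
# trigger score can be emitted in a streaming fashion.
import torch
import torch.nn as nn

class CausalTCNDecision(nn.Module):
    def __init__(self, in_dim=64, hidden=64, kernel=3, dilations=(1, 2, 4)):
        super().__init__()
        layers, dim = [], in_dim
        for d in dilations:
            # Pad only on the left so the convolution stays causal.
            layers += [nn.ConstantPad1d(((kernel - 1) * d, 0), 0.0),
                       nn.Conv1d(dim, hidden, kernel, dilation=d), nn.ReLU()]
            dim = hidden
        self.tcn = nn.Sequential(*layers)
        self.head = nn.Conv1d(hidden, 1, kernel_size=1)  # per-frame score

    def forward(self, feats):
        # feats: (batch, in_dim, time) -> per-frame device-directed probability
        return torch.sigmoid(self.head(self.tcn(feats))).squeeze(1)
```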
arXiv Detail & Related papers (2021-10-09T22:33:42Z)
- LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search [127.56834100382878]
We propose LightSpeech to automatically design more lightweight and efficient TTS models based on FastSpeech.
Experiments show that the model discovered by our method achieves a 15x model compression ratio and a 6.5x inference speedup on CPU with on-par voice quality.
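As a toy illustration of the search loop behind such a NAS approach, the snippet below samples architectures from a made-up FastSpeech-like space and keeps the best one under a footprint budget; the space and scoring proxy are invented for illustration, not LightSpeech's actual method.

```python
# Toy NAS loop: random search over a small architecture space, constrained
# by a crude parameter-count proxy. Purely illustrative.
import random

SEARCH_SPACE = {
    "encoder_layers": [2, 3, 4],
    "decoder_layers": [2, 3, 4],
    "hidden_dim": [128, 192, 256],
    "kernel_size": [3, 5, 9],
}

def sample_arch():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def footprint(arch):
    # Crude proxy for model size; a real search would measure actual params.
    layers = arch["encoder_layers"] + arch["decoder_layers"]
    return layers * arch["hidden_dim"] ** 2 * arch["kernel_size"]

def search(evaluate_quality, budget, trials=100):
    best, best_q = None, float("-inf")
    for _ in range(trials):
        arch = sample_arch()
        if footprint(arch) > budget:
            continue                   # enforce the compression target
        q = evaluate_quality(arch)     # e.g. validation loss / proxy MOS
        if q > best_q:
            best, best_q = arch, q
    return best
```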
arXiv Detail & Related papers (2021-02-08T07:45:06Z)
- VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition [60.462770498366524]
We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user.
We show that such a model can be quantized as an 8-bit integer model and run in real time.
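The 8-bit quantization step can be illustrated with simple symmetric per-tensor int8 quantize/dequantize; the summary does not specify VoiceFilter-Lite's actual scheme, so this is only a generic sketch.

```python
# Generic symmetric int8 weight quantization sketch (not VoiceFilter-Lite's
# documented scheme): map floats to [-127, 127] with a per-tensor scale.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = max(float(np.abs(w).max()) / 127.0, 1e-12)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(dequantize(q, s) - w).max())  # small vs. |w|
```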
arXiv Detail & Related papers (2020-09-09T14:26:56Z)
- TinySpeech: Attention Condensers for Deep Speech Recognition Neural Networks on Edge Devices [71.68436132514542]
We introduce the concept of attention condensers for building low-footprint, highly-efficient deep neural networks for on-device speech recognition on the edge.
To illustrate its efficacy, we introduce TinySpeech, a family of low-precision deep neural networks tailored for on-device speech recognition.
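A hedged sketch of an attention condenser in this spirit: condense the input to a small embedding, expand it back into an attention map, and modulate the activations. The exact structure is simplified from the paper's description, and the block assumes even spatial dimensions.

```python
# Simplified attention condenser sketch: cheap condensation, expansion back
# to an attention map, then selective modulation of the input activations.
import torch
import torch.nn as nn

class AttentionCondenser(nn.Module):
    def __init__(self, channels, reduced=8):
        super().__init__()
        self.condense = nn.Sequential(            # condense to a small embedding
            nn.MaxPool2d(2),
            nn.Conv2d(channels, reduced, 1), nn.ReLU(),
        )
        self.expand = nn.Sequential(              # expand to an attention map
            nn.Conv2d(reduced, channels, 1),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Sigmoid(),
        )

    def forward(self, x):                         # x: (B, C, H, W), H and W even
        attn = self.expand(self.condense(x))      # selective attention map
        return x * attn                           # modulate activations
```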
arXiv Detail & Related papers (2020-08-10T16:34:52Z)