Privacy-Preserving Edge Speech Understanding with Tiny Foundation Models
- URL: http://arxiv.org/abs/2502.01649v1
- Date: Wed, 29 Jan 2025 18:55:42 GMT
- Title: Privacy-Preserving Edge Speech Understanding with Tiny Foundation Models
- Authors: Afsara Benazir, Felix Xiaozhu Lin
- Abstract summary: We propose a novel use: enhancing speech privacy on resource-constrained devices.
We introduce XYZ, an edge/cloud privacy-preserving speech inference engine.
Our solution leads to robust speech recognition without forsaking privacy.
- Score: 0.9699101045941684
- Abstract: Robust speech recognition systems rely on cloud service providers for inference, so it must be ensured that an untrustworthy provider cannot deduce the sensitive content of the speech it receives. Speech content can be sanitized, but sanitization must avoid compromising transcription accuracy. Recognizing the underutilized capabilities of tiny speech foundation models (FMs), we propose, for the first time, a novel use for them: enhancing speech privacy on resource-constrained devices. We introduce XYZ, an edge/cloud privacy-preserving speech inference engine that filters sensitive entities without compromising transcript accuracy. We use a timestamp-based on-device masking approach in which a token-to-entity prediction model identifies sensitive entities; the chosen mask strategically conceals parts of the input and hides sensitive data. The masked input is sent to a trusted cloud service or to a local hub to generate the masked output. The effectiveness of XYZ hinges on how well the entity time segments are masked. Recovery is a confidence-score-based approach that chooses the better prediction between the cloud and on-device models. We implement XYZ on a 64-bit Raspberry Pi 4B. Experiments show that our solution yields robust speech recognition without forsaking privacy. XYZ, with < 100 MB of memory, achieves state-of-the-art (SOTA) speech transcription performance while filtering about 83% of private entities directly on-device. XYZ is 16x smaller in memory and 17x more compute-efficient than prior privacy-preserving speech frameworks, and reduces word error rate (WER) by a relative 38.8-77.5% compared to existing offline transcription services.
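The masking-and-recovery pipeline described in the abstract can be sketched roughly as follows. This is a hypothetical illustration only: the function names, the sample-zeroing mask, and the per-segment confidence comparison are assumptions, not the paper's published implementation.

```python
# Hypothetical sketch of a timestamp-based masking + confidence-based
# recovery pipeline in the spirit of XYZ. All names and details assumed.

def mask_sensitive_segments(samples, token_spans, is_sensitive, sr=16000):
    """Zero out the audio between the timestamps of tokens that the
    on-device token-to-entity model flags as sensitive."""
    masked = list(samples)
    for start_s, end_s, token in token_spans:
        if is_sensitive(token):
            lo, hi = int(start_s * sr), min(int(end_s * sr), len(masked))
            masked[lo:hi] = [0.0] * (hi - lo)
    return masked

def recover(cloud_hyps, device_hyps):
    """Per segment, keep whichever (text, confidence) hypothesis the
    cloud or the on-device model is more confident about."""
    return [c if c[1] >= d[1] else d for c, d in zip(cloud_hyps, device_hyps)]
```

In this sketch, the masked waveform would be offloaded for transcription, and `recover` would merge the returned hypotheses with the on-device model's predictions for the masked spans.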
Related papers
- Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR [74.38242498079627]
Self-supervised learning (SSL) based discrete speech representations are highly compact and domain adaptable.
In this paper, SSL discrete speech features extracted from WavLM models are used as additional cross-utterance acoustic context features in Zipformer-Transducer ASR systems.
arXiv Detail & Related papers (2024-09-13T13:01:09Z)
- Speech privacy-preserving methods using secret key for convolutional neural network models and their robustness evaluation
In environments where untrusted third parties provide CNN-based systems, ensuring the privacy of speech queries becomes essential.
This paper proposes encryption methods for speech queries using secret keys and a model structure that allows for encrypted queries to be accepted without decryption.
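One common secret-key construction in this line of work permutes feature dimensions with a key shared between the client and the model owner, so that a model whose input layer is permuted with the same key accepts encrypted queries without decryption. The sketch below is a generic illustration under that assumption, not the exact method of the cited paper.

```python
# Hypothetical sketch: protect a query by permuting its feature
# dimensions with a secret key; a model whose first layer is permuted
# with the same key can consume the query without decryption.
import random

def make_key(dim, seed):
    """Derive a permutation key from a shared secret seed."""
    perm = list(range(dim))
    random.Random(seed).shuffle(perm)
    return perm

def encrypt(features, key):
    """Permute the feature dimensions according to the key."""
    return [features[i] for i in key]

def decrypt(features, key):
    """Invert the permutation (only possible with the key)."""
    out = [0.0] * len(key)
    for pos, i in enumerate(key):
        out[i] = features[pos]
    return out
```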
arXiv Detail & Related papers (2024-08-07T16:51:39Z)
- Lightweight Protection for Privacy in Offloaded Speech Understanding [1.6317061277457001]
Cloud-based speech recognition systems pose privacy risks.
Disentanglement-based encoders require substantial memory and computational resources.
We introduce a novel system, XXX, optimized for such devices.
arXiv Detail & Related papers (2024-01-22T14:36:01Z)
- Speech Understanding on Tiny Devices with A Learning Cache [2.7186799067647334]
SpeechCache (or SC) is a speech cache for tiny devices.
We implement SC on an off-the-shelf STM32 microcontroller.
Our system resolves 45%-90% of inputs on device, reducing the average latency by up to 80% compared to offloading to popular cloud speech recognition services.
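A learning cache of this kind can be sketched roughly as below; the matcher, threshold, and tuple-keyed store are illustrative assumptions rather than SC's actual design.

```python
# Hypothetical sketch of a learning speech cache: resolve repeated
# commands on device, offload only cache misses to the cloud.

class SpeechCache:
    def __init__(self, matcher, threshold=0.8):
        self.entries = {}          # feature key (hashable) -> transcript
        self.matcher = matcher     # similarity in [0, 1] between features
        self.threshold = threshold

    def resolve(self, features, offload):
        # Return a cached transcript if a close-enough entry exists.
        best = max(self.entries.items(),
                   key=lambda kv: self.matcher(features, kv[0]),
                   default=None)
        if best and self.matcher(features, best[0]) >= self.threshold:
            return best[1]                     # cache hit: no offloading
        transcript = offload(features)         # cache miss: ask the cloud
        self.entries[features] = transcript    # learn for next time
        return transcript
```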
arXiv Detail & Related papers (2023-11-30T02:15:07Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Syfer: Neural Obfuscation for Private Data Release [58.490998583666276]
We develop Syfer, a neural obfuscation method to protect against re-identification attacks.
Syfer composes trained layers with random neural networks to encode the original data.
It maintains the ability to predict diagnoses from the encoded data.
arXiv Detail & Related papers (2022-01-28T20:32:04Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
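The target-swapping idea can be illustrated with a toy contrastive objective; the loss shape and similarity function below are generic InfoNCE-style assumptions, not the exact wav2vec 2.0 formulation.

```python
# Toy sketch of wav2vec-Switch's swapped prediction targets: the clean
# view's features predict the noisy view's quantized codes, and vice
# versa, on top of an InfoNCE-style contrastive loss (details assumed).
import math

def switched_targets(q_clean, q_noisy):
    """Swap quantized targets between the clean and noisy views."""
    return q_noisy, q_clean

def contrastive_loss(pred, target, distractors, sim):
    """-log softmax similarity of the true target vs. distractors."""
    num = math.exp(sim(pred, target))
    den = num + sum(math.exp(sim(pred, d)) for d in distractors)
    return -math.log(num / den)
```

Minimizing this loss with swapped targets pushes the clean and noisy representations toward agreement, which is the source of the noise robustness.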
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Protecting gender and identity with disentangled speech representations [49.00162808063399]
We show that protecting gender information in speech is more effective than modelling speaker-identity information.
We present a novel way to encode gender information and disentangle two sensitive biometric identifiers.
arXiv Detail & Related papers (2021-04-22T13:31:41Z)
- Paralinguistic Privacy Protection at the Edge [5.349852254138085]
We introduce EDGY, a representation learning framework that transforms and filters high-dimensional voice data to identify and contain sensitive attributes at the edge prior to offloading to the cloud.
Our results show that EDGY runs in tens of milliseconds, with a 0.2% relative improvement in ABX score or minimal performance penalties when learning linguistic representations from raw voice signals.
arXiv Detail & Related papers (2020-11-04T14:11:35Z)
- Private Speech Classification with Secure Multiparty Computation [15.065527713259542]
We propose the first privacy-preserving solution for deep learning-based audio classification that is provably secure.
Our approach allows one party's speech signal to be classified with another party's deep neural network without the model owner (Bob) ever seeing the data owner's (Alice's) speech signal in unencrypted form.
We evaluate the efficiency-security-accuracy trade-off of the proposed solution in a use case for privacy-preserving emotion detection from speech with a convolutional neural network.
arXiv Detail & Related papers (2020-07-01T05:26:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.