Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models
- URL: http://arxiv.org/abs/2601.13948v2
- Date: Tue, 27 Jan 2026 05:25:34 GMT
- Title: Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models
- Authors: Nikita Kuzmin, Songting Liu, Kong Aik Lee, Eng Siong Chng
- Abstract summary: Stream-Voice-Anon adapts modern causal LM-based NAC architectures specifically for streaming speaker anonymization. Our anonymization approach incorporates pseudo-speaker representation sampling, speaker embedding mixing, and diverse prompt selection strategies. Under the VoicePrivacy 2024 Challenge protocol, Stream-Voice-Anon achieves substantial improvements in intelligibility.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Protecting speaker identity is crucial for online voice applications, yet streaming speaker anonymization (SA) remains underexplored. Recent research has demonstrated that neural audio codecs (NACs) provide superior speaker feature disentanglement and linguistic fidelity. NACs can also be used with causal language models (LMs) to enhance linguistic fidelity and prompt control for streaming tasks. However, existing NAC-based online LM systems are designed for voice conversion (VC) rather than anonymization, lacking the techniques required for privacy protection. Building on these advances, we present Stream-Voice-Anon, which adapts modern causal LM-based NAC architectures specifically for streaming SA by integrating anonymization techniques. Our anonymization approach incorporates pseudo-speaker representation sampling, speaker embedding mixing, and diverse prompt selection strategies for LM conditioning that leverage the disentanglement properties of quantized content codes to prevent speaker information leakage. Additionally, we compare dynamic and fixed delay configurations to explore latency-privacy trade-offs in real-time scenarios. Under the VoicePrivacy 2024 Challenge protocol, Stream-Voice-Anon achieves substantial improvements in intelligibility (up to 46% relative WER reduction) and emotion preservation (up to 28% relative UAR improvement) compared to the previous state-of-the-art streaming method DarkStream, while maintaining comparable latency (180 ms vs. 200 ms) and privacy protection against lazy-informed attackers, though showing 15% relative degradation against semi-informed attackers.
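The pseudo-speaker sampling and speaker embedding mixing described in the abstract can be sketched as follows. This is an illustrative example, not the paper's implementation: the pool size, embedding dimension, number of mixed speakers, and the use of Dirichlet weights are all assumptions.

```python
import numpy as np

def sample_pseudo_speaker(pool: np.ndarray, k: int = 4, rng=None) -> np.ndarray:
    """Draw k distinct speaker embeddings from a pool and mix them into one
    pseudo-speaker embedding via a random convex combination, re-normalized
    to unit length so no single real speaker dominates the result."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(pool), size=k, replace=False)   # k distinct source speakers
    weights = rng.dirichlet(np.ones(k))                  # convex mixing weights (sum to 1)
    mixed = weights @ pool[idx]                          # weighted average embedding
    return mixed / np.linalg.norm(mixed)                 # unit-norm pseudo-speaker

# Toy pool of 100 unit-norm 192-dimensional speaker embeddings
rng = np.random.default_rng(0)
pool = rng.standard_normal((100, 192))
pool /= np.linalg.norm(pool, axis=1, keepdims=True)

pseudo = sample_pseudo_speaker(pool, k=4, rng=1)
print(pseudo.shape)  # (192,)
```

Mixing several real embeddings (rather than picking one) is one common way to keep the pseudo-speaker off the manifold of any individual enrolled speaker; the exact strategy used by Stream-Voice-Anon is described in the paper itself.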
Related papers
- WavInWav: Time-domain Speech Hiding via Invertible Neural Network [78.85443308774484]
Previous audio hiding methods often result in unsatisfactory quality when recovering secret audio. We use a flow-based invertible neural network to establish a direct link between stego audio, cover audio, and secret audio. We also add an encryption technique to protect the hidden data from unauthorized access.
arXiv Detail & Related papers (2025-10-03T11:36:16Z) - VoxGuard: Evaluating User and Attribute Privacy in Speech via Membership Inference Attacks [51.68795949691009]
We introduce VoxGuard, a framework grounded in differential privacy and membership inference. For attributes, we show that simple transparent attacks recover gender and accent with near-perfect accuracy even after anonymization. Our results demonstrate that EER substantially underestimates leakage, highlighting the need for low-FPR evaluation.
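The VoxGuard summary's point, that EER can understate leakage relative to a low-FPR operating point, can be illustrated with a small sketch. The score distributions and the 1% FPR target below are hypothetical, chosen only to show how the two metrics are computed from the same attack scores.

```python
import numpy as np

def roc_points(scores_pos, scores_neg):
    """Sweep a decision threshold over all observed scores; return FPR/TPR arrays."""
    thresholds = np.sort(np.concatenate([scores_pos, scores_neg]))[::-1]
    fpr = np.array([(scores_neg >= t).mean() for t in thresholds])
    tpr = np.array([(scores_pos >= t).mean() for t in thresholds])
    return fpr, tpr

def eer(fpr, tpr):
    """Equal error rate: operating point where FPR equals FNR (= 1 - TPR)."""
    fnr = 1.0 - tpr
    i = np.argmin(np.abs(fpr - fnr))
    return (fpr[i] + fnr[i]) / 2.0

def tpr_at_fpr(fpr, tpr, target=0.01):
    """True-positive rate achievable while keeping FPR at or below `target`."""
    mask = fpr <= target
    return tpr[mask].max() if mask.any() else 0.0

# Synthetic membership-inference scores: members score higher on average
gen = np.random.default_rng(0)
member_scores = gen.normal(1.0, 1.0, 1000)       # positives (members)
nonmember_scores = gen.normal(0.0, 1.0, 1000)    # negatives (non-members)

fpr, tpr = roc_points(member_scores, nonmember_scores)
print(f"EER={eer(fpr, tpr):.3f}  TPR@1%FPR={tpr_at_fpr(fpr, tpr):.3f}")
```

For these overlapping Gaussians the EER point sits near 30%, while the TPR at 1% FPR is far below the TPR at the EER point, which is why low-FPR reporting gives a more conservative picture of attack success.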
arXiv Detail & Related papers (2025-09-22T20:57:48Z) - DarkStream: real-time speech anonymization with low latency [5.872253202878362]
We propose DarkStream, a streaming speech synthesis model for real-time speaker anonymization. DarkStream combines a causal waveform encoder, a short lookahead buffer, and transformer-based contextual layers. DarkStream anonymizes speaker identity by injecting a GAN-generated pseudo-speaker embedding into linguistic features from the content encoder.
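The injection step in the DarkStream summary, conditioning per-frame linguistic features on a single pseudo-speaker embedding, is commonly done by tiling the embedding along the time axis and concatenating. The sketch below assumes this concatenation variant and hypothetical feature sizes; the paper's actual conditioning mechanism may differ.

```python
import numpy as np

def inject_pseudo_speaker(content: np.ndarray, pseudo_emb: np.ndarray) -> np.ndarray:
    """Tile a pseudo-speaker embedding across all time frames and concatenate
    it with the per-frame linguistic (content) features, producing the
    conditioning input for a streaming synthesis decoder."""
    num_frames = content.shape[0]
    tiled = np.broadcast_to(pseudo_emb, (num_frames, pseudo_emb.shape[-1]))
    return np.concatenate([content, tiled], axis=-1)   # shape (T, C + D)

content = np.zeros((50, 256))   # 50 frames of 256-dim content features (hypothetical)
pseudo = np.ones(64)            # hypothetical 64-dim pseudo-speaker embedding
cond = inject_pseudo_speaker(content, pseudo)
print(cond.shape)  # (50, 320)
```

Because the embedding is constant over time, the decoder receives identical speaker conditioning at every frame, which is what allows a causal, frame-by-frame model to keep the anonymized voice consistent within an utterance.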
arXiv Detail & Related papers (2025-09-04T21:30:25Z) - Quantum-Inspired Audio Unlearning: Towards Privacy-Preserving Voice Biometrics [44.60499998155848]
QPAudioEraser is a quantum-inspired audio unlearning framework. It consistently surpasses conventional baselines across single-class, multi-class, sequential, and accent-level erasure scenarios.
arXiv Detail & Related papers (2025-07-29T20:12:24Z) - Whispering Under the Eaves: Protecting User Privacy Against Commercial and LLM-powered Automatic Speech Recognition Systems [20.45938874279563]
We propose a novel framework, AudioShield, to protect speech against automatic speech recognition systems. By transferring the perturbations to the latent space, the audio quality is preserved to a large extent. AudioShield shows high effectiveness in real-time end-to-end scenarios, and demonstrates strong resilience against adaptive countermeasures.
arXiv Detail & Related papers (2025-04-01T14:49:39Z) - End-to-end streaming model for low-latency speech anonymization [11.098498920630782]
We propose a streaming model that achieves speaker anonymization with low latency.
The system is trained in an end-to-end autoencoder fashion using a lightweight content encoder.
We present evaluation results from two implementations of our system.
arXiv Detail & Related papers (2024-06-13T16:15:53Z) - Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z) - WNARS: WFST based Non-autoregressive Streaming End-to-End Speech Recognition [59.975078145303605]
We propose a novel framework, namely WNARS, using hybrid CTC-attention AED models and weighted finite-state transducers.
On the AISHELL-1 task, our WNARS achieves a character error rate of 5.22% with 640ms latency, to the best of our knowledge, which is the state-of-the-art performance for online ASR.
arXiv Detail & Related papers (2021-04-08T07:56:03Z) - Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)