End-to-end streaming model for low-latency speech anonymization
- URL: http://arxiv.org/abs/2406.09277v2
- Date: Thu, 31 Oct 2024 20:45:16 GMT
- Title: End-to-end streaming model for low-latency speech anonymization
- Authors: Waris Quamer, Ricardo Gutierrez-Osuna
- Abstract summary: We propose a streaming model that achieves speaker anonymization with low latency.
The system is trained in an end-to-end autoencoder fashion using a lightweight content encoder.
We present evaluation results from two implementations of our system.
- Score: 11.098498920630782
- Abstract: Speaker anonymization aims to conceal cues to speaker identity while preserving linguistic content. Current machine-learning-based approaches require substantial computational resources, hindering real-time streaming applications. To address these concerns, we propose a streaming model that achieves speaker anonymization with low latency. The system is trained in an end-to-end autoencoder fashion using a lightweight content encoder that extracts HuBERT-like information, a pretrained speaker encoder that extracts speaker identity, and a variance encoder that injects pitch and energy information. These three disentangled representations are fed to a decoder that re-synthesizes the speech signal. We present evaluation results from two implementations of our system: a full model that achieves a latency of 230ms, and a lite version (0.1x in size) that further reduces latency to 66ms while maintaining state-of-the-art performance in naturalness, intelligibility, and privacy preservation.
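The abstract's three-branch design lends itself to a compact sketch. Below is a minimal PyTorch illustration of disentangled content, speaker, and variance representations being concatenated and decoded; all layer choices, dimensions, and names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the three-branch anonymization autoencoder described in
# the abstract. All sizes and layers are illustrative assumptions.
import torch
import torch.nn as nn

class AnonymizingAutoencoder(nn.Module):
    def __init__(self, n_mels=80, content_dim=256, speaker_dim=192):
        super().__init__()
        # Lightweight content encoder: extracts HuBERT-like content features.
        self.content_enc = nn.Sequential(
            nn.Conv1d(n_mels, content_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(content_dim, content_dim, kernel_size=5, padding=2),
        )
        # Decoder consumes content + broadcast speaker embedding + pitch/energy.
        self.decoder = nn.Sequential(
            nn.Conv1d(content_dim + speaker_dim + 2, 512, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(512, n_mels, kernel_size=5, padding=2),  # re-synthesized frames
        )

    def forward(self, mel, speaker_emb, pitch, energy):
        # mel: (B, n_mels, T); speaker_emb: (B, speaker_dim) from a pretrained
        # speaker encoder; pitch, energy: (B, T) from the variance encoder.
        content = self.content_enc(mel)                        # (B, C, T)
        spk = speaker_emb.unsqueeze(-1).expand(-1, -1, mel.size(-1))
        var = torch.stack([pitch, energy], dim=1)              # (B, 2, T)
        return self.decoder(torch.cat([content, spk, var], dim=1))

# Anonymization amounts to swapping the source speaker embedding for a
# pseudo-speaker embedding before decoding.
model = AnonymizingAutoencoder()
out = model(torch.randn(1, 80, 100), torch.randn(1, 192),
            torch.randn(1, 100), torch.randn(1, 100))
print(out.shape)  # torch.Size([1, 80, 100])
```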
Related papers
- Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech Inpainting [14.402357651227003]
We investigate the use of a speech SSL model for speech inpainting, that is, reconstructing a missing portion of a speech signal from its surrounding context.
To that end, we combine an SSL encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, playing the role of a decoder.
arXiv Detail & Related papers (2024-05-30T14:41:39Z)
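A rough sketch of the encoder side of such a pipeline, using torchaudio's pretrained HuBERT bundle; the HiFiGAN decoder is left as a stub since it requires a trained vocoder checkpoint, and the masking scheme is illustrative.

```python
# Extract SSL context features around a missing region, as an inpainting
# front end. The vocoder stub stands in for a trained HiFiGAN decoder.
import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE
hubert = bundle.get_model().eval()

wav = torch.randn(1, bundle.sample_rate * 2)   # 2 s of (placeholder) audio
masked = wav.clone()
masked[:, 8000:16000] = 0.0                    # zero out the missing portion

with torch.no_grad():
    features, _ = hubert.extract_features(masked)
context = features[-1]                         # (1, frames, 768), top layer

# vocoder = ...                                # trained HiFiGAN decoder
# inpainted = vocoder(context)                 # SSL features -> waveform
print(context.shape)
```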
- Self-Supervised Learning for Speech Enhancement through Synthesis [5.924928860260821]
We propose a denoising vocoder (DeVo) approach, where a vocoder accepts noisy representations and learns to directly synthesize clean speech.
We demonstrate a causal version capable of running on streaming audio with 10ms latency and minimal performance degradation.
arXiv Detail & Related papers (2022-11-04T16:06:56Z)
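The ingredient that makes such low-latency streaming possible is causal convolution: each output frame depends only on past samples, so audio can be synthesized chunk by chunk as it arrives. A minimal sketch with illustrative sizes:

```python
# Causal 1-D convolution: left-only padding ensures output at time t never
# looks at samples after t, which is what permits streaming operation.
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    def __init__(self, ch_in, ch_out, kernel_size):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(ch_in, ch_out, kernel_size)

    def forward(self, x):
        # Pad on the left only; output length equals input length.
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

layer = CausalConv1d(1, 16, kernel_size=9)
chunk = torch.randn(1, 1, 160)        # one 10ms chunk at 16 kHz
print(layer(chunk).shape)             # torch.Size([1, 16, 160])
```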
- High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
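The sketch below computes the multi-scale magnitude spectrograms such a multiscale spectrogram adversary would operate on; the discriminator itself is omitted and the FFT sizes are assumptions.

```python
# Multi-scale magnitude spectrograms: the same waveform analyzed at several
# time-frequency resolutions, as consumed by a multiscale spectrogram critic.
import torch

def multiscale_spectrograms(wav, n_ffts=(512, 1024, 2048)):
    # wav: (B, T). Returns one magnitude spectrogram per FFT size.
    specs = []
    for n_fft in n_ffts:
        s = torch.stft(wav, n_fft=n_fft, hop_length=n_fft // 4,
                       window=torch.hann_window(n_fft), return_complex=True)
        specs.append(s.abs())          # (B, n_fft // 2 + 1, frames)
    return specs

for spec in multiscale_spectrograms(torch.randn(2, 16000)):
    print(spec.shape)
```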
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C reduces the word error rate (WER) by a relative 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
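"Pseudo codes" are discrete acoustic units; a common way to obtain them in HuBERT-style pipelines is k-means over frame-level features. A generic illustration, not the paper's exact recipe:

```python
# Derive discrete pseudo codes by clustering frame-level acoustic features.
import numpy as np
from sklearn.cluster import KMeans

frames = np.random.randn(1000, 39)             # e.g. MFCC frames, (T, dim)
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(frames)
pseudo_codes = kmeans.predict(frames)          # one discrete unit per frame
print(pseudo_codes[:10])
```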
- Dynamic Latency for CTC-Based Streaming Automatic Speech Recognition With Emformer [0.4588028371034407]
A frame-level model using efficient augmented-memory transformer blocks and a dynamic latency training method is employed for streaming automatic speech recognition.
With an average latency of 640ms, our model achieves a relative WER reduction of 6.4% on test-clean and 3.0% on test-other versus the truncated chunk-wise Transformer.
arXiv Detail & Related papers (2022-03-29T14:31:06Z)
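Dynamic latency training can be pictured as sampling a different chunk size at each training step, so a single model learns to run at several latencies. A toy sketch with illustrative chunk sizes:

```python
# Chunk-wise attention mask with a chunk size resampled every step: frames
# may attend within their own chunk and to all earlier chunks.
import random
import torch

def chunk_mask(num_frames, chunk):
    idx = torch.arange(num_frames)
    # Query i may attend key j iff j's chunk is the same or earlier.
    return (idx.unsqueeze(1) // chunk) >= (idx.unsqueeze(0) // chunk)

for step in range(3):
    chunk = random.choice([8, 16, 32, 64])   # sampled latency for this step
    mask = chunk_mask(100, chunk)            # (100, 100) boolean mask
    print(step, chunk, mask.float().mean().item())
```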
- SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z)
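A toy residual vector quantizer in the spirit of SoundStream's RVQ: each stage quantizes the residual left by the previous stage. The codebooks here are random for illustration; in practice they are learned.

```python
# Residual vector quantization: successive codebooks refine the residual.
import torch

def rvq(x, codebooks):
    # x: (B, D); codebooks: list of (K, D) tensors.
    residual, quantized, codes = x, torch.zeros_like(x), []
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=1)  # nearest code
        chosen = cb[idx]                               # (B, D)
        quantized = quantized + chosen
        residual = residual - chosen
        codes.append(idx)
    return codes, quantized

x = torch.randn(4, 8)
codebooks = [torch.randn(16, 8) for _ in range(3)]     # 3 stages, 16 codes
codes, xq = rvq(x, codebooks)
print((x - xq).norm().item())   # reconstruction error shrinks with stages
```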
- WNARS: WFST based Non-autoregressive Streaming End-to-End Speech Recognition [59.975078145303605]
We propose a novel framework, namely WNARS, using hybrid CTC-attention AED models and weighted finite-state transducers.
On the AISHELL-1 task, our WNARS achieves a character error rate of 5.22% with 640ms latency, which is, to the best of our knowledge, state-of-the-art performance for online ASR.
arXiv Detail & Related papers (2021-04-08T07:56:03Z)
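The CTC half of such a hybrid CTC-attention objective can be set up directly with PyTorch's built-in loss; vocabulary size and sequence lengths below are illustrative.

```python
# CTC loss over (frames, batch, vocab) log-probabilities, blank id 0.
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)
T, B, C = 50, 2, 30                                   # frames, batch, vocab
log_probs = torch.randn(T, B, C).log_softmax(dim=-1)
targets = torch.randint(1, C, (B, 10))                # labels, blank excluded
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 10, dtype=torch.long)
print(ctc(log_probs, targets, input_lengths, target_lengths).item())
```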
- Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition [26.530909772863417]
We build a compact and streaming speech recognition system on top of the end-to-end neural transducer architecture with attention-based modules augmented with convolution.
The proposed system equips end-to-end models with streaming capability and uses augmented memory to reduce the large footprint of streaming attention-based models.
On the LibriSpeech dataset, our proposed system achieves word error rates of 2.7% on test-clean and 5.8% on test-other, to the best of our knowledge the lowest among streaming approaches reported so far.
arXiv Detail & Related papers (2020-11-03T00:43:58Z)
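A toy rendering of the augmented-memory idea: each processed chunk is summarized into a single memory vector, so later chunks attend over a compact memory bank rather than all past frames. Dimensions are illustrative and this is not the paper's exact mechanism.

```python
# Streamed attention with a growing-but-compact memory bank: one summary
# vector per past chunk instead of every past frame.
import torch
import torch.nn.functional as F

def attend(q, kv):
    # Single-head scaled dot-product attention; q: (Tq, D), kv: (Tk, D).
    scores = q @ kv.t() / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ kv

memory, D = [], 16
for step in range(4):                          # a stream of 4 chunks
    chunk = torch.randn(10, D)                 # 10 new frames
    keys = torch.cat([torch.stack(memory), chunk]) if memory else chunk
    out = attend(chunk, keys)                  # footprint: +1 vector/chunk
    memory.append(chunk.mean(dim=0))           # summarize chunk into memory
print(out.shape, len(memory))
```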
- VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition [60.462770498366524]
We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user.
We show that such a model can be quantized as an 8-bit integer model and run in real time.
arXiv Detail & Related papers (2020-09-09T14:26:56Z)
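Post-training dynamic quantization in PyTorch illustrates the 8-bit integer conversion mentioned here, applied to a toy model rather than VoiceFilter-Lite itself:

```python
# Convert Linear weights to int8; activations are quantized on the fly.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(257, 128), nn.ReLU(), nn.Linear(128, 257))
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)
x = torch.randn(1, 257)                    # e.g. one spectral frame
print(quantized(x).shape)
```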
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER on the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
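Time-restricted self-attention caps algorithmic latency by letting each frame attend only to a bounded window of left and right context. A minimal mask sketch with illustrative window sizes:

```python
# Boolean attention mask allowing a fixed window of past and future frames.
import torch

def time_restricted_mask(num_frames, left=20, right=4):
    idx = torch.arange(num_frames)
    rel = idx.unsqueeze(0) - idx.unsqueeze(1)   # key index minus query index
    return (rel >= -left) & (rel <= right)      # (T, T) allowed positions

mask = time_restricted_mask(100)
print(mask.shape, mask[50].nonzero().flatten()[[0, -1]])  # frames 30 and 54
```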
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.