End-to-end streaming model for low-latency speech anonymization
- URL: http://arxiv.org/abs/2406.09277v2
- Date: Thu, 31 Oct 2024 20:45:16 GMT
- Title: End-to-end streaming model for low-latency speech anonymization
- Authors: Waris Quamer, Ricardo Gutierrez-Osuna
- Abstract summary: We propose a streaming model that achieves speaker anonymization with low latency.
The system is trained in an end-to-end autoencoder fashion using a lightweight content encoder.
We present evaluation results from two implementations of our system.
- Score: 11.098498920630782
- Abstract: Speaker anonymization aims to conceal cues to speaker identity while preserving linguistic content. Current machine-learning-based approaches require substantial computational resources, hindering real-time streaming applications. To address these concerns, we propose a streaming model that achieves speaker anonymization with low latency. The system is trained in an end-to-end autoencoder fashion using a lightweight content encoder that extracts HuBERT-like information, a pretrained speaker encoder that extracts speaker identity, and a variance encoder that injects pitch and energy information. These three disentangled representations are fed to a decoder that re-synthesizes the speech signal. We present evaluation results from two implementations of our system: a full model that achieves a latency of 230ms, and a lite version (0.1x in size) that further reduces latency to 66ms while maintaining state-of-the-art performance in naturalness, intelligibility, and privacy preservation.
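The abstract's three-branch design lends itself to a compact sketch. Below is a minimal PyTorch illustration of disentangled content, speaker, and variance representations being concatenated and decoded; all layer choices, dimensions, and names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the three-branch anonymization autoencoder described in
# the abstract. All sizes and layers are illustrative assumptions.
import torch
import torch.nn as nn

class AnonymizingAutoencoder(nn.Module):
    def __init__(self, n_mels=80, content_dim=256, speaker_dim=192):
        super().__init__()
        # Lightweight content encoder: extracts HuBERT-like content features.
        self.content_enc = nn.Sequential(
            nn.Conv1d(n_mels, content_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(content_dim, content_dim, kernel_size=5, padding=2),
        )
        # Decoder consumes content + broadcast speaker embedding + pitch/energy.
        self.decoder = nn.Sequential(
            nn.Conv1d(content_dim + speaker_dim + 2, 512, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(512, n_mels, kernel_size=5, padding=2),  # re-synthesized frames
        )

    def forward(self, mel, speaker_emb, pitch, energy):
        # mel: (B, n_mels, T); speaker_emb: (B, speaker_dim) from a pretrained
        # speaker encoder; pitch, energy: (B, T) from the variance encoder.
        content = self.content_enc(mel)                        # (B, C, T)
        spk = speaker_emb.unsqueeze(-1).expand(-1, -1, mel.size(-1))
        var = torch.stack([pitch, energy], dim=1)              # (B, 2, T)
        return self.decoder(torch.cat([content, spk, var], dim=1))

# Anonymization amounts to swapping the source speaker embedding for a
# pseudo-speaker embedding before decoding.
model = AnonymizingAutoencoder()
out = model(torch.randn(1, 80, 100), torch.randn(1, 192),
            torch.randn(1, 100), torch.randn(1, 100))
print(out.shape)  # torch.Size([1, 80, 100])
```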
Related papers
- Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech Inpainting [14.402357651227003]
We investigate the use of a speech SSL model for speech inpainting, that is, reconstructing a missing portion of a speech signal from its surrounding context.
To that end, we combine an SSL encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, playing the role of a decoder.
arXiv Detail & Related papers (2024-05-30T14:41:39Z)
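A rough sketch of the encoder side of such a pipeline, using torchaudio's pretrained HuBERT bundle; the HiFiGAN decoder is left as a stub since it requires a trained vocoder checkpoint, and the masking scheme is illustrative.

```python
# Extract SSL context features around a missing region, as an inpainting
# front end. The vocoder stub stands in for a trained HiFiGAN decoder.
import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE
hubert = bundle.get_model().eval()

wav = torch.randn(1, bundle.sample_rate * 2)   # 2 s of (placeholder) audio
masked = wav.clone()
masked[:, 8000:16000] = 0.0                    # zero out the missing portion

with torch.no_grad():
    features, _ = hubert.extract_features(masked)
context = features[-1]                         # (1, frames, 768), top layer

# vocoder = ...                                # trained HiFiGAN decoder
# inpainted = vocoder(context)                 # SSL features -> waveform
print(context.shape)
```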
- Self-Supervised Learning for Speech Enhancement through Synthesis [5.924928860260821]
We propose a denoising vocoder (DeVo) approach, where a vocoder accepts noisy representations and learns to directly synthesize clean speech.
We demonstrate a causal version capable of running on streaming audio with 10ms latency and minimal performance degradation.
arXiv Detail & Related papers (2022-11-04T16:06:56Z)
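The ingredient that makes such low-latency streaming possible is causal convolution: each output frame depends only on past samples, so audio can be synthesized chunk by chunk as it arrives. A minimal sketch with illustrative sizes:

```python
# Causal 1-D convolution: left-only padding ensures output at time t never
# looks at samples after t, which is what permits streaming operation.
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    def __init__(self, ch_in, ch_out, kernel_size):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(ch_in, ch_out, kernel_size)

    def forward(self, x):
        # Pad on the left only; output length equals input length.
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

layer = CausalConv1d(1, 16, kernel_size=9)
chunk = torch.randn(1, 1, 160)        # one 10ms chunk at 16 kHz
print(layer(chunk).shape)             # torch.Size([1, 16, 160])
```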
- High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
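The sketch below computes the multi-scale magnitude spectrograms such a multiscale spectrogram adversary would operate on; the discriminator itself is omitted and the FFT sizes are assumptions.

```python
# Multi-scale magnitude spectrograms: the same waveform analyzed at several
# time-frequency resolutions, as consumed by a multiscale spectrogram critic.
import torch

def multiscale_spectrograms(wav, n_ffts=(512, 1024, 2048)):
    # wav: (B, T). Returns one magnitude spectrogram per FFT size.
    specs = []
    for n_fft in n_ffts:
        s = torch.stft(wav, n_fft=n_fft, hop_length=n_fft // 4,
                       window=torch.hann_window(n_fft), return_complex=True)
        specs.append(s.abs())          # (B, n_fft // 2 + 1, frames)
    return specs

for spec in multiscale_spectrograms(torch.randn(2, 16000)):
    print(spec.shape)
```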
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C reduces the word error rate (WER) by a relative 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
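"Pseudo codes" are discrete acoustic units; a common way to obtain them in HuBERT-style pipelines is k-means over frame-level features. A generic illustration, not the paper's exact recipe:

```python
# Derive discrete pseudo codes by clustering frame-level acoustic features.
import numpy as np
from sklearn.cluster import KMeans

frames = np.random.randn(1000, 39)             # e.g. MFCC frames, (T, dim)
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(frames)
pseudo_codes = kmeans.predict(frames)          # one discrete unit per frame
print(pseudo_codes[:10])
```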
- Dynamic Latency for CTC-Based Streaming Automatic Speech Recognition With Emformer [0.4588028371034407]
A frame-level model using efficient augmented-memory transformer blocks and a dynamic latency training method is employed for streaming automatic speech recognition.
With an average latency of 640ms, our model achieves a relative WER reduction of 6.4% on test-clean and 3.0% on test-other versus the truncated chunk-wise Transformer.
arXiv Detail & Related papers (2022-03-29T14:31:06Z)
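Dynamic latency training can be pictured as sampling a different chunk size at each training step, so a single model learns to run at several latencies. A toy sketch with illustrative chunk sizes:

```python
# Chunk-wise attention mask with a chunk size resampled every step: frames
# may attend within their own chunk and to all earlier chunks.
import random
import torch

def chunk_mask(num_frames, chunk):
    idx = torch.arange(num_frames)
    # Query i may attend key j iff j's chunk is the same or earlier.
    return (idx.unsqueeze(1) // chunk) >= (idx.unsqueeze(0) // chunk)

for step in range(3):
    chunk = random.choice([8, 16, 32, 64])   # sampled latency for this step
    mask = chunk_mask(100, chunk)            # (100, 100) boolean mask
    print(step, chunk, mask.float().mean().item())
```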
- SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z)
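A toy residual vector quantizer in the spirit of SoundStream's RVQ: each stage quantizes the residual left by the previous stage. The codebooks here are random for illustration; in practice they are learned.

```python
# Residual vector quantization: successive codebooks refine the residual.
import torch

def rvq(x, codebooks):
    # x: (B, D); codebooks: list of (K, D) tensors.
    residual, quantized, codes = x, torch.zeros_like(x), []
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=1)  # nearest code
        chosen = cb[idx]                               # (B, D)
        quantized = quantized + chosen
        residual = residual - chosen
        codes.append(idx)
    return codes, quantized

x = torch.randn(4, 8)
codebooks = [torch.randn(16, 8) for _ in range(3)]     # 3 stages, 16 codes
codes, xq = rvq(x, codebooks)
print((x - xq).norm().item())   # reconstruction error shrinks with stages
```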
- WNARS: WFST based Non-autoregressive Streaming End-to-End Speech Recognition [59.975078145303605]
We propose a novel framework, namely WNARS, using hybrid CTC-attention AED models and weighted finite-state transducers.
On the AISHELL-1 task, our WNARS achieves a character error rate of 5.22% with 640ms latency, which is, to the best of our knowledge, state-of-the-art performance for online ASR.
arXiv Detail & Related papers (2021-04-08T07:56:03Z)
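The CTC half of such a hybrid CTC-attention objective can be set up directly with PyTorch's built-in loss; vocabulary size and sequence lengths below are illustrative.

```python
# CTC loss over (frames, batch, vocab) log-probabilities, blank id 0.
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)
T, B, C = 50, 2, 30                                   # frames, batch, vocab
log_probs = torch.randn(T, B, C).log_softmax(dim=-1)
targets = torch.randint(1, C, (B, 10))                # labels, blank excluded
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 10, dtype=torch.long)
print(ctc(log_probs, targets, input_lengths, target_lengths).item())
```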
- Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition [26.530909772863417]
We build a compact and streaming speech recognition system on top of the end-to-end neural transducer architecture with attention-based modules augmented with convolution.
The proposed system equips end-to-end models with streaming capability and uses augmented memory to reduce the large footprint of streaming attention-based models.
On the LibriSpeech dataset, our proposed system achieves word error rates of 2.7% on test-clean and 5.8% on test-other, to the best of our knowledge the lowest among streaming approaches reported so far.
arXiv Detail & Related papers (2020-11-03T00:43:58Z)
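A toy rendering of the augmented-memory idea: each processed chunk is summarized into a single memory vector, so later chunks attend over a compact memory bank rather than all past frames. Dimensions are illustrative and this is not the paper's exact mechanism.

```python
# Streamed attention with a growing-but-compact memory bank: one summary
# vector per past chunk instead of every past frame.
import torch
import torch.nn.functional as F

def attend(q, kv):
    # Single-head scaled dot-product attention; q: (Tq, D), kv: (Tk, D).
    scores = q @ kv.t() / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ kv

memory, D = [], 16
for step in range(4):                          # a stream of 4 chunks
    chunk = torch.randn(10, D)                 # 10 new frames
    keys = torch.cat([torch.stack(memory), chunk]) if memory else chunk
    out = attend(chunk, keys)                  # footprint: +1 vector/chunk
    memory.append(chunk.mean(dim=0))           # summarize chunk into memory
print(out.shape, len(memory))
```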
- VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition [60.462770498366524]
We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user.
We show that such a model can be quantized as an 8-bit integer model and run in real time.
arXiv Detail & Related papers (2020-09-09T14:26:56Z)
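Post-training dynamic quantization in PyTorch illustrates the 8-bit integer conversion mentioned here, applied to a toy model rather than VoiceFilter-Lite itself:

```python
# Convert Linear weights to int8; activations are quantized on the fly.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(257, 128), nn.ReLU(), nn.Linear(128, 257))
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)
x = torch.randn(1, 257)                    # e.g. one spectral frame
print(quantized(x).shape)
```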
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER on the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
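Time-restricted self-attention caps algorithmic latency by letting each frame attend only to a bounded window of left and right context. A minimal mask sketch with illustrative window sizes:

```python
# Boolean attention mask allowing a fixed window of past and future frames.
import torch

def time_restricted_mask(num_frames, left=20, right=4):
    idx = torch.arange(num_frames)
    rel = idx.unsqueeze(0) - idx.unsqueeze(1)   # key index minus query index
    return (rel >= -left) & (rel <= right)      # (T, T) allowed positions

mask = time_restricted_mask(100)
print(mask.shape, mask[50].nonzero().flatten()[[0, -1]])  # frames 30 and 54
```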
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.