Adversarial Representation Learning for Robust Privacy Preservation in
Audio
- URL: http://arxiv.org/abs/2305.00011v2
- Date: Wed, 3 Jan 2024 13:51:05 GMT
- Title: Adversarial Representation Learning for Robust Privacy Preservation in
Audio
- Authors: Shayan Gharib, Minh Tran, Diep Luong, Konstantinos Drossos, Tuomas
Virtanen
- Abstract summary: Sound event detection systems may inadvertently reveal sensitive information about users or their surroundings.
We propose a novel adversarial training method for learning representations of audio recordings.
The proposed method is evaluated against a baseline approach with no privacy measures and a prior adversarial training method.
- Score: 11.409577482625053
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Sound event detection systems are widely used in various applications such as
surveillance and environmental monitoring where data is automatically
collected, processed, and sent to a cloud for sound recognition. However, this
process may inadvertently reveal sensitive information about users or their
surroundings, hence raising privacy concerns. In this study, we propose a novel
adversarial training method for learning representations of audio recordings
that effectively prevents the detection of speech activity from the latent
features of the recordings. The proposed method trains a model to generate
invariant latent representations of speech-containing audio recordings that
cannot be distinguished from non-speech recordings by a speech classifier. The
novelty of our work is in the optimization algorithm, where the speech
classifier's weights are regularly replaced with the weights of classifiers
trained in a supervised manner. This increases the discrimination power of the
speech classifier constantly during the adversarial training, motivating the
model to generate latent representations in which speech is not
distinguishable, even using new speech classifiers trained outside the
adversarial training loop. The proposed method is evaluated against a baseline
approach with no privacy measures and a prior adversarial training method,
demonstrating a significant reduction in privacy violations compared to the
baseline approach. Additionally, we show that the prior adversarial method is
practically ineffective for this purpose.
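A minimal sketch of the training loop the abstract describes, assuming a PyTorch setup. The architectures, the loss weighting LAMBDA, the replacement period REPLACE_EVERY, and the loader format are illustrative assumptions rather than the paper's actual configuration; only the structure of the loop, in particular step (3), follows the text above.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the paper's networks (all sizes are assumptions).
encoder = nn.Sequential(nn.Conv1d(1, 16, 9, stride=4), nn.ReLU(),
                        nn.AdaptiveAvgPool1d(64), nn.Flatten())  # -> 1024-d latent
sed_head = nn.Linear(1024, 10)   # utility task: sound event detection
adversary = nn.Linear(1024, 2)   # privacy adversary: speech vs. non-speech

opt_main = torch.optim.Adam(list(encoder.parameters()) +
                            list(sed_head.parameters()), lr=1e-4)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-4)
ce = nn.CrossEntropyLoss()
LAMBDA, REPLACE_EVERY = 1.0, 200  # hypothetical weighting and replacement period

def train_step(step, x, y_event, y_speech, sup_loader):
    """x: (B, 1, T) audio; y_event, y_speech: (B,) integer labels."""
    z = encoder(x)

    # (1) Encoder + SED head: minimize the utility loss while maximizing the
    # speech classifier's loss, so speech is not distinguishable in the latents.
    loss = ce(sed_head(z), y_event) - LAMBDA * ce(adversary(z), y_speech)
    opt_main.zero_grad(); loss.backward(); opt_main.step()

    # (2) Adversary: a standard supervised update on detached latents.
    adv_loss = ce(adversary(z.detach()), y_speech)
    opt_adv.zero_grad(); adv_loss.backward(); opt_adv.step()

    # (3) The paper's novelty: regularly replace the adversary's weights with
    # those of a classifier trained in a supervised manner against the current
    # frozen encoder, keeping its discrimination power high during training.
    if step > 0 and step % REPLACE_EVERY == 0:
        fresh = nn.Linear(1024, 2)
        opt_f = torch.optim.Adam(fresh.parameters(), lr=1e-3)
        for xb, _, sb in sup_loader:   # one pass shown; likely more in practice
            with torch.no_grad():
                zb = encoder(xb)
            l = ce(fresh(zb), sb)
            opt_f.zero_grad(); l.backward(); opt_f.step()
        adversary.load_state_dict(fresh.state_dict())
```

The privacy evaluation the abstract mentions follows the same recipe as step (3), but held out from training: a new speech classifier is trained from scratch on the frozen latents, and its accuracy measures how much speech information still leaks.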
Related papers
- Representation Learning for Audio Privacy Preservation using Source
Separation and Robust Adversarial Learning [16.1694012177079]
We propose the integration of two commonly used approaches in privacy preservation: source separation and adversarial representation learning.
The proposed system learns latent representations of audio recordings such that speech and non-speech recordings cannot be differentiated.
arXiv Detail & Related papers (2023-08-09T13:50:00Z)
- Self-supervised Fine-tuning for Improved Content Representations by
Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- Improving the Intent Classification accuracy in Noisy Environment [9.447108578893639]
In this paper, we investigate how environmental noise and related noise reduction techniques affect the intent classification task with end-to-end neural models.
For this task, the use of speech enhancement greatly improves the classification accuracy in noisy conditions.
arXiv Detail & Related papers (2023-03-12T06:11:44Z)
- SPADE: Self-supervised Pretraining for Acoustic DisEntanglement [2.294014185517203]
We introduce a self-supervised approach to disentangle room acoustics from speech.
Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce.
arXiv Detail & Related papers (2023-02-03T01:36:38Z)
- Towards Disentangled Speech Representations [65.7834494783044]
We construct a representation learning task based on joint modeling of ASR and TTS.
We seek to learn a representation of audio that disentangles that part of the speech signal that is relevant to transcription from that part which is not.
We show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task.
arXiv Detail & Related papers (2022-08-28T10:03:55Z)
- Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models [95.97506031821217]
We present a novel way of conditioning a pretrained denoising diffusion speech model to produce speech in the voice of a novel person unseen during training.
The method requires a short (3 seconds) sample from the target person, and generation is steered at inference time, without any training steps.
arXiv Detail & Related papers (2022-06-05T19:45:29Z)
- On monoaural speech enhancement for automatic recognition of real noisy
speech using mixture invariant training [33.79711018198589]
We extend the existing mixture invariant training criterion to exploit both unpaired clean speech and real noisy data.
It is found that the unpaired clean speech is crucial to improve quality of separated speech from real noisy speech.
The proposed method also performs remixing of processed and unprocessed signals to alleviate the processing artifacts; the base criterion it extends is sketched after this entry.
arXiv Detail & Related papers (2022-05-03T19:37:58Z)
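For context on the criterion the entry above extends: mixture invariant training (MixIT) feeds the sum of two reference mixtures to a separation model and scores the best assignment of the estimated sources back to the two mixtures. Below is a minimal sketch of the base criterion only; plain MSE stands in for the negative-SNR loss usually used, and the entry's actual contributions (exploiting unpaired clean speech and the remixing step) are not shown.

```python
import itertools

import torch

def mixit_loss(x1, x2, est_sources):
    """Base mixture invariant training (MixIT) loss.

    x1, x2:      (B, T) reference mixtures; the separator was fed x1 + x2.
    est_sources: (B, M, T) estimated sources, each assigned to x1 or x2.
    """
    B, M, T = est_sources.shape
    best = None
    # Exhaustively search the 2^M binary assignments of sources to mixtures.
    for bits in itertools.product((0, 1), repeat=M):
        a = torch.tensor(bits, dtype=est_sources.dtype,
                         device=est_sources.device)
        y2 = (est_sources * a.view(1, M, 1)).sum(dim=1)  # sources sent to x2
        y1 = est_sources.sum(dim=1) - y2                 # the rest rebuild x1
        per_ex = ((y1 - x1) ** 2).mean(dim=1) + ((y2 - x2) ** 2).mean(dim=1)
        best = per_ex if best is None else torch.minimum(best, per_ex)
    return best.mean()
```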
- Improving Noise Robustness of Contrastive Speech Representation Learning with
Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Personalized Speech Enhancement through Self-Supervised Data
Augmentation and Purification [24.596224536399326]
We train an SNR predictor model to estimate the frame-by-frame SNR of the pseudo-sources.
We empirically show that the proposed data purification step improves the usability of the speaker-specific noisy data.
arXiv Detail & Related papers (2021-04-05T17:17:55Z)
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and
Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
More recently, deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.