Speaker Anonymization with Phonetic Intermediate Representations
- URL: http://arxiv.org/abs/2207.04834v1
- Date: Mon, 11 Jul 2022 13:02:08 GMT
- Title: Speaker Anonymization with Phonetic Intermediate Representations
- Authors: Sarina Meyer, Florian Lux, Pavel Denisov, Julia Koch, Pascal Tilli,
Ngoc Thang Vu
- Abstract summary: We propose a speaker anonymization pipeline that generates speech conditioned on phonetic transcriptions and anonymized speaker embeddings.
Using phones as the intermediate representation ensures near-complete elimination of speaker identity information from the input.
- Score: 22.84840887071428
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this work, we propose a speaker anonymization pipeline that leverages high
quality automatic speech recognition and synthesis systems to generate speech
conditioned on phonetic transcriptions and anonymized speaker embeddings. Using
phones as the intermediate representation ensures near-complete elimination of
speaker identity information from the input while preserving the original
phonetic content as much as possible. Our experimental results on the LibriSpeech
and VCTK corpora reveal two key findings: 1) although automatic speech
recognition produces imperfect transcriptions, our neural speech synthesis
system can handle such errors, making our system feasible and robust, and 2)
combining speaker embeddings from different resources is beneficial and their
appropriate normalization is crucial. Overall, our final best system
significantly outperforms the baselines provided in the Voice Privacy Challenge
2020 in terms of privacy robustness against a lazy-informed attacker while
maintaining high intelligibility and naturalness of the anonymized speech.
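The abstract states that combining speaker embeddings from different resources helps and that appropriate normalization is crucial, but it does not say how either step is done. The following is a minimal sketch of one plausible reading, assuming the embeddings are concatenated and that each is L2-normalized first. Both choices, and the function names `l2_normalize` and `combine_embeddings`, are illustrative assumptions and not the authors' implementation.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        # An all-zero vector has no direction; return it unchanged.
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def combine_embeddings(emb_a: Sequence[float], emb_b: Sequence[float]) -> List[float]:
    """Concatenate two speaker embeddings after normalizing each separately.

    Concatenation is an assumption made for illustration; the abstract
    only says that embeddings from different resources are combined.
    """
    return l2_normalize(emb_a) + l2_normalize(emb_b)


# Two toy embeddings on very different scales.
combined = combine_embeddings([3.0, 4.0], [0.0, 50.0])
print(combined)  # [0.6, 0.8, 0.0, 1.0]
```

Without the per-embedding normalization, the second vector's larger magnitude would dominate any distance computed on the combined vector, which is one reason the scale of each source may matter.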
Related papers
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose the removal of reliance on a phoneme lexicon to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition [75.12948999653338]
We propose a novel multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR).
We employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data.
Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.
arXiv Detail & Related papers (2022-11-29T13:16:09Z)
- Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection [15.884911752869437]
We present a novel approach for synthetic speech detection, exploiting the combination of two high-level semantic properties of the human voice.
On one side, we focus on speaker identity cues and represent them as speaker embeddings extracted using a state-of-the-art method for the automatic speaker verification task.
On the other side, voice prosody, intended as variations in rhythm, pitch or accent in speech, is extracted through a specialized encoder.
arXiv Detail & Related papers (2022-10-31T11:03:03Z)
- Improving Self-Supervised Speech Representations by Disentangling Speakers [56.486084431528695]
Self-supervised learning in speech involves training a speech representation network on a large-scale unannotated speech corpus.
Disentangling speakers is very challenging, because removing the speaker information could easily result in a loss of content as well.
We propose a new SSL method that can achieve speaker disentanglement without severe loss of content.
arXiv Detail & Related papers (2022-04-20T04:56:14Z)
- Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices [15.136348385992047]
We train several voice conversion models using self-supervised speech representations.
Converted voices retain a low word error rate within 1% of the original voice.
Experiments on dysarthric speech data show that speech features relevant to articulation, prosody, phonation and phonology can be extracted from anonymized voices.
arXiv Detail & Related papers (2022-04-04T17:48:01Z)
- Differentially Private Speaker Anonymization [44.90119821614047]
Sharing real-world speech utterances is key to the training and deployment of voice-based services.
Speaker anonymization aims to remove speaker information from a speech utterance while leaving its linguistic and prosodic attributes intact.
We show that disentanglement is indeed not perfect: linguistic and prosodic attributes still contain speaker information.
arXiv Detail & Related papers (2022-02-23T23:20:30Z)
- Speaker De-identification System using Autoencoders and Adversarial Training [58.720142291102135]
We propose a speaker de-identification system based on adversarial training and autoencoders.
Experimental results show that combining adversarial learning and autoencoders increases the equal error rate of a speaker verification system.
arXiv Detail & Related papers (2020-11-09T19:22:05Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.