Automatic Voice Identification after Speech Resynthesis using PPG
- URL: http://arxiv.org/abs/2408.02712v1
- Date: Mon, 5 Aug 2024 13:59:40 GMT
- Title: Automatic Voice Identification after Speech Resynthesis using PPG
- Authors: Thibault Gaudier, Marie Tahon, Anthony Larcher, Yannick Estève
- Abstract summary: Speech resynthesis is a generic task in which we synthesize audio from another audio signal given as input.
This paper presents a PPG-based speech resynthesis system.
A perceptual evaluation shows that it produces acceptable audio quality.
- Score: 13.041006302302808
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech resynthesis is a generic task in which we synthesize audio from another audio signal given as input; it finds applications for media monitors and journalists. Among the different tasks addressed by speech resynthesis, voice conversion preserves the linguistic information while modifying the identity of the speaker, and speech edition preserves the identity of the speaker while some words are modified. In both cases, speaker and phonetic content must be disentangled in intermediate representations. Phonetic PosteriorGrams (PPG) are a frame-level probabilistic representation of phonemes, and are usually considered speaker-independent. This paper presents a PPG-based speech resynthesis system. A perceptual evaluation shows that it produces acceptable audio quality. We then demonstrate that an automatic speaker verification model is not able to recover the source speaker after resynthesis with PPG, even when the model is trained on synthetic data.
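The pipeline described in the abstract can be illustrated with a short sketch. The code below is a minimal, hypothetical illustration, not the paper's architecture: an assumed acoustic model turns source frames into per-frame phoneme posteriors (the PPG), and an assumed decoder reconstructs mel frames from the PPG plus a target-speaker embedding, so the only speaker information available to the synthesizer is that embedding.

```python
# Minimal sketch of a PPG-based resynthesis pipeline (illustrative only; the
# acoustic model, decoder, and dimensions below are hypothetical, not the
# architecture described in the paper).
import torch
import torch.nn as nn

N_PHONEMES = 40      # size of the phoneme inventory (assumed)
N_MELS = 80          # mel-spectrogram bins (assumed)
SPK_DIM = 192        # speaker-embedding dimension (assumed)

class ToyAcousticModel(nn.Module):
    """Maps acoustic frames to per-frame phoneme logits."""
    def __init__(self, feat_dim=N_MELS):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, N_PHONEMES))

    def forward(self, feats):                      # feats: (T, feat_dim)
        return self.net(feats)                     # logits: (T, N_PHONEMES)

class ToyDecoder(nn.Module):
    """Reconstructs mel frames from a PPG plus a target-speaker embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(N_PHONEMES + SPK_DIM, 256), nn.ReLU(),
                                 nn.Linear(256, N_MELS))

    def forward(self, ppg, spk_emb):               # ppg: (T, N_PHONEMES)
        spk = spk_emb.expand(ppg.size(0), -1)      # broadcast speaker embedding to every frame
        return self.net(torch.cat([ppg, spk], dim=-1))   # mels: (T, N_MELS)

# Random stand-ins for source-utterance features and a target-speaker embedding.
src_feats = torch.randn(200, N_MELS)               # 200 frames of source audio
target_spk = torch.randn(SPK_DIM)

acoustic_model, decoder = ToyAcousticModel(), ToyDecoder()

# A PPG is the frame-level posterior distribution over phonemes:
# softmax over the acoustic model's logits, so each row sums to 1.
ppg = torch.softmax(acoustic_model(src_feats), dim=-1)   # (200, N_PHONEMES)

# Resynthesis: the decoder sees only the PPG (phonetic content) and the
# target-speaker embedding, so source-speaker identity should not leak through.
mels = decoder(ppg, target_spk)                    # (200, N_MELS)
print(ppg.shape, mels.shape)
```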
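The abstract's second claim, that an automatic speaker verification (ASV) model cannot recover the source speaker from the resynthesized audio, is typically tested by comparing speaker embeddings with a cosine score. A minimal sketch of such a trial, with placeholder embeddings and an illustrative threshold (both assumptions, not the paper's setup):

```python
# Sketch of an ASV check used to test whether the source speaker can be
# recovered after resynthesis: compare speaker embeddings with cosine
# similarity against a decision threshold.
import torch
import torch.nn.functional as F

def cosine_score(emb_a: torch.Tensor, emb_b: torch.Tensor) -> float:
    """Cosine similarity between two speaker embeddings."""
    return F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()

# Stand-ins for embeddings produced by an ASV model (e.g., an x-vector-style
# network); in practice these come from enrolment and test utterances.
emb_source = torch.randn(192)       # embedding of the original source speaker
emb_resynth = torch.randn(192)      # embedding of the PPG-resynthesized audio

THRESHOLD = 0.5                      # illustrative operating point, not from the paper
score = cosine_score(emb_source, emb_resynth)
verified = score >= THRESHOLD
print(f"score={score:.3f}, same-speaker decision={verified}")
# The paper's finding: for PPG-based resynthesis, such trials fail to link the
# output back to the source speaker, even when the ASV model is trained on
# synthetic data.
```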
Related papers
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that can convert many accents into a native accent, overcoming these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z) - SelfVC: Voice Conversion With Iterative Refinement using Self Transformations [42.97689861071184]
SelfVC is a training strategy to improve a voice conversion model with self-synthesized examples.
We develop techniques to derive prosodic information from the audio signal and SSL representations to train predictive submodules in the synthesis model.
Our framework is trained without any text and achieves state-of-the-art results in zero-shot voice conversion on metrics evaluating naturalness, speaker similarity, and intelligibility of synthesized audio.
arXiv Detail & Related papers (2023-10-14T19:51:17Z) - Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models [64.14812728562596]
We present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner.
We can easily generate face videos that articulate the provided textual sentences.
arXiv Detail & Related papers (2023-06-28T08:22:53Z) - ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations [12.20522794248598]
We propose a zero-shot voice conversion method using speech representations trained with self-supervised learning.
We develop a multi-task model to decompose a speech utterance into features such as linguistic content, speaker characteristics, and speaking style.
Next, we develop a synthesis model with pitch and duration predictors that can effectively reconstruct the speech signal from its representation.
arXiv Detail & Related papers (2023-02-16T08:10:41Z) - A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z) - Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos [54.08224321456871]
The system is designed to combine multiple component models and produces a video of the original speaker speaking in the target language.
The pipeline starts with automatic speech recognition including emphasis detection, followed by a translation model.
The resulting synthetic voice is then mapped back to the original speaker's voice using a voice conversion model.
arXiv Detail & Related papers (2022-06-09T14:15:37Z) - On Prosody Modeling for ASR+TTS based Voice Conversion [82.65378387724641]
In voice conversion, an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents.
Such a paradigm, referred to as ASR+TTS, overlooks the modeling of prosody, which plays an important role in speech naturalness and conversion similarity.
We propose to directly predict prosody from the linguistic representation in a target-speaker-dependent manner, referred to as target text prediction (TTP); a minimal sketch of such a prosody predictor appears after this list.
arXiv Detail & Related papers (2021-07-20T13:30:23Z) - Speech Resynthesis from Discrete Disentangled Self-Supervised Representations [49.48053138928408]
We propose using self-supervised discrete representations for the task of speech resynthesis.
We extract low-bitrate representations for speech content, prosodic information, and speaker identity.
Using the obtained representations, we can get to a rate of 365 bits per second while providing better speech quality than the baseline methods.
arXiv Detail & Related papers (2021-04-01T09:20:33Z) - From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint [11.982748481062542]
This paper presents a system involving a feedback constraint for multispeaker speech synthesis.
We enhance knowledge transfer from speaker verification to speech synthesis by engaging the speaker verification network as a feedback constraint.
The model is trained and evaluated on publicly available datasets.
arXiv Detail & Related papers (2020-05-10T06:11:37Z)
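Several of the systems listed above (SelfVC, ACE-VC, and the TTP approach) predict prosody directly from a linguistic representation, conditioned on the target speaker, rather than copying it from the source. A minimal sketch of such a predictor, assuming per-frame linguistic features and a speaker embedding as inputs, with hypothetical layer sizes:

```python
# Toy prosody predictor: maps per-frame linguistic features plus a target-speaker
# embedding to log-F0 and a log duration factor. Purely illustrative; the layer
# sizes and inputs are assumptions, not any of the cited papers' architectures.
import torch
import torch.nn as nn

LING_DIM = 256   # dimension of per-frame linguistic features (assumed)
SPK_DIM = 192    # speaker-embedding dimension (assumed)

class ToyProsodyPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LING_DIM + SPK_DIM, 128), nn.ReLU(),
            nn.Linear(128, 2),                  # [log-F0, log duration factor] per frame
        )

    def forward(self, ling, spk_emb):           # ling: (T, LING_DIM)
        spk = spk_emb.expand(ling.size(0), -1)  # condition every frame on the target speaker
        out = self.net(torch.cat([ling, spk], dim=-1))
        return out[:, 0], out[:, 1]             # log-F0, log duration factor

# Random stand-ins for linguistic features (e.g., PPGs or SSL units) and a speaker embedding.
ling = torch.randn(150, LING_DIM)
spk_emb = torch.randn(SPK_DIM)
log_f0, log_dur = ToyProsodyPredictor()(ling, spk_emb)
print(log_f0.shape, log_dur.shape)              # torch.Size([150]) torch.Size([150])
```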