Who is Authentic Speaker
- URL: http://arxiv.org/abs/2405.00248v1
- Date: Tue, 30 Apr 2024 23:41:00 GMT
- Title: Who is Authentic Speaker
- Authors: Qiang Huang
- Abstract summary: Voice conversion can pose potential social issues when manipulated voices are employed for deceptive purposes.
Identifying the real speakers behind converted voices is a major challenge, since the acoustic characteristics of the source speakers are changed greatly.
This study is conducted with the assumption that certain information from the source speakers persists, even when their voices undergo conversion into different target voices.
- Score: 4.822108779108675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Voice conversion (VC) using deep learning technologies can now generate high-quality one-to-many voices and has therefore been adopted in practical application fields such as entertainment and healthcare. However, voice conversion can pose potential social issues when manipulated voices are employed for deceptive purposes. Moreover, identifying the real speakers behind converted voices is a major challenge, since the acoustic characteristics of the source speakers are changed greatly. In this paper we explore the feasibility of identifying authentic speakers from converted voices. This study is conducted under the assumption that certain information from the source speakers persists even when their voices are converted into different target voices. Our experiments are therefore geared towards recognising the source speakers given the converted voices, which are generated by applying FragmentVC to randomly paired utterances from source and target speakers. To improve robustness against converted voices, our recognition model is constructed using hierarchical vectors of locally aggregated descriptors (VLAD) in deep neural networks. The authentic speaker recognition system is tested mainly in two aspects: the impact of the quality of the converted voices and the variations of VLAD. The dataset used in this work is the VCTK corpus, where source and target speakers are randomly paired. The results obtained on the converted utterances show promising performance in recognising authentic speakers from converted voices.
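The abstract names the main ingredients (frame-level features from converted utterances, VLAD aggregation, a classifier over source speakers) without giving the exact architecture. The sketch below is a minimal, single-level NetVLAD-style pooling layer on top of a simple convolutional frame encoder; the hierarchical VLAD variant, the encoder, and every dimension and layer name here are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLADPooling(nn.Module):
    """NetVLAD-style pooling: soft-assign frame features to learned clusters and
    aggregate residuals into a fixed-size utterance descriptor."""
    def __init__(self, feat_dim=256, num_clusters=8):
        super().__init__()
        self.assign = nn.Linear(feat_dim, num_clusters)          # soft-assignment logits
        self.centroids = nn.Parameter(torch.randn(num_clusters, feat_dim))

    def forward(self, x):                                        # x: (batch, time, feat_dim)
        a = F.softmax(self.assign(x), dim=-1)                    # (batch, time, clusters)
        residuals = x.unsqueeze(2) - self.centroids              # (batch, time, clusters, feat_dim)
        vlad = (a.unsqueeze(-1) * residuals).sum(dim=1)          # (batch, clusters, feat_dim)
        vlad = F.normalize(vlad, dim=-1)                         # intra-normalisation
        return F.normalize(vlad.flatten(1), dim=-1)              # (batch, clusters * feat_dim)

class SourceSpeakerRecogniser(nn.Module):
    """Frame encoder + VLAD pooling + classifier over source-speaker identities."""
    def __init__(self, n_mels=80, feat_dim=256, num_clusters=8, num_speakers=100):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, feat_dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.pool = VLADPooling(feat_dim, num_clusters)
        self.classifier = nn.Linear(feat_dim * num_clusters, num_speakers)

    def forward(self, mel):                                      # mel: (batch, n_mels, time)
        frames = self.encoder(mel).transpose(1, 2)               # (batch, time, feat_dim)
        return self.classifier(self.pool(frames))                # logits over source speakers
```

In this setup the training labels are the source-speaker identities of the converted utterances, which matches the recognition task described in the abstract.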
Related papers
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that can convert many accents into a native accent to overcome these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z)
- Catch You and I Can: Revealing Source Voiceprint Against Voice Conversion [0.0]
We make the first attempt to restore the source voiceprint from audio synthesized by voice conversion methods with high credibility.
We develop Revelio, a representation learning model, which learns to effectively extract the voiceprint of the source speaker from converted audio samples.
arXiv Detail & Related papers (2023-02-24T03:33:13Z)
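The summary only says that Revelio learns to extract the source voiceprint from converted audio; its architecture is not described here. Purely as an illustration of that task, the sketch below trains a hypothetical embedding network with a source-speaker classification head and scores a converted utterance against enrolled source-speaker voiceprints by cosine similarity; all names and dimensions are assumptions, not Revelio's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VoiceprintNet(nn.Module):
    """Hypothetical voiceprint extractor: mean-pooled frame features, trained with
    a source-speaker classification head on converted audio."""
    def __init__(self, n_mels=80, emb_dim=192, num_speakers=100):
        super().__init__()
        self.frames = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, emb_dim, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.head = nn.Linear(emb_dim, num_speakers)             # used only during training

    def embed(self, mel):                                        # mel: (batch, n_mels, time)
        return F.normalize(self.frames(mel).mean(dim=-1), dim=-1)

    def forward(self, mel):
        return self.head(self.embed(mel))                        # source-speaker logits

def verify(model, converted_mel, enrolled_embs):
    """Score one converted utterance against enrolled, unit-norm source voiceprints."""
    return model.embed(converted_mel) @ enrolled_embs.t()        # cosine similarity scores
```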
- Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion [34.139871476234205]
We investigate zero-shot voice conversion from a novel perspective of self-supervised disentangled speech representation learning.
A zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to a sequential variational autoencoder (VAE) decoder.
On the TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation, i.e., speaker verification (SV) on speaker and content embeddings, and subjective evaluation, i.e., voice naturalness and similarity, and the method remains robust even with noisy source/target utterances.
arXiv Detail & Related papers (2022-03-30T23:03:19Z)
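As a rough picture of the disentanglement idea above, the sketch below uses separate speaker and content encoders whose outputs are recombined by a decoder, so conversion amounts to pairing the source utterance's content with a different speaker's embedding. It omits the variational (VAE) sampling and KL terms of the actual method, and all module choices and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class DisentangledVC(nn.Module):
    """Sketch of zero-shot VC via disentangled representations: a speaker encoder
    and a content encoder feed a decoder; conversion swaps the speaker embedding."""
    def __init__(self, n_mels=80, spk_dim=128, content_dim=128):
        super().__init__()
        self.speaker_enc = nn.GRU(n_mels, spk_dim, batch_first=True)
        self.content_enc = nn.GRU(n_mels, content_dim, batch_first=True)
        self.decoder = nn.GRU(spk_dim + content_dim, n_mels, batch_first=True)

    def forward(self, src_mel, tgt_mel):         # (batch, time, n_mels) each
        _, spk = self.speaker_enc(tgt_mel)       # target-speaker summary: (1, batch, spk_dim)
        content, _ = self.content_enc(src_mel)   # frame-wise content: (batch, time, content_dim)
        spk = spk[-1].unsqueeze(1).expand(-1, content.size(1), -1)
        out, _ = self.decoder(torch.cat([content, spk], dim=-1))
        return out                               # converted mel-spectrogram frames
```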
- Enhancing Zero-Shot Many to Many Voice Conversion with Self-Attention VAE [8.144263449781967]
Variational autoencoder (VAE) is an effective neural network architecture for disentangling a speech utterance into speaker identity and linguistic content latent embeddings.
In this work, we find a suitable location in the VAE's decoder to add a self-attention layer for incorporating non-local information when generating a converted utterance.
arXiv Detail & Related papers (2022-03-30T03:52:42Z)
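The summary above says a self-attention layer is inserted at a suitable location in the VAE's decoder to capture non-local information. The block below shows the general form such an insertion could take (multi-head self-attention with a residual connection inside a decoder block); the actual insertion point, dimensions, and surrounding decoder are not specified in the summary and are assumed here.

```python
import torch
import torch.nn as nn

class DecoderBlockWithSelfAttention(nn.Module):
    """Sketch: a decoder block with multi-head self-attention inserted so that
    each generated frame can attend to non-local context."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, h):                        # h: (batch, time, dim) decoder hidden states
        a, _ = self.attn(h, h, h)                # non-local mixing across all frames
        h = self.norm(h + a)                     # residual connection + normalisation
        return h + self.ff(h)
```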
- On Prosody Modeling for ASR+TTS based Voice Conversion [82.65378387724641]
In voice conversion, an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents.
Such a paradigm, referred to as ASR+TTS, overlooks the modeling of prosody, which plays an important role in speech naturalness and conversion similarity.
We propose to directly predict prosody from the linguistic representation in a target-speaker-dependent manner, referred to as target text prediction (TTP).
arXiv Detail & Related papers (2021-07-20T13:30:23Z)
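Target text prediction (TTP) as summarised above predicts prosody from the linguistic representation in a target-speaker-dependent way. The sketch below regresses a per-token prosody value (e.g. log-F0) from the linguistic sequence concatenated with a target-speaker embedding; the prosody targets, dimensions, and network are assumptions, since the summary does not specify them.

```python
import torch
import torch.nn as nn

class TargetDependentProsodyPredictor(nn.Module):
    """Sketch of target-speaker-dependent prosody prediction: per-token prosody
    is regressed from linguistic features concatenated with a speaker embedding."""
    def __init__(self, ling_dim=256, spk_dim=128, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(ling_dim + spk_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)           # one prosody value per token

    def forward(self, ling, spk_emb):             # ling: (B, T, ling_dim), spk_emb: (B, spk_dim)
        spk = spk_emb.unsqueeze(1).expand(-1, ling.size(1), -1)
        h, _ = self.rnn(torch.cat([ling, spk], dim=-1))
        return self.out(h).squeeze(-1)             # predicted log-F0 contour, (B, T)
```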
- Many-to-Many Voice Conversion based Feature Disentanglement using Variational Autoencoder [2.4975981795360847]
We propose a new method based on feature disentanglement to tackle many-to-many voice conversion.
The method has the capability to disentangle speaker identity and linguistic content from utterances.
It can convert from many source speakers to many target speakers with a single autoencoder network.
arXiv Detail & Related papers (2021-07-11T13:31:16Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
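The summary above combines vector quantization for content encoding with mutual-information-based disentanglement. The sketch below shows only the VQ bottleneck with a straight-through gradient; the MI estimation and minimisation between content, speaker, and pitch representations used during training are not shown, and the codebook size and feature dimension are assumptions.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Sketch of a VQ bottleneck for content encoding: each frame feature is
    snapped to its nearest codebook entry, with a straight-through gradient."""
    def __init__(self, codebook_size=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                                 # z: (batch, time, dim)
        flat = z.reshape(-1, z.size(-1))                  # (batch*time, dim)
        dists = torch.cdist(flat, self.codebook.weight)   # distances to all codes
        codes = dists.argmin(dim=-1)                      # nearest-code indices
        q = self.codebook(codes).view_as(z)               # quantized content features
        q = z + (q - z).detach()                          # straight-through estimator
        return q, codes.view(z.shape[:-1])
```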
- Defending Your Voice: Adversarial Attack on Voice Conversion [70.19396655909455]
We report the first known attempt to perform an adversarial attack on voice conversion.
We introduce human-imperceptible noise into the utterances of a speaker whose voice is to be defended.
The speaker characteristics of the converted utterances are shown to be clearly different from those of the defended speaker.
arXiv Detail & Related papers (2020-05-18T14:51:54Z)
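The defence above adds imperceptible noise to the defended speaker's utterances. A minimal FGSM-style illustration is sketched below against a hypothetical speaker encoder: one small, sign-bounded gradient step that pushes the encoder's embedding away from the defended speaker. The paper's actual objective, constraints, and attack procedure are not reproduced here.

```python
import torch
import torch.nn.functional as F

def defend_utterance(mel, speaker_encoder, defended_emb, epsilon=0.01):
    """One FGSM-like step that nudges the input so a speaker encoder's embedding
    moves away from the defended speaker (illustrative; not the paper's procedure)."""
    mel = mel.clone().requires_grad_(True)
    emb = speaker_encoder(mel)                                   # embedding of the input
    loss = -F.cosine_similarity(emb, defended_emb, dim=-1).mean()
    loss.backward()                                              # gradient w.r.t. the input
    with torch.no_grad():
        adv = mel + epsilon * mel.grad.sign()                    # small, sign-bounded step
    return adv.detach()
```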
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- Many-to-Many Voice Conversion using Conditional Cycle-Consistent Adversarial Networks [3.1317409221921144]
We extend the CycleGAN by conditioning the network on speakers.
The proposed method can perform many-to-many voice conversion among multiple speakers using a single generative adversarial network (GAN).
Compared to building multiple CycleGANs for each pair of speakers, the proposed method reduces the computational and spatial cost significantly without compromising the sound quality of the converted voice.
arXiv Detail & Related papers (2020-02-15T06:03:36Z)
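To illustrate the conditional cycle-consistency idea above, the sketch below converts source features to a target speaker and back with a single speaker-conditioned generator G (a hypothetical callable taking features and a speaker condition) and penalises the L1 reconstruction error; the adversarial and identity-mapping losses of the full method are omitted.

```python
import torch

def cycle_consistency_loss(G, x_src, src_cond, tgt_cond):
    """Conditional cycle-consistency: convert to the target speaker and back with
    a single speaker-conditioned generator G, then penalise L1 reconstruction error."""
    x_fake = G(x_src, tgt_cond)                                  # source -> target speaker
    x_cycle = G(x_fake, src_cond)                                # target -> back to source
    return torch.mean(torch.abs(x_cycle - x_src))
```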
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
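The self-adaptation idea above extracts a speaker representation directly from the test utterance and uses it as an auxiliary feature. The sketch below summarises the noisy utterance with a small recurrent speaker encoder and concatenates that embedding to every frame of a mask-based enhancer; the paper's multi-head self-attention architecture is not reproduced, and all dimensions and modules are assumptions.

```python
import torch
import torch.nn as nn

class SpeakerAwareEnhancer(nn.Module):
    """Sketch of self-adaptation: a speaker embedding is extracted from the noisy
    test utterance itself and concatenated to every frame as an auxiliary feature."""
    def __init__(self, n_feats=257, spk_dim=128, hidden=256):
        super().__init__()
        self.spk_enc = nn.GRU(n_feats, spk_dim, batch_first=True)
        self.enhancer = nn.GRU(n_feats + spk_dim, hidden, batch_first=True)
        self.mask = nn.Linear(hidden, n_feats)

    def forward(self, noisy):                     # noisy: (batch, time, n_feats) spectrogram
        _, spk = self.spk_enc(noisy)              # utterance-level speaker summary
        spk = spk[-1].unsqueeze(1).expand(-1, noisy.size(1), -1)
        h, _ = self.enhancer(torch.cat([noisy, spk], dim=-1))
        return torch.sigmoid(self.mask(h)) * noisy    # masked (enhanced) spectrogram
```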