FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and
Fusing Fine-Grained Voice Fragments With Attention
- URL: http://arxiv.org/abs/2010.14150v2
- Date: Mon, 3 May 2021 16:05:04 GMT
- Title: FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and
Fusing Fine-Grained Voice Fragments With Attention
- Authors: Yist Y. Lin, Chung-Ming Chien, Jheng-Hao Lin, Hung-yi Lee, Lin-shan
Lee
- Abstract summary: We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0.
FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance.
This approach is trained with reconstruction loss only without any disentanglement considerations between content and speaker information.
- Score: 66.77490220410249
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Any-to-any voice conversion aims to convert the voice from and to any
speakers even unseen during training, which is much more challenging compared
to one-to-one or many-to-many tasks, but much more attractive in real-world
scenarios. In this paper we propose FragmentVC, in which the latent phonetic
structure of the utterance from the source speaker is obtained from Wav2Vec
2.0, while the spectral features of the utterance(s) from the target speaker
are obtained from log mel-spectrograms. By aligning the hidden structures of
the two different feature spaces with a two-stage training process, FragmentVC
is able to extract fine-grained voice fragments from the target speaker
utterance(s) and fuse them into the desired utterance, all based on the
attention mechanism of the Transformer, as verified by analysis of the attention
maps, and accomplished end-to-end. This approach is trained with a reconstruction
loss only, without any disentanglement considerations between content and
speaker information, and does not require parallel data. Objective evaluation
based on speaker verification and subjective evaluation with MOS both showed
that this approach outperformed SOTA approaches, such as AdaIN-VC and AutoVC.
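To make the mechanism above concrete, the following is a minimal, illustrative PyTorch sketch of the core idea: source-side Wav2Vec 2.0 features act as queries and target-speaker log mel frames act as keys/values in a Transformer cross-attention layer, and the model is trained with a reconstruction loss only. All module names, dimensions, and the single extractor/smoother layer are assumptions for illustration, not the authors' implementation (the actual FragmentVC uses a deeper stack and a two-stage training schedule).
```python
# Minimal sketch of the FragmentVC idea (not the official implementation).
# Assumptions: Wav2Vec 2.0 features for the source utterance and log-mel
# frames for the target speaker are precomputed elsewhere; layer counts,
# dimensions, and module names are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FragmentFuser(nn.Module):
    """Source content features form the queries; target-speaker mel frames
    form the keys/values, so attention selects (extracts) voice fragments
    from the target and fuses them into the output frames."""

    def __init__(self, d_model=256, w2v_dim=768, n_mels=80, n_heads=4):
        super().__init__()
        self.content_proj = nn.Linear(w2v_dim, d_model)  # source: Wav2Vec 2.0 -> d_model
        self.target_proj = nn.Linear(n_mels, d_model)    # target: log-mel -> d_model
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.smoother = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.mel_head = nn.Linear(d_model, n_mels)       # predict converted mel frames

    def forward(self, src_w2v, tgt_mel):
        # src_w2v: (B, T_src, w2v_dim), tgt_mel: (B, T_tgt, n_mels)
        q = self.content_proj(src_w2v)
        kv = self.target_proj(tgt_mel)
        fused, attn = self.cross_attn(q, kv, kv)         # extract + fuse fragments
        out = self.smoother(fused)
        return self.mel_head(out), attn                  # (B, T_src, n_mels), attention map


# Reconstruction-only training: source and "target" come from the same
# utterance, so the predicted mel should match the ground-truth mel.
model = FragmentFuser()
src_w2v = torch.randn(2, 120, 768)  # dummy Wav2Vec 2.0 features
gt_mel = torch.randn(2, 150, 80)    # dummy ground-truth log-mel frames
pred_mel, attn_map = model(src_w2v, gt_mel)
T = min(pred_mel.size(1), gt_mel.size(1))                # lengths differ; crop for the sketch
loss = F.l1_loss(pred_mel[:, :T], gt_mel[:, :T])
loss.backward()
```
Because the queries carry content and the keys/values carry target-speaker spectra, the attention map in this sketch indicates which target fragments contribute to each output frame, which is the kind of attention-map analysis reported in the paper.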
Related papers
- Learning Speech Representation From Contrastive Token-Acoustic
Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection [15.884911752869437]
We present a novel approach for synthetic speech detection, exploiting the combination of two high-level semantic properties of the human voice.
On one side, we focus on speaker identity cues and represent them as speaker embeddings extracted using a state-of-the-art method for the automatic speaker verification task.
On the other side, voice prosody, intended as variations in rhythm, pitch or accent in speech, is extracted through a specialized encoder.
arXiv Detail & Related papers (2022-10-31T11:03:03Z)
- VCSE: Time-Domain Visual-Contextual Speaker Extraction Network [54.67547526785552]
We propose a two-stage time-domain visual-contextual speaker extraction network named VCSE.
In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence.
In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues.
arXiv Detail & Related papers (2022-10-09T12:29:38Z)
- Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion [34.139871476234205]
We investigate zero-shot voice conversion from a novel perspective of self-supervised disentangled speech representation learning.
A zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to a sequential variational autoencoder (VAE) decoder.
On the TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation, i.e., speaker verification (SV) on the speaker and content embeddings, and subjective evaluation, i.e., voice naturalness and similarity, and the method remains robust even with noisy source/target utterances.
arXiv Detail & Related papers (2022-03-30T23:03:19Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Audio-visual Speech Separation with Adversarially Disentangled Visual Representation [23.38624506211003]
Speech separation aims to separate individual voices from an audio mixture of multiple simultaneous talkers.
In our model, we use the face detector to detect the number of speakers in the scene and use visual information to avoid the permutation problem.
Our proposed model is shown to outperform the state-of-the-art audio-only model and three audio-visual models.
arXiv Detail & Related papers (2020-11-29T10:48:42Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- Weakly Supervised Training of Hierarchical Attention Networks for Speaker Identification [37.33388614967888]
A hierarchical attention network is proposed to solve a weakly labelled speaker identification problem.
The use of a hierarchical structure, consisting of a frame-level encoder and a segment-level encoder, aims to learn speaker related information locally and globally.
arXiv Detail & Related papers (2020-05-15T22:57:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.