SEF-VC: Speaker Embedding Free Zero-Shot Voice Conversion with Cross Attention
- URL: http://arxiv.org/abs/2312.08676v2
- Date: Tue, 30 Jan 2024 14:11:29 GMT
- Title: SEF-VC: Speaker Embedding Free Zero-Shot Voice Conversion with Cross Attention
- Authors: Junjie Li, Yiwei Guo, Xie Chen, Kai Yu
- Abstract summary: SEF-VC is a speaker embedding free voice conversion model.
It learns and incorporates speaker timbre from reference speech via a powerful position-agnostic cross-attention mechanism.
It reconstructs the waveform from HuBERT semantic tokens in a non-autoregressive manner.
- Score: 24.842378497026154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-shot voice conversion (VC) aims to convert the timbre of source speech to that of an arbitrary unseen target speaker while keeping the linguistic content unchanged. Although the voice of the generated speech can be controlled by providing a speaker embedding of the target speaker, the speaker similarity still lags behind that of ground-truth recordings. In this paper, we propose SEF-VC, a speaker embedding free voice conversion model, which is designed to learn and incorporate speaker timbre from reference speech via a powerful position-agnostic cross-attention mechanism, and then to reconstruct the waveform from HuBERT semantic tokens in a non-autoregressive manner. The concise design of SEF-VC enhances its training stability and voice conversion performance. Objective and subjective evaluations demonstrate the superiority of SEF-VC in generating high-quality speech with better similarity to the target reference than strong zero-shot VC baselines, even for very short reference speech.
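As a rough illustration of the central mechanism, the sketch below shows what a position-agnostic cross-attention conditioning block could look like in PyTorch: content frames derived from HuBERT semantic tokens act as queries, while reference-speech frames, which carry no positional encoding, supply keys and values. All module names, dimensions, and the residual layout are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PositionAgnosticCrossAttention(nn.Module):
    """Content frames (queries) attend to reference-speech frames
    (keys/values). No positional encoding is applied to the reference,
    so the result depends on frame content, not frame order."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, content: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
        # content:   (B, T_content, d_model) - embedded HuBERT semantic tokens
        # reference: (B, T_ref, d_model)     - encoded reference-speech frames
        timbre, _ = self.attn(query=content, key=reference, value=reference)
        return self.norm(content + timbre)  # residual add of gathered timbre

# Toy usage: 100 content frames gather timbre from a 150-frame reference.
layer = PositionAgnosticCrossAttention()
out = layer(torch.randn(2, 100, 256), torch.randn(2, 150, 256))
print(out.shape)  # torch.Size([2, 100, 256])
```

Because the reference side is position-free, the mechanism treats the reference as an unordered pool of timbre evidence, which is consistent with the abstract's claim that it works even for very short references.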
Related papers
- UniSpeaker: A Unified Approach for Multimodality-driven Speaker Generation [66.49076386263509]
This paper introduces UniSpeaker, a unified approach for multimodality-driven speaker generation.
We propose a unified voice aggregator based on KV-Former, applying soft contrastive loss to map diverse voice description modalities into a shared voice space.
UniSpeaker is evaluated across five tasks using the MVC benchmark, and the experimental results demonstrate that UniSpeaker outperforms previous modality-specific models.
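A minimal sketch of what a soft contrastive alignment into a shared voice space could look like: paired (description, voice) embeddings are pulled together, but the usual one-hot InfoNCE target is smoothed. The temperature, smoothing scheme, and pairing logic are assumptions, not the UniSpeaker recipe.

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(desc_emb, voice_emb, tau=0.07, smooth=0.1):
    """Illustrative soft contrastive loss over a batch of aligned
    (description, voice) pairs; off-diagonal entries receive a small
    share of probability mass instead of acting as hard negatives."""
    desc = F.normalize(desc_emb, dim=-1)
    voice = F.normalize(voice_emb, dim=-1)
    logits = desc @ voice.t() / tau          # (B, B) cosine similarities
    n = logits.size(0)
    target = torch.full_like(logits, smooth / (n - 1))
    target.fill_diagonal_(1.0 - smooth)      # soft targets instead of one-hot
    return -(target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

loss = soft_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```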
arXiv Detail & Related papers (2025-01-11T00:47:29Z)
- AdaptVC: High Quality Voice Conversion with Adaptive Learning [28.25726543043742]
The key challenge is to extract disentangled linguistic content from the source and voice style from the reference.
In this paper, we achieve successful disentanglement of content and speaker features by tuning self-supervised speech features with adapters.
The adapters are trained to dynamically encode nuanced features from rich self-supervised features, and the decoder fuses them to produce speech that accurately resembles the reference.
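A minimal sketch of the adapter idea over frozen self-supervised features, assuming a learned layer-weighted sum followed by a bottleneck residual MLP; layer count and sizes are illustrative guesses, not the AdaptVC configuration.

```python
import torch
import torch.nn as nn

class SSLAdapter(nn.Module):
    """Bottleneck adapter over frozen SSL features: mix the encoder's
    hidden layers with learned weights, then apply a small residual MLP."""

    def __init__(self, n_layers: int = 12, d_ssl: int = 768, d_bottleneck: int = 64):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))
        self.down = nn.Linear(d_ssl, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_ssl)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (n_layers, B, T, d_ssl) from a frozen SSL encoder
        w = torch.softmax(self.layer_weights, dim=0)
        mixed = (w[:, None, None, None] * hidden_states).sum(dim=0)
        return mixed + self.up(torch.relu(self.down(mixed)))  # residual adapter

adapter = SSLAdapter()
feats = adapter(torch.randn(12, 2, 50, 768))  # -> (2, 50, 768)
```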
arXiv Detail & Related papers (2025-01-02T16:54:08Z)
- Noro: A Noise-Robust One-shot Voice Conversion System with Hidden Speaker Representation Capabilities [29.692178856614014]
One-shot voice conversion (VC) aims to alter the timbre of speech from a source speaker to match that of a target speaker using just a single reference utterance from the target speaker.
Despite advancements in one-shot VC, its effectiveness decreases in real-world scenarios where reference speeches, often sourced from the internet, contain various disturbances like background noise.
arXiv Detail & Related papers (2024-11-29T15:18:01Z)
- Zero-shot Voice Conversion with Diffusion Transformers [0.0]
Zero-shot voice conversion aims to transform a source speech utterance to match the timbre of a reference speech from an unseen speaker.
Traditional approaches struggle with timbre leakage, insufficient timbre representation, and mismatches between training and inference tasks.
We propose Seed-VC, a novel framework that addresses these issues by introducing an external timbre shifter during training.
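The sketch below illustrates the timbre-shifter idea as a hypothetical training step: the content input has its timbre perturbed, so timbre can only be recovered from the reference, matching the inference-time task. All callables are placeholder stand-ins, not the Seed-VC implementation.

```python
import torch
import torch.nn.functional as F

def seed_vc_style_step(model, timbre_shifter, src, ref):
    """Hypothetical training step: `timbre_shifter` is an external model
    that keeps content but alters timbre; `ref` is another utterance from
    the same speaker as `src`, and the target is the original `src`."""
    with torch.no_grad():
        shifted = timbre_shifter(src)   # same content, perturbed timbre
    pred = model(shifted, ref)          # timbre must come from the reference
    return F.l1_loss(pred, src)

# Dummy stand-ins just to show the data flow.
loss = seed_vc_style_step(lambda c, r: c + 0.0 * r.mean(),
                          lambda x: x.roll(1, dims=-1),
                          torch.randn(2, 80, 100), torch.randn(2, 80, 120))
```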
arXiv Detail & Related papers (2024-11-15T04:43:44Z)
- High-Quality Automatic Voice Over with Accurate Alignment: Supervision through Self-Supervised Discrete Speech Units [69.06657692891447]
We propose a novel AVO method leveraging the learning objective of self-supervised discrete speech unit prediction.
Experimental results show that our proposed method achieves remarkable lip-speech synchronization and high speech quality.
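A sketch of what supervision through discrete speech unit prediction could look like: a head that predicts per-frame unit IDs (e.g., K-means-clustered HuBERT codes) with cross-entropy. Vocabulary size and feature width are assumptions.

```python
import torch
import torch.nn as nn

class UnitPredictionHead(nn.Module):
    """Illustrative objective: predict self-supervised discrete speech
    units frame by frame from upstream features, via cross-entropy."""

    def __init__(self, d_in: int = 512, n_units: int = 500):
        super().__init__()
        self.proj = nn.Linear(d_in, n_units)
        self.loss = nn.CrossEntropyLoss()

    def forward(self, feats: torch.Tensor, unit_ids: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, d_in) upstream features; unit_ids: (B, T) targets
        logits = self.proj(feats)                       # (B, T, n_units)
        return self.loss(logits.transpose(1, 2), unit_ids)

head = UnitPredictionHead()
loss = head(torch.randn(2, 40, 512), torch.randint(0, 500, (2, 40)))
```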
arXiv Detail & Related papers (2023-06-29T15:02:22Z)
- Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation [11.336431583289382]
This paper presents a method for end-to-end cross-lingual text-to-speech.
It aims to preserve the target language's pronunciation regardless of the original speaker's language.
arXiv Detail & Related papers (2022-10-31T12:44:53Z)
- Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion [34.139871476234205]
We investigate zero-shot voice conversion from a novel perspective of self-supervised disentangled speech representation learning.
A zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to a sequential variational autoencoder (VAE) decoder.
On the TIMIT and VCTK datasets, we achieve state-of-the-art performance in both objective evaluation, i.e., speaker verification (SV) on the speaker and content embeddings, and subjective evaluation, i.e., voice naturalness and similarity, and the method remains robust even with noisy source/target utterances.
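A minimal sketch of the conversion path, assuming a GRU decoder that consumes an utterance-level speaker embedding broadcast over the content embedding sequence; swapping in an arbitrary speaker embedding performs the zero-shot conversion. Dimensions and the decoder choice are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SeqVAEDecoder(nn.Module):
    """Sequential decoder: an utterance-level speaker embedding is
    broadcast over the content sequence, so replacing the speaker
    embedding converts the voice while content is preserved."""

    def __init__(self, d_spk=64, d_content=64, d_mel=80):
        super().__init__()
        self.rnn = nn.GRU(d_spk + d_content, 256, batch_first=True)
        self.out = nn.Linear(256, d_mel)

    def forward(self, spk: torch.Tensor, content: torch.Tensor) -> torch.Tensor:
        # spk: (B, d_spk) utterance-level; content: (B, T, d_content)
        spk_seq = spk[:, None, :].expand(-1, content.size(1), -1)
        h, _ = self.rnn(torch.cat([spk_seq, content], dim=-1))
        return self.out(h)  # (B, T, d_mel) converted mel-spectrogram

dec = SeqVAEDecoder()
mel = dec(torch.randn(2, 64), torch.randn(2, 120, 64))  # target-voice mel
```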
arXiv Detail & Related papers (2022-03-30T23:03:19Z)
- VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion [77.50171525265056]
This paper proposes a novel multi-speaker Video-to-Speech (VTS) system based on cross-modal knowledge transfer from voice conversion (VC).
The Lip2Ind network can substitute for the content encoder of the VC model, forming a multi-speaker VTS system that converts silent video into acoustic units from which accurate spoken content is reconstructed.
arXiv Detail & Related papers (2022-02-18T08:58:45Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
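A sketch of the VQ content bottleneck with a straight-through estimator and commitment term; the mutual-information penalty between representations (e.g., a CLUB-style upper bound) is more involved and is only noted in a comment here. The codebook size is an assumption.

```python
import torch
import torch.nn as nn

class VQContentBottleneck(nn.Module):
    """Quantize frame features to a small codebook so that speaker
    detail is discarded from the content path. VQMIVC additionally
    minimizes MI between content/speaker/pitch representations
    (estimator omitted in this sketch)."""

    def __init__(self, d: int = 64, codebook_size: int = 512):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, d)

    def forward(self, z: torch.Tensor):
        # z: (B, T, d) continuous content features
        dist = torch.cdist(z, self.codebook.weight[None])  # (B, T, K)
        idx = dist.argmin(dim=-1)                          # nearest code
        q = self.codebook(idx)
        q = z + (q - z).detach()   # straight-through estimator for gradients
        commit = ((q.detach() - z) ** 2).mean()            # commitment loss
        return q, idx, commit

vq = VQContentBottleneck()
q, idx, commit = vq(torch.randn(2, 100, 64))
```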
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments [76.98764900754111]
Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker.
We propose Voicy, a new VC framework particularly tailored for noisy speech.
Our method, inspired by the denoising autoencoder framework, comprises four encoders (speaker, content, phonetic, and acoustic-ASR) and one decoder.
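A schematic of the four-encoder/one-decoder composition, with trivial placeholder networks standing in for the real encoders; it shows only the data flow of a denoising-style forward pass (noisy source in, clean target out), not the Voicy implementation.

```python
import torch
import torch.nn as nn

class VoicyStyleVC(nn.Module):
    """Four encoders feed one decoder; the speaker code comes from the
    reference, the other streams from the (noisy) source utterance."""

    def __init__(self, d_mel=80, d=128):
        super().__init__()
        self.speaker_enc = nn.GRU(d_mel, d, batch_first=True)
        self.content_enc = nn.Conv1d(d_mel, d, 3, padding=1)
        self.phonetic_enc = nn.Conv1d(d_mel, d, 3, padding=1)
        self.asr_enc = nn.Conv1d(d_mel, d, 3, padding=1)
        self.decoder = nn.Conv1d(4 * d, d_mel, 3, padding=1)

    def forward(self, noisy_src, ref):
        # noisy_src, ref: (B, T, d_mel)
        _, spk = self.speaker_enc(ref)                  # (1, B, d)
        spk = spk[-1][:, :, None]                       # (B, d, 1)
        x = noisy_src.transpose(1, 2)                   # (B, d_mel, T)
        feats = [self.content_enc(x), self.phonetic_enc(x), self.asr_enc(x),
                 spk.expand(-1, -1, x.size(-1))]
        return self.decoder(torch.cat(feats, dim=1)).transpose(1, 2)

model = VoicyStyleVC()
mel = model(torch.randn(2, 100, 80), torch.randn(2, 60, 80))  # (2, 100, 80)
```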
arXiv Detail & Related papers (2021-06-16T15:47:06Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
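A minimal sketch of the prosody-corrector idea: regress typical per-phoneme duration and pitch from phoneme embeddings, which then replace the atypical prosody of the source. Sizes and the output parameterization (e.g., log-duration) are assumptions.

```python
import torch
import torch.nn as nn

class ProsodyCorrector(nn.Module):
    """From phoneme embeddings, predict typical per-phoneme duration
    (log frames) and pitch targets for the conversion model."""

    def __init__(self, d_phone: int = 256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_phone, 128), nn.ReLU())
        self.duration = nn.Linear(128, 1)   # log-duration per phoneme
        self.pitch = nn.Linear(128, 1)      # pitch target per phoneme

    def forward(self, phone_emb: torch.Tensor):
        # phone_emb: (B, N_phonemes, d_phone)
        h = self.body(phone_emb)
        return self.duration(h).squeeze(-1), self.pitch(h).squeeze(-1)

pc = ProsodyCorrector()
dur, f0 = pc(torch.randn(2, 30, 256))  # typical duration and pitch values
```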
arXiv Detail & Related papers (2020-11-03T13:08:53Z)