Phoneme Hallucinator: One-shot Voice Conversion via Set Expansion
- URL: http://arxiv.org/abs/2308.06382v2
- Date: Sat, 30 Dec 2023 22:48:13 GMT
- Title: Phoneme Hallucinator: One-shot Voice Conversion via Set Expansion
- Authors: Siyuan Shan, Yang Li, Amartya Banerjee, Junier B. Oliva
- Abstract summary: Voice conversion aims at altering a person's voice to make it sound similar to the voice of another person while preserving linguistic content.
Existing methods suffer from a dilemma between content intelligibility and speaker similarity.
We propose a novel method, Phoneme Hallucinator, that achieves the best of both worlds.
- Score: 12.064177287199822
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Voice conversion (VC) aims at altering a person's voice to make it sound
similar to the voice of another person while preserving linguistic content.
Existing methods suffer from a dilemma between content intelligibility and
speaker similarity; i.e., methods with higher intelligibility usually have a
lower speaker similarity, while methods with higher speaker similarity usually
require plenty of target speaker voice data to achieve high intelligibility. In
this work, we propose a novel method, Phoneme Hallucinator, that
achieves the best of both worlds. Phoneme Hallucinator is a one-shot VC model;
it adopts a novel model to hallucinate diversified and high-fidelity target
speaker phonemes based just on a short target speaker voice (e.g. 3 seconds).
The hallucinated phonemes are then exploited to perform neighbor-based voice
conversion. Our model is a text-free, any-to-any VC model that requires no text
annotations and supports conversion to any unseen speaker. Objective and
subjective evaluations show that Phoneme Hallucinator outperforms
existing VC methods for both intelligibility and speaker similarity.
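The abstract describes a two-step pipeline: expand a small set of target-speaker features extracted from a few seconds of speech into a large, diverse pool, then convert by replacing each source frame with its nearest neighbors in that pool. A minimal sketch of this neighbor-based step follows; extract_features, hallucinator, and vocoder are hypothetical placeholders (a self-supervised speech encoder, the set-expansion model, and a feature-to-waveform vocoder), not the paper's released code.
```python
# Illustrative sketch of hallucination-then-kNN voice conversion.
# Assumes frame-level speech features of shape (frames, dim); the helper
# names in the usage example are hypothetical, not the authors' API.
import torch
import torch.nn.functional as F

def knn_convert(source_feats: torch.Tensor,
                target_pool: torch.Tensor,
                k: int = 4) -> torch.Tensor:
    """Replace each source frame with the mean of its k nearest
    target-pool frames under cosine similarity."""
    src = F.normalize(source_feats, dim=-1)   # (T_src, D)
    tgt = F.normalize(target_pool, dim=-1)    # (N_tgt, D)
    sims = src @ tgt.T                        # (T_src, N_tgt) cosine sims
    _, idx = sims.topk(k, dim=-1)             # k nearest target frames
    return target_pool[idx].mean(dim=1)       # (T_src, D) converted feats

# Usage with hypothetical components:
# tgt_feats = extract_features(target_wav_3s)            # small set (~3 s)
# expanded  = hallucinator.sample(tgt_feats, n=10_000)   # set expansion
# pool      = torch.cat([tgt_feats, expanded], dim=0)
# converted = knn_convert(extract_features(source_wav), pool)
# wav_out   = vocoder(converted)
```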
Related papers
- VoicePrompter: Robust Zero-Shot Voice Conversion with Voice Prompt and Conditional Flow Matching [0.7366405857677227]
VoicePrompter is a robust zero-shot voice conversion model that leverages in-context learning with voice prompts.
We show that VoicePrompter outperforms existing zero-shot VC systems in terms of speaker similarity, speech intelligibility, and audio quality.
arXiv Detail & Related papers (2025-01-29T12:34:58Z)
- Character-aware audio-visual subtitling in context [58.95580154761008]
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows.
Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues.
We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches.
arXiv Detail & Related papers (2024-10-14T20:27:34Z)
- MulliVC: Multi-lingual Voice Conversion With Cycle Consistency [75.59590240034261]
MulliVC is a novel voice conversion system that converts only timbre while preserving the original content and source-language prosody, without requiring multi-lingual paired data.
Both objective and subjective results indicate that MulliVC significantly surpasses other methods in both monolingual and cross-lingual contexts.
arXiv Detail & Related papers (2024-08-08T18:12:51Z)
- Creating New Voices using Normalizing Flows [16.747198180269127]
We investigate the ability of normalizing flows in text-to-speech (TTS) and voice conversion (VC) modes to extrapolate from speakers observed during training to create unseen speaker identities.
We use both objective and subjective metrics to benchmark our techniques on 2 evaluation tasks: zero-shot and new voice speech synthesis.
arXiv Detail & Related papers (2023-12-22T10:00:24Z)
- SEF-VC: Speaker Embedding Free Zero-Shot Voice Conversion with Cross Attention [24.842378497026154]
SEF-VC is a speaker embedding free voice conversion model.
It learns and incorporates speaker timbre from reference speech via a powerful position-agnostic cross-attention mechanism (a generic sketch appears after this list).
It reconstructs waveform from HuBERT semantic tokens in a non-autoregressive manner.
arXiv Detail & Related papers (2023-12-14T06:26:55Z)
- Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech [25.707717591185386]
We show that it is possible to clone the voice of a speaker as well as the prosody of a spoken reference independently without any degradation in quality.
All of our code and trained models are available, alongside static and interactive demos.
arXiv Detail & Related papers (2022-06-24T11:54:59Z)
- Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With Style-Adaptive Layer Normalization (SALN), our model effectively synthesizes speech in the style of the target speaker even from a single short reference audio.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
- Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech [54.75722224061665]
In this work, we investigate different speaker representations and propose to integrate pretrained and learnable speaker representations.
The FastSpeech 2 model combined with both pretrained and learnable speaker representations shows great generalization ability on few-shot speakers.
arXiv Detail & Related papers (2021-03-06T10:14:33Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values (a minimal sketch appears after this list).
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
- Latent linguistic embedding for cross-lingual text-to-speech and voice conversion [44.700803634034486]
Cross-lingual speech generation is the scenario in which speech utterances are generated with the voices of target speakers in a language not spoken by them originally.
We show that our method not only creates cross-lingual VC with high speaker similarity but also can be seamlessly used for cross-lingual TTS without having to perform any extra steps.
arXiv Detail & Related papers (2020-10-08T01:25:07Z)
- Speaker Independent and Multilingual/Mixlingual Speech-Driven Talking Head Generation Using Phonetic Posteriorgrams [58.617181880383605]
In this work, we propose a novel approach using phonetic posteriorgrams.
Our method doesn't need hand-crafted features and is more robust to noise compared to recent approaches.
Our model is the first to support multilingual/mixlingual speech as input with convincing results.
arXiv Detail & Related papers (2020-06-20T16:32:43Z)
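As referenced in the SEF-VC entry above, below is a generic sketch of a position-agnostic cross-attention block: content tokens (e.g., HuBERT semantic tokens) attend to reference-speech frames that receive no positional encoding, so the reference is treated as an unordered set of timbre evidence. The module is an illustrative assumption, not the SEF-VC authors' implementation.
```python
# Generic position-agnostic cross-attention: the reference frames carry no
# positional encoding, so only their content (timbre) is attended to.
# Illustrative reconstruction, not the SEF-VC authors' code.
import torch
import torch.nn as nn

class PositionAgnosticCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, content: torch.Tensor, reference: torch.Tensor):
        """content:   (B, T_content, D) semantic tokens of the source speech.
        reference: (B, T_ref, D) reference-speech frames, deliberately
        left without positional encoding."""
        out, _ = self.attn(query=content, key=reference, value=reference)
        return content + out  # residual: content enriched with timbre
```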
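As referenced in the atypical voice conversion entry above, the sketch below shows one plausible form of a prosody corrector: an encoder over phoneme embeddings with separate heads predicting a typical duration and pitch per phoneme, which a downstream conversion model would consume together with the phoneme embeddings. Layer choices and sizes are assumptions for illustration, not the authors' code.
```python
# Hypothetical prosody corrector: phoneme embeddings in, per-phoneme
# typical duration (frames) and pitch (Hz) out. Sizes are assumptions.
import torch
import torch.nn as nn

class ProsodyCorrector(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.encoder = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.duration_head = nn.Linear(2 * dim, 1)
        self.pitch_head = nn.Linear(2 * dim, 1)

    def forward(self, phoneme_emb: torch.Tensor):
        # phoneme_emb: (B, T_phonemes, dim)
        h, _ = self.encoder(phoneme_emb)              # (B, T, 2*dim)
        duration = self.duration_head(h).squeeze(-1)  # (B, T)
        pitch = self.pitch_head(h).squeeze(-1)        # (B, T)
        return duration, pitch
```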