FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion
- URL: http://arxiv.org/abs/2210.15418v1
- Date: Thu, 27 Oct 2022 13:32:38 GMT
- Title: FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion
- Authors: Jingyi Li, Weiping Tu, Li Xiao
- Abstract summary: We adopt the end-to-end framework of VITS for high-quality waveform reconstruction.
We disentangle content information by imposing an information bottleneck on WavLM features.
We propose spectrogram-resize-based data augmentation to improve the purity of the extracted content information.
- Score: 17.274784447811665
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Voice conversion (VC) can be achieved by first extracting source content
information and target speaker information, and then reconstructing the waveform
from this information. However, current approaches typically either extract
dirty content information with speaker information leaked in, or demand a large
amount of annotated data for training. Besides, the quality of the reconstructed
waveform can be degraded by the mismatch between the conversion model and the
vocoder. In this paper, we adopt the end-to-end framework of VITS for
high-quality waveform reconstruction, and propose strategies for clean content
information extraction without text annotation. We disentangle content
information by imposing an information bottleneck on WavLM features, and propose
spectrogram-resize-based data augmentation to improve the purity of the
extracted content information. Experimental results show that the proposed
method outperforms the latest VC models trained with annotated data and has
greater robustness.
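As a rough illustration of the spectrogram-resize (SR) augmentation described above, here is a minimal Python sketch, assuming a log-mel input and scipy-based resizing; the function name, sampling range, and padding details are illustrative assumptions, and the paper's full pipeline (which reconstructs a waveform from the modified spectrogram) is omitted.

```python
import numpy as np
from scipy.ndimage import zoom

def spectrogram_resize(mel, ratio):
    """Sketch of spectrogram-resize (SR) augmentation: stretch or squeeze
    the mel-spectrogram along the frequency axis, then pad or crop back to
    the original number of bins, altering speaker timbre (formants) while
    leaving the content largely intact.

    mel: (n_mels, n_frames) log-mel spectrogram.
    ratio: vertical resize ratio, e.g. drawn from a range such as
        [0.85, 1.15] (the sampling range is an assumption, not from the paper).
    """
    n_mels = mel.shape[0]
    resized = zoom(mel, (ratio, 1.0), order=1)   # resize frequency axis only
    if resized.shape[0] < n_mels:                # squeezed: pad the top bins
        pad = n_mels - resized.shape[0]
        resized = np.pad(resized, ((0, pad), (0, 0)), mode="edge")
    else:                                        # stretched: crop the top bins
        resized = resized[:n_mels]
    return resized
```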
Related papers
- VQalAttent: a Transparent Speech Generation Pipeline based on Transformer-learned VQ-VAE Latent Space [0.49109372384514843]
VQalAttent is a lightweight model designed to generate fake speech with tunable performance and interpretability.
Our results demonstrate VQalAttent's capacity to generate intelligible speech samples with limited computational resources.
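As background on the Transformer-learned VQ-VAE latent space named in the title, here is a minimal sketch of the nearest-neighbour codebook lookup at the core of any VQ-VAE; the names and shapes are illustrative, not taken from the paper.

```python
import numpy as np

def vector_quantize(z, codebook):
    """Replace each encoder output with its nearest codebook vector.

    z: (n, d) continuous encoder outputs.
    codebook: (K, d) learned code vectors.
    Returns the quantized vectors and their discrete indices; the index
    sequence is the latent space a transformer prior can then model.
    """
    # Squared Euclidean distance from every z to every code, shape (n, K).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx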
arXiv Detail & Related papers (2024-11-22T00:21:39Z)
- Maintaining Informative Coherence: Migrating Hallucinations in Large Language Models via Absorbing Markov Chains [6.920249042435973]
Large Language Models (LLMs) are powerful tools for text generation, translation, and summarization.
LLMs often suffer from hallucinations: instances where they fail to maintain the fidelity and coherence of contextual information.
We propose a novel decoding strategy that leverages absorbing Markov chains to quantify the significance of contextual information.
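The summary does not specify how significance is quantified; purely as background, here is a minimal sketch of the standard absorbing-Markov-chain quantities (fundamental matrix, absorption probabilities, expected steps to absorption) that such a decoding strategy could build on. The transition values are made up.

```python
import numpy as np

# Canonical form of an absorbing Markov chain: Q maps transient states to
# transient states, R maps transient states to absorbing states.
Q = np.array([[0.5, 0.2],
              [0.3, 0.4]])
R = np.array([[0.3],
              [0.3]])

# Fundamental matrix N = (I - Q)^-1: N[i, j] is the expected number of
# visits to transient state j starting from transient state i.
N = np.linalg.inv(np.eye(Q.shape[0]) - Q)

# B[i, k]: probability of eventually being absorbed in absorbing state k.
B = N @ R

# Expected number of steps before absorption from each transient state.
t = N @ np.ones(Q.shape[0])
print(B, t)
```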
arXiv Detail & Related papers (2024-10-27T04:51:18Z)
- Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding [61.89781979702939]
This study quantitatively reveals an "impossible trinity" among data quantity, diversity, and quality in pre-training datasets.
Recent efforts use synthetic annotations to refine large-scale, diverse ASR datasets that are compromised by low quality.
We introduce the Video DataFlywheel framework, which iteratively refines video annotations with improved noise control methods.
arXiv Detail & Related papers (2024-09-29T03:33:35Z)
- An Information Bottleneck Perspective for Effective Noise Filtering on Retrieval-Augmented Generation [35.76451156732993]
We introduce information bottleneck theory into retrieval-augmented generation.
Our approach filters noise by maximizing the mutual information between the compressed representation and the ground-truth output.
We derive the information bottleneck formula to facilitate its application in novel comprehensive evaluations.
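For reference, the textbook information bottleneck objective (Tishby et al.) that this line of work instantiates: compress the input X into a representation that stays predictive of the target Y. This is the classic form, not necessarily the paper's exact objective.

```latex
% Classic information bottleneck: trade compression of X against
% prediction of Y, with beta controlling the trade-off.
\min_{p(\tilde{x}\mid x)} \; I(X;\tilde{X}) \;-\; \beta\, I(\tilde{X};Y)
```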
arXiv Detail & Related papers (2024-06-03T17:31:06Z)
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach to developing image captioning models that exploit an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
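A minimal sketch of the kNN lookup over an external memory, using cosine similarity as a stand-in for the visual-similarity measure mentioned in the summary; all names, feature shapes, and the similarity choice are assumptions.

```python
import numpy as np

def knn_retrieve(query_feat, memory_feats, k=5):
    """Return the indices of the k memory entries most similar to the query.

    query_feat: (d,) visual feature of the input image.
    memory_feats: (n, d) visual features of the images stored in the
        external memory; their captions would condition the decoder.
    """
    q = query_feat / np.linalg.norm(query_feat)
    m = memory_feats / np.linalg.norm(memory_feats, axis=1, keepdims=True)
    sims = m @ q                    # (n,) cosine similarities
    return np.argsort(-sims)[:k]    # top-k neighbour indices

# Toy usage with random features.
rng = np.random.default_rng(0)
top = knn_retrieve(rng.normal(size=512), rng.normal(size=(1000, 512)), k=5)
```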
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- RegaVAE: A Retrieval-Augmented Gaussian Mixture Variational Auto-Encoder for Language Modeling [79.56442336234221]
We introduce RegaVAE, a retrieval-augmented language model built upon the variational auto-encoder (VAE).
It encodes the text corpus into a latent space, capturing current and future information from both source and target text.
Experimental results on various datasets demonstrate significant improvements in text generation quality and hallucination removal.
arXiv Detail & Related papers (2023-10-16T16:42:01Z)
- Optimizing Factual Accuracy in Text Generation through Dynamic Knowledge Selection [71.20871905457174]
Language models (LMs) have revolutionized the way we interact with information, but they often generate nonfactual text.
Previous methods use external knowledge as references for text generation to enhance factuality but often struggle with the knowledge mix-up of irrelevant references.
We present DKGen, which divides text generation into an iterative process.
arXiv Detail & Related papers (2023-08-30T02:22:40Z)
- Towards Improved Zero-shot Voice Conversion with Conditional DSVAE [30.376259456529368]
Disentangling content and speaking style information is essential for zero-shot non-parallel voice conversion.
We propose conditional DSVAE, a new model that incorporates content bias as a condition on the prior modeling.
We demonstrate that content embeddings derived from the conditional DSVAE overcome the randomness and achieve a much better phoneme classification accuracy.
arXiv Detail & Related papers (2022-05-11T01:19:42Z)
- Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion [34.139871476234205]
We investigate zero-shot voice conversion from a novel perspective of self-supervised disentangled speech representation learning.
Zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to a sequential variational autoencoder (VAE) decoder.
On the TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation, i.e., speaker verification (SV) on the speaker and content embeddings, and subjective evaluation, i.e., voice naturalness and similarity, and the method remains robust even with noisy source/target utterances.
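Schematically, the conversion step described above pairs the source utterance's content embeddings with the target utterance's speaker embedding before decoding; the sketch below uses PyTorch and hypothetical module names, neither of which comes from the paper.

```python
import torch

def zero_shot_convert(content_encoder, speaker_encoder, decoder,
                      src_utt, tgt_utt):
    """Keep the source content, swap in the target speaker (hypothetical API)."""
    with torch.no_grad():
        content = content_encoder(src_utt)   # (T, d_c) per-frame content
        speaker = speaker_encoder(tgt_utt)   # (d_s,) utterance-level voice
        speaker = speaker.unsqueeze(0).expand(content.size(0), -1)
        return decoder(torch.cat([content, speaker], dim=-1))
```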
arXiv Detail & Related papers (2022-03-30T23:03:19Z)
- StreamHover: Livestream Transcript Summarization and Annotation [54.41877742041611]
We present StreamHover, a framework for annotating and summarizing livestream transcripts.
With a total of over 500 hours of videos annotated with both extractive and abstractive summaries, our benchmark dataset is significantly larger than currently existing annotated corpora.
We show that our model generalizes better and improves performance over strong baselines.
arXiv Detail & Related papers (2021-09-11T02:19:37Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.