FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion
- URL: http://arxiv.org/abs/2210.15418v1
- Date: Thu, 27 Oct 2022 13:32:38 GMT
- Title: FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion
- Authors: Jingyi Li, Weiping Tu, Li Xiao
- Abstract summary: We adopt the end-to-end framework of VITS for high-quality waveform reconstruction.
We disentangle content information by imposing an information bottleneck on WavLM features.
We propose spectrogram-resize based data augmentation to improve the purity of extracted content information.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Voice conversion (VC) can be achieved by first extracting source content
information and target speaker information, and then reconstructing the waveform
from this information. However, current approaches typically either extract
dirty content information with speaker information leaked in, or demand a large
amount of annotated data for training. Moreover, the quality of the reconstructed
waveform can be degraded by the mismatch between the conversion model and the vocoder.
In this paper, we adopt the end-to-end framework of VITS for high-quality
waveform reconstruction, and propose strategies for clean content information
extraction without text annotation. We disentangle content information by
imposing an information bottleneck on WavLM features, and propose a
spectrogram-resize based data augmentation to improve the purity of extracted
content information. Experimental results show that the proposed method
outperforms the latest VC models trained with annotated data and has greater
robustness.
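The spectrogram-resize (SR) augmentation mentioned above can be sketched as follows: vertically stretch or compress a mel-spectrogram along the frequency axis (which perturbs speaker timbre while leaving content intact), then pad or crop back to the original number of bins. This is a minimal illustrative sketch under those assumptions; the function name and interpolation scheme are mine, not the paper's implementation.

```python
import numpy as np

def spectrogram_resize(mel: np.ndarray, ratio: float) -> np.ndarray:
    """Resize a mel-spectrogram (freq_bins x frames) along the frequency
    axis by `ratio`, then restore the original number of bins.
    Illustrative sketch of SR-style augmentation, not the paper's code."""
    n_bins, _ = mel.shape
    resized_bins = max(1, int(round(n_bins * ratio)))
    # linear interpolation along the frequency axis
    src = np.linspace(0, n_bins - 1, resized_bins)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, n_bins - 1)
    frac = (src - lo)[:, None]
    resized = mel[lo] * (1 - frac) + mel[hi] * frac
    if resized_bins >= n_bins:
        # ratio > 1 stretched the spectrum: crop the extra top bins
        return resized[:n_bins]
    # ratio < 1 compressed it: pad the top by repeating the edge bin
    pad = np.repeat(resized[-1:], n_bins - resized_bins, axis=0)
    return np.concatenate([resized, pad], axis=0)
```

Applying the function with a ratio drawn from, say, [0.85, 1.15] per utterance yields timbre-perturbed copies of the same content, which is the property the augmentation exploits.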
Related papers
- An Information Bottleneck Perspective for Effective Noise Filtering on Retrieval-Augmented Generation [35.76451156732993]
We introduce the information bottleneck theory into retrieval-augmented generation.
Our approach filters noise by maximizing the mutual information between the compressed representation and the ground-truth output.
We derive the information bottleneck formulation to facilitate its application in comprehensive evaluations.
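For reference, the information bottleneck objective that this line of work appeals to is standardly written as follows (a textbook formulation, not quoted from the abstract):

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
```

where $Z$ is the compressed representation, $I(\cdot\,;\cdot)$ denotes mutual information, and $\beta$ trades off compressing the input $X$ against preserving information about the target $Y$.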
arXiv Detail & Related papers (2024-06-03T17:31:06Z)
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
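The knowledge-retriever component described above can be illustrated with a generic k-nearest-neighbour lookup over an embedding memory. This is a hedged sketch of the general idea; the cosine-similarity metric and function names are assumptions of mine, not the paper's retriever.

```python
import numpy as np

def knn_retrieve(query: np.ndarray, memory: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k memory entries most similar to `query`
    under cosine similarity. Generic kNN-memory sketch."""
    q = query / np.linalg.norm(query)
    m = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity to every entry
    return np.argsort(-sims)[:k]      # indices of the k best matches
```

In a captioning pipeline of this kind, the retrieved indices would point at captions (or caption embeddings) associated with visually similar images, which are then fed to the generator as additional context.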
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- RegaVAE: A Retrieval-Augmented Gaussian Mixture Variational Auto-Encoder for Language Modeling [79.56442336234221]
We introduce RegaVAE, a retrieval-augmented language model built upon the variational auto-encoder (VAE).
It encodes the text corpus into a latent space, capturing current and future information from both source and target text.
Experimental results on various datasets demonstrate significant improvements in text generation quality and hallucination removal.
arXiv Detail & Related papers (2023-10-16T16:42:01Z)
- A Large-scale Dataset for Audio-Language Representation Learning [54.933479346870506]
We present an innovative and automatic audio caption generation pipeline based on a series of public tools or APIs.
We construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.9M audio-text pairs.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- Optimizing Factual Accuracy in Text Generation through Dynamic Knowledge Selection [71.20871905457174]
Language models (LMs) have revolutionized the way we interact with information, but they often generate nonfactual text.
Previous methods use external knowledge as references for text generation to enhance factuality but often struggle with the knowledge mix-up of irrelevant references.
We present DKGen, which divides text generation into an iterative process of dynamic knowledge selection.
arXiv Detail & Related papers (2023-08-30T02:22:40Z)
- TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and Adversarial Training [32.35100329067037]
A novel voice conversion framework named Text Guided AutoVC (TGAVC) is proposed.
Adversarial training is applied to eliminate speaker identity information from the content embedding extracted from speech.
Experiments on AIShell-3 dataset show that the proposed model outperforms AutoVC in terms of naturalness and similarity of converted speech.
arXiv Detail & Related papers (2022-08-08T10:33:36Z)
- Towards Improved Zero-shot Voice Conversion with Conditional DSVAE [30.376259456529368]
Disentangling content and speaking style information is essential for zero-shot non-parallel voice conversion.
We propose conditional DSVAE, a new model that introduces content bias as a condition on the prior modeling.
We demonstrate that content embeddings derived from the conditional DSVAE overcome the randomness and achieve a much better phoneme classification accuracy.
arXiv Detail & Related papers (2022-05-11T01:19:42Z)
- Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion [34.139871476234205]
We investigate zero-shot voice conversion from a novel perspective of self-supervised disentangled speech representation learning.
A zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to a sequential variational autoencoder (VAE) decoder.
On the TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation, i.e., speaker verification (SV) on the speaker and content embeddings, and subjective evaluation, i.e., voice naturalness and similarity; the method remains robust even with noisy source/target utterances.
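The conversion recipe in this summary, feeding a target speaker embedding together with the source content embeddings to a decoder, can be sketched with a toy linear "decoder". The weights, shapes, and names here are illustrative assumptions, not the paper's sequential VAE.

```python
import numpy as np

rng = np.random.default_rng(0)

def decode(speaker_emb, content_seq, W_s, W_c):
    """Toy 'decoder': a single linear layer combining one speaker embedding
    with a sequence of per-frame content embeddings. Illustrates only the
    embedding-swap recipe; a real decoder is a learned sequential model."""
    return content_seq @ W_c.T + speaker_emb @ W_s.T   # (frames, out_dim)

# zero-shot conversion: source content + arbitrary target speaker
content_src = rng.standard_normal((50, 16))  # 50 frames of content embeddings
spk_tgt = rng.standard_normal(8)             # target speaker embedding
W_c = rng.standard_normal((32, 16))
W_s = rng.standard_normal((32, 8))
converted = decode(spk_tgt, content_src, W_s, W_c)
```

The key property is that the content sequence and the speaker embedding enter the decoder independently, so either can be swapped at inference time without retraining.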
arXiv Detail & Related papers (2022-03-30T23:03:19Z)
- Identifying Introductions in Podcast Episodes from Automatically Generated Transcripts [0.0]
We build a novel dataset of complete transcriptions of over 400 podcast episodes, with their introductions annotated.
These introductions contain information about the episodes' topics, hosts, and guests.
We train three Transformer models based on the pre-trained BERT and different augmentation strategies.
arXiv Detail & Related papers (2021-10-14T00:34:51Z)
- StreamHover: Livestream Transcript Summarization and Annotation [54.41877742041611]
We present StreamHover, a framework for annotating and summarizing livestream transcripts.
With a total of over 500 hours of videos annotated with both extractive and abstractive summaries, our benchmark dataset is significantly larger than currently existing annotated corpora.
We show that our model generalizes better and improves performance over strong baselines.
arXiv Detail & Related papers (2021-09-11T02:19:37Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this being integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.