Optimal Transport Maps are Good Voice Converters
- URL: http://arxiv.org/abs/2411.02402v1
- Date: Thu, 17 Oct 2024 22:48:53 GMT
- Title: Optimal Transport Maps are Good Voice Converters
- Authors: Arip Asadulaev, Rostislav Korst, Vitalii Shutov, Alexander Korotin, Yaroslav Grebnyak, Vahe Egiazarian, Evgeny Burnaev
- Abstract summary: We present a variety of optimal transport algorithms for different data representations, such as mel-spectrograms and latent representations of self-supervised speech models.
For the mel-spectrogram data representation, we achieve strong results in terms of Frechet Audio Distance (FAD).
We achieved state-of-the-art results and outperformed existing methods even with limited reference speaker data.
- Score: 58.42556113055807
- Abstract: Recently, neural network-based methods for computing optimal transport maps have been effectively applied to style transfer problems. However, the application of these methods to voice conversion is underexplored. In our paper, we fill this gap by investigating optimal transport as a framework for voice conversion. We present a variety of optimal transport algorithms designed for different data representations, such as mel-spectrograms and latent representations of self-supervised speech models. For the mel-spectrogram data representation, we achieve strong results in terms of Frechet Audio Distance (FAD). This performance is consistent with our theoretical analysis, which suggests that our method provides an upper bound on the FAD between the target and generated distributions. Within the latent space of the WavLM encoder, we achieved state-of-the-art results and outperformed existing methods even with limited reference speaker data.
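For reference, FAD is commonly defined as the Frechet distance between two Gaussians fitted to embeddings of reference and generated audio. The block below is a sketch of that standard definition and is not quoted from the paper; the symbols $\mu_r, \Sigma_r$ (mean and covariance of the reference embeddings) and $\mu_g, \Sigma_g$ (mean and covariance of the generated embeddings) are notation assumed here.

```latex
\mathrm{FAD}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

This expression coincides with the squared 2-Wasserstein distance between the two fitted Gaussians, which suggests why an optimal transport objective relates naturally to FAD.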
Related papers
- Gaussian Flow Bridges for Audio Domain Transfer with Unpaired Data [20.181313153447412]
This paper investigates the potential of Gaussian Flow Bridges, an emerging approach in generative modeling, for this problem.
The presented framework addresses the transport problem across different distributions of audio signals through the implementation of a series of two deterministic probability flows.
To address identified challenges on maintaining the speech content consistent, we recommend a training strategy that incorporates chunk-based minibatch Optimal Transport couplings of data samples and noise.
arXiv Detail & Related papers (2024-05-29T20:23:01Z) - Surrogate Modeling of Trajectory Map-matching in Urban Road Networks using Transformer Sequence-to-Sequence Model [1.3812010983144802]
This paper introduces a deep-learning model, specifically the transformer-based encoder-decoder model, to perform as a surrogate for offline map-matching algorithms.
The model is trained and evaluated using GPS traces collected in Manhattan, New York.
arXiv Detail & Related papers (2024-04-18T18:39:23Z) - Efficient Neural Network Approaches for Conditional Optimal Transport with Applications in Bayesian Inference [1.740133468405535]
We present two neural network approaches that approximate the solutions of static and conditional optimal transport (COT) problems.
We demonstrate both algorithms, comparing them with competing state-of-the-art approaches, using benchmark datasets and simulation-based inverse problems.
arXiv Detail & Related papers (2023-10-25T20:20:09Z) - Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation [0.9285295512807729]
We propose a data augmentation technique for generating hallucinated audio captions and show that similarity based on an audio-text shared latent space is suitable for detecting hallucination.
We then propose a parameter efficient inference time faithful decoding algorithm that enables smaller audio captioning models with performance equivalent to larger models trained with more data.
arXiv Detail & Related papers (2023-09-06T19:42:52Z) - Leveraging Symmetrical Convolutional Transformer Networks for Speech to Singing Voice Style Transfer [49.01417720472321]
We develop a novel neural network architecture, called SymNet, which models the alignment of the input speech with the target melody.
Experiments are performed on the NUS and NHSS datasets which consist of parallel data of speech and singing voice.
arXiv Detail & Related papers (2022-08-26T02:54:57Z) - Aerial Images Meet Crowdsourced Trajectories: A New Approach to Robust Road Extraction [110.61383502442598]
We introduce a novel neural network framework termed Cross-Modal Message Propagation Network (CMMPNet).
CMMPNet is composed of two deep Auto-Encoders for modality-specific representation learning and a tailor-designed Dual Enhancement Module for cross-modal representation refinement.
Experiments on three real-world benchmarks demonstrate the effectiveness of our CMMPNet for robust road extraction.
arXiv Detail & Related papers (2021-11-30T04:30:10Z) - Timbre Transfer with Variational Auto Encoding and Cycle-Consistent Adversarial Networks [0.6445605125467573]
This research project investigates the application of deep learning to timbre transfer, where the timbre of a source audio can be converted to the timbre of a target audio with minimal loss in quality.
The adopted approach combines Variational Autoencoders with Generative Adversarial Networks to construct meaningful representations of the source audio and produce realistic generations of the target audio.
arXiv Detail & Related papers (2021-09-05T15:06:53Z) - Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z) - Feature Robust Optimal Transport for High-dimensional Data [125.04654605998618]
We propose feature-robust optimal transport (FROT) for high-dimensional data, which solves high-dimensional OT problems using feature selection to avoid the curse of dimensionality.
We show that the FROT algorithm achieves state-of-the-art performance in real-world semantic correspondence datasets.
arXiv Detail & Related papers (2020-05-25T14:07:16Z) - Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
arXiv Detail & Related papers (2020-05-20T09:53:06Z)