Discrete Optimal Transport and Voice Conversion
- URL: http://arxiv.org/abs/2505.04382v2
- Date: Thu, 10 Jul 2025 12:54:15 GMT
- Title: Discrete Optimal Transport and Voice Conversion
- Authors: Anton Selitskiy, Maitreya Kocharekar,
- Abstract summary: We employ discrete optimal transport mapping to align audio embeddings between speakers. Applying discrete optimal transport as a post-processing step in audio generation can lead to the incorrect classification of synthetic audio as real.
- Score: 0.552480439325792
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we address the voice conversion (VC) task using a vector-based interface. To align audio embeddings between speakers, we employ discrete optimal transport mapping. Our evaluation results demonstrate the high quality and effectiveness of this method. Additionally, we show that applying discrete optimal transport as a post-processing step in audio generation can lead to the incorrect classification of synthetic audio as real.
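As a hedged illustration of the abstract's core idea: for two equal-size sets of embeddings with uniform weights, the discrete optimal transport map reduces to a linear assignment problem, solvable exactly with the Hungarian algorithm. The sketch below uses random vectors as stand-ins for source- and target-speaker embeddings; the array names and sizes are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
# Toy stand-ins for speaker embedding sets (real embeddings would
# come from a speech encoder).
src = rng.normal(size=(64, 16))           # 64 source embeddings, dim 16
tgt = rng.normal(loc=2.0, size=(64, 16))  # 64 target embeddings, dim 16

# Pairwise squared-Euclidean transport cost matrix.
cost = ((src[:, None, :] - tgt[None, :, :]) ** 2).sum(axis=-1)

# With equal-size sets and uniform weights, discrete OT is a
# linear assignment problem: each source point maps to one target.
row_ind, col_ind = linear_sum_assignment(cost)

# The transport map sends each source embedding to its matched target.
mapped = tgt[col_ind]
print(mapped.shape)  # (64, 16)
```

For unequal set sizes or non-uniform weights, one would instead solve the full discrete OT linear program (e.g. with a dedicated OT library) or use entropic regularisation (Sinkhorn iterations).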
Related papers
- Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples [55.2480439325792]
Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs. These models often hallucinate non-existent sound events, reducing their reliability in real-world applications. We propose LISTEN, a contrastive-like training method that enhances ALLMs' ability to distinguish between present and absent sounds.
arXiv Detail & Related papers (2025-05-20T15:44:01Z)
- Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior [23.448790295875828]
Style Transfer with Inference-Time Optimisation (ST-ITO) is a recent approach for transferring the applied effects of a reference audio to a raw audio track. We introduce a Gaussian prior derived from a vocal preset dataset, DiffVox, over the parameter space. The resulting optimisation is equivalent to maximum-a-posteriori estimation.
arXiv Detail & Related papers (2025-05-16T14:40:31Z)
- LAVCap: LLM-based Audio-Visual Captioning using Optimal Transport [16.108957027494604]
LAVCap is a large language model (LLM)-based audio-visual captioning framework. It integrates visual information with audio to improve audio captioning performance. It outperforms existing state-of-the-art methods on the AudioCaps dataset.
arXiv Detail & Related papers (2025-01-16T04:53:29Z)
- Optimal Transport Maps are Good Voice Converters [58.42556113055807]
We present a variety of optimal transport algorithms for different data representations, such as mel-spectrograms and latent representation of self-supervised speech models.
For the mel-spectrogram data representation, we achieve strong results in terms of Fréchet Audio Distance (FAD).
We achieved state-of-the-art results and outperformed existing methods even with limited reference speaker data.
arXiv Detail & Related papers (2024-10-17T22:48:53Z)
- Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning [43.43337861152684]
Voicebox Adapter is a novel approach that integrates fine-grained conditions into a pre-trained Voicebox speech generation model.
Our experiment shows that the LoRA with bias-tuning configuration yields the best performance.
arXiv Detail & Related papers (2024-06-10T13:31:18Z)
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video. We propose Frieren, a V2A model based on rectified flow matching. Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
- Audio Contrastive based Fine-tuning [21.145936249583446]
We introduce Audio Contrastive-based Fine-tuning (AudioConFit) as an efficient approach characterised by robust generalisability.
Empirical experiments on a variety of audio classification tasks demonstrate the effectiveness and robustness of our approach.
arXiv Detail & Related papers (2023-09-21T08:59:13Z)
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- Efficient Audio Captioning Transformer with Patchout and Text Guidance [74.59739661383726]
We propose a full Transformer architecture that utilizes Patchout as proposed in [1], significantly reducing the computational complexity and avoiding overfitting.
The caption generation is partly conditioned on textual AudioSet tags extracted by a pre-trained classification model.
Our proposed method received the Judges' Award in Task 6A of the DCASE 2022 Challenge.
arXiv Detail & Related papers (2023-04-06T07:58:27Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method utilizes both an acoustic model, trained for the task of automatic speech recognition, together with melody extracted features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
- Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
arXiv Detail & Related papers (2020-05-20T09:53:06Z)
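The relative-position idea in the last entry can be sketched with a toy NumPy attention head. This is a generic relative-bias formulation (a learnable bias per offset i − j added to the attention logits), not necessarily the paper's exact scheme; all names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
T, d = 8, 4                    # sequence length, head dimension
q = rng.normal(size=(T, d))    # queries
k = rng.normal(size=(T, d))    # keys
rel_bias = rng.normal(size=(2 * T - 1,))  # one bias per relative offset

# logits[i, j] = q_i . k_j + bias[i - j]; scores depend on relative,
# not absolute, positions.
offsets = np.arange(T)[:, None] - np.arange(T)[None, :]  # i - j
logits = q @ k.T + rel_bias[offsets + T - 1]

# Softmax over keys.
attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
print(attn.shape)  # (8, 8)
```

Because the bias is indexed by i − j, the same attention pattern can shift along the sequence, which is one reason relative encodings cope better with the variable-length distributions found in speech data.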
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.