Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior
- URL: http://arxiv.org/abs/2505.11315v1
- Date: Fri, 16 May 2025 14:40:31 GMT
- Title: Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior
- Authors: Chin-Yun Yu, Marco A. Martínez-Ramírez, Junghyun Koo, Wei-Hsiang Liao, Yuki Mitsufuji, György Fazekas
- Abstract summary: Style Transfer with Inference-Time Optimisation (ST-ITO) is a recent approach for transferring the applied effects of a reference audio to a raw audio track. We introduce a Gaussian prior derived from a vocal preset dataset, DiffVox, over the parameter space. The resulting optimisation is equivalent to maximum-a-posteriori estimation.
- Score: 23.448790295875828
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Style Transfer with Inference-Time Optimisation (ST-ITO) is a recent approach for transferring the applied effects of a reference audio to a raw audio track. It optimises the effect parameters to minimise the distance between the style embeddings of the processed audio and the reference. However, this method treats all possible configurations equally and relies solely on the embedding space, which can lead to unrealistic or biased results. We address this pitfall by introducing a Gaussian prior derived from a vocal preset dataset, DiffVox, over the parameter space. The resulting optimisation is equivalent to maximum-a-posteriori estimation. Evaluations on vocal effects transfer on the MedleyDB dataset show significant improvements across metrics compared to baselines, including a blind audio effects estimator, nearest-neighbour approaches, and uncalibrated ST-ITO. The proposed calibration reduces parameter mean squared error by up to 33% and matches the reference style better. Subjective evaluations with 16 participants confirm our method's superiority, especially in limited data regimes. This work demonstrates how incorporating prior knowledge in inference time enhances audio effects transfer, paving the way for more effective and realistic audio processing systems.
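The calibration described in the abstract can be sketched as follows: adding the negative log of a Gaussian prior to the embedding-distance objective turns plain ST-ITO into MAP estimation. This is a toy illustration with hypothetical names (the real system optimises vocal-effect parameters against a learned style-embedding distance, which is not reproduced here):

```python
import numpy as np

def map_objective(theta, embed_dist, mu, sigma_inv, weight=1.0):
    """Embedding-distance term plus the negative log of a Gaussian
    prior N(mu, Sigma) over the effect parameters (up to a constant)."""
    diff = theta - mu
    prior_penalty = 0.5 * diff @ sigma_inv @ diff  # Mahalanobis distance
    return embed_dist(theta) + weight * prior_penalty

# Dummy quadratic "embedding distance" standing in for the real style loss;
# mu and sigma_inv would come from a preset dataset such as DiffVox.
mu = np.zeros(3)
sigma_inv = np.eye(3)
embed_dist = lambda t: np.sum((t - np.ones(3)) ** 2)

# weight=0 recovers uncalibrated ST-ITO; weight>0 pulls the optimum
# toward the dataset mean, penalising unrealistic parameter settings.
uncalibrated = map_objective(np.ones(3), embed_dist, mu, sigma_inv, weight=0.0)
calibrated = map_objective(np.ones(3), embed_dist, mu, sigma_inv, weight=1.0)
```

In this toy setup the uncalibrated objective is zero at `theta = 1`, while the calibrated one charges that point a prior penalty, illustrating how the prior discourages configurations far from the preset distribution.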
Related papers
- Investigating the Sensitivity of Pre-trained Audio Embeddings to Common Effects [4.202522944371801]
We investigate the sensitivity to audio effects of audio embeddings extracted from widely-used foundation models. By applying parameterized audio effects, we analyze the correlation between the deformation trajectories and the effect strength in the embedding space. We find that there exists a direction along which the embeddings move monotonically as the audio effect strength increases, but that the subspace containing the displacements is generally high-dimensional.
arXiv Detail & Related papers (2025-01-27T09:49:08Z) - Optimal Transport Maps are Good Voice Converters [58.42556113055807]
We present a variety of optimal transport algorithms for different data representations, such as mel-spectrograms and latent representation of self-supervised speech models.
For the mel-spectrogram data representation, we achieve strong results in terms of Fréchet Audio Distance (FAD).
We achieved state-of-the-art results and outperformed existing methods even with limited reference speaker data.
arXiv Detail & Related papers (2024-10-17T22:48:53Z) - DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis [12.310318928818546]
We introduce DMOSpeech, a distilled diffusion-based TTS model that achieves both faster inference and superior performance compared to its teacher model. Our comprehensive experiments, validated through extensive human evaluation, show significant improvements in naturalness, intelligibility, and speaker similarity while reducing inference time by orders of magnitude. This work establishes a new framework for aligning speech synthesis with human auditory preferences through direct metric optimization.
arXiv Detail & Related papers (2024-10-14T21:17:58Z) - DITTO: Diffusion Inference-Time T-Optimization for Music Generation [49.90109850026932]
Diffusion Inference-Time T-Optimization (DITTO) is a framework for controlling pre-trained text-to-music diffusion models at inference time.
We demonstrate a surprisingly wide range of applications for music generation, including inpainting, outpainting, and looping, as well as intensity, melody, and musical structure control.
arXiv Detail & Related papers (2024-01-22T18:10:10Z) - Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio Detection [54.20974251478516]
We propose a continual learning algorithm for fake audio detection to overcome catastrophic forgetting.
When fine-tuning a detection network, our approach adaptively computes the direction of weight modification according to the ratio of genuine utterances and fake utterances.
Our method can easily be generalized to related fields, like speech emotion recognition.
arXiv Detail & Related papers (2023-08-07T05:05:49Z) - Improve Noise Tolerance of Robust Loss via Noise-Awareness [60.34670515595074]
We propose a meta-learning method capable of adaptively learning a hyperparameter prediction function, called Noise-Aware-Robust-Loss-Adjuster (NARL-Adjuster for brevity).
We integrate four SOTA robust loss functions with our algorithm, and comprehensive experiments substantiate the general applicability and effectiveness of the proposed method in both its noise tolerance and performance.
arXiv Detail & Related papers (2023-01-18T04:54:58Z) - Speech Enhancement and Dereverberation with Diffusion-based Generative Models [14.734454356396157]
We present a detailed overview of the diffusion process that is based on a differential equation.
We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates.
In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models.
arXiv Detail & Related papers (2022-08-11T13:55:12Z) - Partial Identification with Noisy Covariates: A Robust Optimization Approach [94.10051154390237]
Causal inference from observational datasets often relies on measuring and adjusting for covariates.
We show that this robust optimization approach can extend a wide range of causal adjustment methods to perform partial identification.
Across synthetic and real datasets, we find that this approach provides ATE bounds with a higher coverage probability than existing methods.
arXiv Detail & Related papers (2022-02-22T04:24:26Z) - Spatial mixup: Directional loudness modification as data augmentation for sound event localization and detection [9.0259157539478]
We propose Spatial Mixup as an application of parametric spatial audio effects for data augmentation.
The modifications enhance or suppress signals arriving from certain directions, although the effect is less pronounced.
The method is evaluated with experiments in the DCASE 2021 Task 3 dataset, where spatial mixup increases performance over a non-augmented baseline.
arXiv Detail & Related papers (2021-10-12T16:16:58Z) - Parameterized Channel Normalization for Far-field Deep Speaker Verification [21.237143465298505]
We focus on two parametric normalization methods: per-channel energy normalization (PCEN) and parameterized cepstral mean normalization (PCMN).
We evaluate the performance on Hi-MIA, a recent large-scale far-field speech corpus, with varied microphone and positional settings.
Our methods outperform conventional mel filterbank features, with up to 33.5% and 39.5% relative improvement in equal error rate under matched-microphone and mismatched-microphone conditions, respectively.
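The PCEN front-end mentioned in this snippet can be sketched in a few lines: a per-channel first-order smoother provides an adaptive gain, followed by a root compression. This is a minimal sketch following the standard PCEN formulation, not the paper's exact parameterization; parameter values here are illustrative defaults:

```python
import numpy as np

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization.
    E: (frames, channels) non-negative filterbank energies."""
    M = np.empty_like(E)
    M[0] = E[0]
    for t in range(1, len(E)):
        # First-order IIR smoother tracking the slow energy envelope per channel
        M[t] = (1 - s) * M[t - 1] + s * E[t]
    # Adaptive gain normalization followed by root compression
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r

out = pcen(np.ones((4, 2)))  # constant input -> constant, compressed output
```

Because the smoother adapts per channel, loudness differences across channels (e.g. from varied microphone distances, as in the far-field setting above) are normalized away, which is the motivation for using it in far-field speaker verification.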
arXiv Detail & Related papers (2021-09-24T16:22:31Z) - A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose DiffuSE, a diffusion probabilistic model-based speech enhancement method that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z) - Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z) - Exploring Quality and Generalizability in Parameterized Neural Audio Effects [0.0]
Deep neural networks have shown promise for music audio signal processing applications.
Results to date have tended to be constrained by low sample rates, noise, narrow domains of signal types, and/or lack of parameterized controls.
This work expands on prior research published on modeling nonlinear time-dependent signal processing effects.
arXiv Detail & Related papers (2020-06-10T00:52:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.