Contrastive Learning from Synthetic Audio Doppelgangers
- URL: http://arxiv.org/abs/2406.05923v1
- Date: Sun, 9 Jun 2024 21:44:06 GMT
- Title: Contrastive Learning from Synthetic Audio Doppelgangers
- Authors: Manuel Cherep, Nikhil Singh
- Abstract summary: We propose a solution to both the data scale and transformation limitations, leveraging synthetic audio.
By randomly perturbing the parameters of a sound synthesizer, we generate audio doppelgängers, synthetic positive pairs with causally manipulated variations in timbre, pitch, and temporal envelopes.
Despite the shift to randomly generated synthetic data, our method produces strong representations, competitive with real data on standard audio classification benchmarks.
- Score: 1.3754952818114714
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning robust audio representations currently demands extensive datasets of real-world sound recordings. By applying artificial transformations to these recordings, models can learn to recognize similarities despite subtle variations through techniques like contrastive learning. However, these transformations are only approximations of the true diversity found in real-world sounds, which are generated by complex interactions of physical processes, from vocal cord vibrations to the resonance of musical instruments. We propose a solution to both the data scale and transformation limitations, leveraging synthetic audio. By randomly perturbing the parameters of a sound synthesizer, we generate audio doppelgängers, synthetic positive pairs with causally manipulated variations in timbre, pitch, and temporal envelopes. These variations, difficult to achieve through transformations of existing audio, provide a rich source of contrastive information. Despite the shift to randomly generated synthetic data, our method produces strong representations, competitive with real data on standard audio classification benchmarks. Notably, our approach is lightweight, requires no data storage, and has only a single hyperparameter, which we extensively analyze. We offer this method as a complement to existing strategies for contrastive learning in audio, using synthesized sounds to reduce the data burden on practitioners.
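The pair-generation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy additive synthesizer, the parameter names (`pitch`, `brightness`, `attack`, `decay`), the uniform perturbation, and the name `delta` for the single hyperparameter are all assumptions made for illustration. It only shows how perturbing synthesizer parameters, rather than transforming recorded audio, might yield a positive pair generated on the fly.

```python
import math
import random

SAMPLE_RATE = 16_000  # Hz; assumed value for this sketch
DURATION = 0.5        # seconds; assumed value for this sketch
PARAM_NAMES = ("pitch", "brightness", "attack", "decay")  # hypothetical parameters


def sample_params(rng):
    """Draw normalized synthesizer parameters uniformly from [0, 1]."""
    return {name: rng.random() for name in PARAM_NAMES}


def perturb(params, delta, rng):
    """Shift each parameter by uniform noise in [-delta, delta], clipped to [0, 1]."""
    return {
        name: min(1.0, max(0.0, value + rng.uniform(-delta, delta)))
        for name, value in params.items()
    }


def render(params):
    """Toy synthesizer: four harmonics shaped by an attack/decay envelope."""
    freq = 110.0 * 2 ** (3 * params["pitch"])   # fundamental, 110 Hz to 880 Hz
    attack = 0.005 + 0.1 * params["attack"]     # attack time in seconds
    decay = 1.0 + 20.0 * params["decay"]        # exponential decay rate
    samples = []
    for i in range(int(SAMPLE_RATE * DURATION)):
        t = i / SAMPLE_RATE
        envelope = min(1.0, t / attack) * math.exp(-decay * t)
        # brightness sets how quickly higher harmonics fall off (timbre)
        tone = sum(
            params["brightness"] ** (h - 1) * math.sin(2 * math.pi * h * freq * t)
            for h in range(1, 5)
        )
        samples.append(envelope * tone)
    return samples


def make_pair(delta, seed=None):
    """Return (params, audio) for a random anchor and its perturbed doppelganger."""
    rng = random.Random(seed)
    anchor = sample_params(rng)
    positive = perturb(anchor, delta, rng)
    return (anchor, render(anchor)), (positive, render(positive))
```

In this sketch, a small `delta` gives near-identical sounds and a large one gives loosely related sounds, which is one plausible reading of why a single hyperparameter would control the difficulty of the contrastive task. Because pairs are rendered on demand, nothing has to be stored on disk.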
Related papers
- Robust AI-Synthesized Speech Detection Using Feature Decomposition Learning and Synthesizer Feature Augmentation [52.0893266767733]
We propose a robust deepfake speech detection method that employs feature decomposition to learn synthesizer-independent content features.
To enhance the model's robustness to different synthesizer characteristics, we propose a synthesizer feature augmentation strategy.
arXiv Detail & Related papers (2024-11-14T03:57:21Z) - Developing an Effective Training Dataset to Enhance the Performance of AI-based Speaker Separation Systems [0.3277163122167434]
We propose a novel method for constructing a realistic training set that includes mixture signals and corresponding ground truths for each speaker.
We get a 1.65 dB improvement in Scale Invariant Signal to Distortion Ratio (SI-SDR) for speaker separation accuracy in realistic mixing.
arXiv Detail & Related papers (2024-11-13T06:55:18Z) - Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data [69.7174072745851]
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data.
To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization.
To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models.
arXiv Detail & Related papers (2024-10-02T22:05:36Z) - Synthesizer Sound Matching Using Audio Spectrogram Transformers [2.5944208050492183]
We introduce a synthesizer sound matching model based on the Audio Spectrogram Transformer.
We show that this model can reconstruct parameters of samples generated from a set of 16 parameters.
We also provide audio examples demonstrating the out-of-domain model performance in emulating vocal imitations.
arXiv Detail & Related papers (2024-07-23T16:58:14Z) - Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z) - Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z) - Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z) - Using growth transform dynamical systems for spatio-temporal data sonification [9.721342507747158]
Sonification, or encoding information in meaningful audio signatures, has several advantages in augmenting or replacing traditional visualization methods for human-in-the-loop decision-making.
This paper presents a novel framework for sonifying high-dimensional data using a complex growth transform dynamical system model.
Our algorithm takes as input the data and optimization parameters underlying the learning or prediction task and combines it with the psychoacoustic parameters defined by the user.
arXiv Detail & Related papers (2021-08-21T16:25:59Z) - VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
arXiv Detail & Related papers (2020-03-30T16:05:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.