Pre-training with Synthetic Patterns for Audio
- URL: http://arxiv.org/abs/2410.00511v1
- Date: Tue, 1 Oct 2024 08:52:35 GMT
- Title: Pre-training with Synthetic Patterns for Audio
- Authors: Yuchi Ishikawa, Tatsuya Komatsu, Yoshimitsu Aoki
- Abstract summary: We propose to pre-train audio encoders using synthetic patterns instead of real audio data.
Our framework achieves performance comparable to models pre-trained on AudioSet-2M.
- Score: 18.769951782213973
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose to pre-train audio encoders using synthetic patterns instead of real audio data. Our proposed framework consists of two key elements. The first is the Masked Autoencoder (MAE), a self-supervised learning framework that learns by reconstructing data from randomly masked inputs. MAEs tend to focus on low-level information such as visual patterns and regularities within data, so what the input portrays matters little, whether it is images, audio mel-spectrograms, or even synthetic patterns. This leads to the second key element: synthetic data. Synthetic data, unlike real audio, is free from privacy and licensing issues. By combining MAEs and synthetic patterns, our framework enables the model to learn generalized feature representations without real data, while avoiding the issues attached to real audio. To evaluate the efficacy of our framework, we conduct extensive experiments across 13 audio tasks and 17 synthetic datasets. The experiments provide insights into which types of synthetic patterns are effective for audio. Our results demonstrate that our framework achieves performance comparable to models pre-trained on AudioSet-2M and partially outperforms image-based pre-training methods.
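To make the recipe concrete, here is a minimal, self-contained PyTorch sketch of MAE-style pre-training on synthetic patterns. It is an illustration under stated assumptions, not the paper's implementation: the sinusoidal pattern generator, the tiny transformer, the 75% mask ratio, and the SimMIM-style in-place mask tokens (the original MAE drops masked tokens from the encoder) are all stand-ins.

```python
# Minimal MAE-style pre-training sketch on synthetic patterns (illustrative;
# the pattern generator, model sizes, and 75% mask ratio are assumptions,
# not the paper's exact configuration).
import torch
import torch.nn as nn

def synthetic_pattern(batch, freq_bins=128, frames=128):
    """Stand-in for a mel-spectrogram: random sinusoidal gratings plus noise."""
    f = torch.linspace(0, 1, freq_bins).view(1, -1, 1)
    t = torch.linspace(0, 1, frames).view(1, 1, -1)
    wf = torch.rand(batch, 1, 1) * 30          # random frequency along the mel axis
    wt = torch.rand(batch, 1, 1) * 30          # random frequency along the time axis
    return torch.sin(wf * f + wt * t) + 0.1 * torch.randn(batch, freq_bins, frames)

class TinyMAE(nn.Module):
    def __init__(self, patch=16, dim=192, mask_ratio=0.75):
        super().__init__()
        self.patch, self.mask_ratio = patch, mask_ratio
        self.embed = nn.Linear(patch * patch, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=4)
        self.decoder = nn.Linear(dim, patch * patch)      # per-patch reconstruction head
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def patchify(self, x):                                # (B, F, T) -> (B, N, patch*patch)
        b = x.shape[0]
        x = x.unfold(1, self.patch, self.patch).unfold(2, self.patch, self.patch)
        return x.reshape(b, -1, self.patch * self.patch)

    def forward(self, x):
        patches = self.patchify(x)
        tokens = self.embed(patches)
        # SimMIM-style simplification: masked tokens are replaced in place rather
        # than dropped from the encoder as in the original MAE.
        mask = torch.rand(tokens.shape[:2], device=x.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        recon = self.decoder(self.encoder(tokens))
        return ((recon - patches) ** 2)[mask].mean()      # loss on masked patches only

model = TinyMAE()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
for step in range(3):                                     # toy training loop
    loss = model(synthetic_pattern(8))
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"step {step}: loss {loss.item():.3f}")
```

Because the reconstruction target is only low-level structure, the same loop works unchanged whether `synthetic_pattern` returns gratings, noise, or real mel-spectrograms, which is the point of the paper's design.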
Related papers
- Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data [69.7174072745851]
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data.
To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization.
To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models.
arXiv Detail & Related papers (2024-10-02T22:05:36Z)
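The Synthio entry names preference optimization without detail; one common instantiation is a DPO-style objective that pushes the text-to-audio model toward generations judged closer to the target dataset. The sketch below is that generic objective only, with hypothetical inputs; it is not Synthio's actual training code.

```python
# Generic DPO-style preference loss (illustrative; the entry above only says
# "preference optimization", so this is one standard choice, not Synthio's).
import torch
import torch.nn.functional as F

def dpo_loss(logp_pref, logp_rej, ref_logp_pref, ref_logp_rej, beta=0.1):
    """Standard DPO objective: raise the model's likelihood margin for preferred
    generations over rejected ones, relative to a frozen reference model."""
    margin = beta * ((logp_pref - ref_logp_pref) - (logp_rej - ref_logp_rej))
    return -F.logsigmoid(margin).mean()

# Toy usage: random log-likelihoods stand in for T2A model scores, where
# "preferred" means synthetic audio judged closer to the small target dataset.
lp, lr = torch.randn(4), torch.randn(4)
print(dpo_loss(lp, lr, lp - 0.1, lr + 0.1).item())
```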
- Measuring Sound Symbolism in Audio-visual Models [21.876743976994614]
This study investigates whether pre-trained audio-visual models demonstrate associations between sounds and visual representations.
Our findings reveal connections to human language processing, providing insights into cognitive architectures and machine learning strategies.
arXiv Detail & Related papers (2024-09-18T20:33:54Z)
- Contrastive Learning from Synthetic Audio Doppelgängers [1.3754952818114714]
We propose a solution to both the data scale and transformation limitations, leveraging synthetic audio.
By randomly perturbing the parameters of a sound synthesizer, we generate audio doppelgängers: synthetic positive pairs with causally manipulated variations in timbre, pitch, and temporal envelopes.
Despite the shift to randomly generated synthetic data, our method produces strong representations, competitive with real data on standard audio classification benchmarks.
arXiv Detail & Related papers (2024-06-09T21:44:06Z)
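The doppelgänger construction reduces to: sample synthesizer parameters, perturb them slightly, render both settings, and treat the two renders as a positive pair under an InfoNCE-style contrastive loss. Below is a toy sketch; the sine-plus-decay synthesizer, the 5% perturbation, and the linear encoder are illustrative assumptions rather than the paper's setup.

```python
# Contrastive learning from synthetic "doppelgängers" (illustrative sketch;
# the toy synthesizer, perturbation scale, and encoder are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

def synth(params, sr=16000, dur=1.0):
    """Toy synthesizer: a sine at `freq` Hz shaped by an exponential decay."""
    t = torch.arange(int(sr * dur)) / sr
    freq, decay = params
    return torch.sin(2 * torch.pi * freq * t) * torch.exp(-decay * t)

def doppelganger_pair(jitter=0.05):
    """Render a sound and a slightly perturbed copy as a positive pair."""
    base = torch.tensor([220.0 + 1500.0 * torch.rand(1).item(),   # frequency (Hz)
                         1.0 + 4.0 * torch.rand(1).item()])       # decay rate
    pert = base * (1 + jitter * torch.randn(2))                   # causal perturbation
    return synth(base), synth(pert)

encoder = nn.Sequential(nn.Linear(16000, 256), nn.ReLU(), nn.Linear(256, 128))

def info_nce(za, zb, tau=0.1):
    za, zb = F.normalize(za, dim=-1), F.normalize(zb, dim=-1)
    logits = za @ zb.T / tau                  # (B, B); diagonal entries are positives
    return F.cross_entropy(logits, torch.arange(len(za)))

pairs = [doppelganger_pair() for _ in range(8)]
za = encoder(torch.stack([a for a, _ in pairs]))
zb = encoder(torch.stack([b for _, b in pairs]))
print("InfoNCE loss:", info_nce(za, zb).item())
```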
- Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark [65.79402756995084]
Real Acoustic Fields (RAF) is a new dataset that captures real acoustic room data from multiple modalities.
RAF is the first dataset to provide densely captured room acoustic data.
arXiv Detail & Related papers (2024-03-27T17:59:56Z)
- Exploring the Viability of Synthetic Audio Data for Audio-Based Dialogue State Tracking [19.754211231250544]
We develop cascading and end-to-end models, train them with our synthetic audio dataset, and test them on actual human speech data.
Experimental results showed that models trained solely on synthetic datasets can generalize to real human voice data.
arXiv Detail & Related papers (2023-12-04T12:25:46Z)
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper, we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24 kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
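The transfer step described above is plain weight reuse: a decoder pre-trained as part of an audio-only model initializes the decoder of the video-to-speech model. A generic PyTorch sketch follows; the module names and GRU shapes are hypothetical, not the paper's architecture.

```python
# Warm-starting a video-to-speech decoder from an audio-only pre-trained
# decoder (generic sketch; module names and shapes are hypothetical).
import torch.nn as nn

class AudioAutoencoder(nn.Module):
    """Audio-only model, assumed pre-trained on large unlabeled audio."""
    def __init__(self, dim=256):
        super().__init__()
        self.encoder = nn.GRU(80, dim, batch_first=True)   # mel frames in
        self.decoder = nn.GRU(dim, 80, batch_first=True)   # mel frames out

class VideoToSpeech(nn.Module):
    """Video-to-speech model whose decoder matches the pre-trained one."""
    def __init__(self, dim=256):
        super().__init__()
        self.video_encoder = nn.GRU(512, dim, batch_first=True)  # visual features in
        self.decoder = nn.GRU(dim, 80, batch_first=True)

pretrained = AudioAutoencoder()   # assume weights learned on audio-only data
v2s = VideoToSpeech()
v2s.decoder.load_state_dict(pretrained.decoder.state_dict())  # warm-start the decoder
```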
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining (CLIP) model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
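Schematically, the bridge works by conditioning an audio diffusion model on the CLIP image embedding of a video frame during training; because CLIP embeds images and text in a shared space, a text embedding can stand in at inference. The skeleton below shows only the conditioning plumbing; the MLP denoiser and the toy noise schedule are illustrative assumptions.

```python
# Conditioning an audio diffusion denoiser on a CLIP frame embedding
# (skeleton only; the denoiser and noise schedule are illustrative).
import torch
import torch.nn as nn

class CondDenoiser(nn.Module):
    """Predicts the noise added to a mel-spectrogram, given a CLIP embedding."""
    def __init__(self, mel_bins=80, frames=64, clip_dim=512, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mel_bins * frames + clip_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, mel_bins * frames))

    def forward(self, noisy_mel, t, clip_emb):
        x = torch.cat([noisy_mel.flatten(1), clip_emb, t[:, None]], dim=1)
        return self.net(x).view_as(noisy_mel)

B = 4
mel = torch.randn(B, 80, 64)                 # target audio as a mel-spectrogram
clip_emb = torch.randn(B, 512)               # stand-in for a CLIP image embedding
t = torch.rand(B)                            # diffusion time in [0, 1]
noise = torch.randn_like(mel)
noisy = (1 - t)[:, None, None] * mel + t[:, None, None] * noise  # toy linear schedule
loss = ((CondDenoiser()(noisy, t, clip_emb) - noise) ** 2).mean()
print("denoising loss:", loss.item())
```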
- Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
The Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) combines contrastive learning and masked data modeling to learn a joint audio-visual representation.
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z)
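CAV-MAE's combination can be summarized as one model trained with two objectives at once: masked reconstruction within each modality plus a contrastive term across modalities. A minimal sketch of such a combined loss follows, with toy tensors standing in for encoder outputs; the masking, pooling, and 0.01 contrastive weight are assumptions, not the paper's exact configuration.

```python
# Combined contrastive + masked-reconstruction objective in the spirit of
# CAV-MAE (illustrative; encoders, masking, and the 0.01 weight are assumptions).
import torch
import torch.nn.functional as F

def masked_recon_loss(x, recon, mask):
    """MSE on masked positions only, as in MAE."""
    return ((recon - x) ** 2 * mask).sum() / mask.sum()

def contrastive_loss(z_audio, z_video, tau=0.05):
    """Symmetric InfoNCE across modalities; matched clips are positives."""
    za, zv = F.normalize(z_audio, dim=-1), F.normalize(z_video, dim=-1)
    logits = za @ zv.T / tau
    targets = torch.arange(len(za))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Toy tensors standing in for encoder outputs and reconstructions.
B, D = 8, 256
audio, video = torch.randn(B, D), torch.randn(B, D)
mask = (torch.rand(B, D) < 0.75).float()
recon_a, recon_v = torch.randn(B, D), torch.randn(B, D)
za, zv = torch.randn(B, D), torch.randn(B, D)           # pooled joint features

loss = (masked_recon_loss(audio, recon_a, mask)
        + masked_recon_loss(video, recon_v, mask)
        + 0.01 * contrastive_loss(za, zv))              # small contrastive weight
print("total loss:", loss.item())
```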