Learning General Audio Representations with Large-Scale Training of
Patchout Audio Transformers
- URL: http://arxiv.org/abs/2211.13956v1
- Date: Fri, 25 Nov 2022 08:39:12 GMT
- Title: Learning General Audio Representations with Large-Scale Training of
Patchout Audio Transformers
- Authors: Khaled Koutini, Shahed Masoudian, Florian Schmid, Hamid Eghbal-zadeh,
Jan Schlüter, Gerhard Widmer
- Abstract summary: We study the use of audio transformers trained on large-scale datasets to learn general-purpose representations.
Our results show that representations extracted by audio transformers outperform CNN representations.
- Score: 6.002503434201551
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of supervised deep learning methods is largely due to their
ability to learn relevant features from raw data. Deep Neural Networks (DNNs)
trained on large-scale datasets are capable of capturing a diverse set of
features, and of learning a representation that generalizes to unseen tasks
and datasets from the same domain. Hence, these models can be used as
powerful feature extractors, in combination with shallower models as
classifiers, for smaller tasks and datasets where the amount of training data
is insufficient for learning an end-to-end model from scratch. For the past
several years, Convolutional Neural Networks (CNNs) have largely been the
method of choice for audio processing. Recently, however, attention-based transformer
models have demonstrated great potential in supervised settings, outperforming
CNNs. In this work, we investigate the use of audio transformers trained on
large-scale datasets to learn general-purpose representations. We study how the
different setups in these audio transformers affect the quality of their
embeddings. We experiment with the models' time resolution, extracted embedding
level, and receptive fields in order to see how they affect performance on a
variety of tasks and datasets, following the HEAR 2021 NeurIPS challenge
evaluation setup. Our results show that representations extracted by audio
transformers outperform CNN representations. Furthermore, we show that
transformers trained on AudioSet can be extremely effective representation
extractors for a wide range of downstream tasks.
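To make the frozen-extractor-plus-shallow-classifier setup concrete, here is a minimal sketch following the HEAR 2021 API (load_model / get_scene_embeddings) used by the evaluation setup above. The hear21passt import path refers to the authors' released package and is an assumption; the dummy data stands in for a real downstream dataset.
```python
# Minimal sketch: frozen audio transformer as a feature extractor,
# shallow classifier trained on top (HEAR 2021-style evaluation).
import torch
from sklearn.linear_model import LogisticRegression
from hear21passt.base import load_model, get_scene_embeddings  # assumed path

model = load_model()  # pretrained Patchout transformer (PaSST)

def embed(waveforms: torch.Tensor) -> torch.Tensor:
    """waveforms: (n_sounds, n_samples); returns (n_sounds, emb_dim)."""
    with torch.no_grad():
        return get_scene_embeddings(waveforms, model)

# dummy stand-ins for a small downstream task (PaSST expects 32 kHz audio)
train_wav, test_wav = torch.randn(8, 32000 * 10), torch.randn(4, 32000 * 10)
train_y, test_y = [0, 1] * 4, [0, 1] * 2

# linear probe: the transformer stays frozen, only the classifier is trained
clf = LogisticRegression(max_iter=1000).fit(embed(train_wav).numpy(), train_y)
print("downstream accuracy:", clf.score(embed(test_wav).numpy(), test_y))
```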
Related papers
- Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge
Distillation [6.617487928813374]
We propose a training procedure for efficient CNNs based on offline Knowledge Distillation (KD) from high-performing yet complex transformers.
We provide models at different complexity levels, scaling from low-complexity models up to a new state-of-the-art performance of 0.483 mAP on AudioSet.
arXiv Detail & Related papers (2022-11-09T09:58:22Z)
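As a reference for the distillation objective mentioned above, here is a generic sketch of offline knowledge distillation. The single-label form is shown for clarity and is an assumption: AudioSet tagging is multi-label, where a sigmoid/BCE variant would replace the softmax terms.
```python
# Generic offline knowledge distillation: the student CNN is trained
# against hard labels and the temperature-softened predictions of a
# frozen, high-performing transformer teacher.
import torch.nn.functional as F

T, ALPHA = 2.0, 0.5  # temperature and soft/hard mixing weight (assumed)

def kd_loss(student_logits, teacher_logits, labels):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients match the hard-label term
    hard = F.cross_entropy(student_logits, labels)
    return ALPHA * soft + (1 - ALPHA) * hard
```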
- Multi-dataset Training of Transformers for Robust Action Recognition [75.5695991766902]
We study the task of robust feature representations, aiming to generalize well on multiple datasets for action recognition.
Here, we propose a novel multi-dataset training paradigm, MultiTrain, with the design of two new loss terms, namely informative loss and projection loss.
We verify the effectiveness of our method on five challenging datasets: Kinetics-400, Kinetics-700, Moments-in-Time, ActivityNet, and Something-Something-V2.
arXiv Detail & Related papers (2022-09-26T01:30:43Z)
- BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping [19.071463356974387]
This work extends existing methods based on self-supervised learning by bootstrapping, proposes various encoder architectures, and explores the effects of using different pre-training datasets.
We present a novel training framework that yields a hybrid audio representation, combining handcrafted and data-driven learned audio features.
All the proposed representations were evaluated within the HEAR NeurIPS 2021 challenge for auditory scene classification and timestamp detection tasks.
arXiv Detail & Related papers (2022-06-24T02:26:40Z)
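A minimal sketch of such a hybrid representation, assuming the handcrafted part consists of log-mel summary statistics; the encoder below is a random stand-in for a bootstrapped (BYOL-style) network, and the sample rate and dimensions are assumptions.
```python
# Illustrative hybrid audio representation: concatenate handcrafted
# features (log-mel statistics) with a learned embedding.
import torch
import torchaudio

melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=160, n_mels=64)

# placeholder: maps an (n_samples,) waveform to a 512-d learned embedding
learned_encoder = lambda wave: torch.randn(512)

def hybrid_embedding(wave: torch.Tensor) -> torch.Tensor:
    """wave: (n_samples,) mono audio at 16 kHz (assumed rate)."""
    logmel = torch.log(melspec(wave) + 1e-6)                    # (64, frames)
    handcrafted = torch.cat([logmel.mean(-1), logmel.std(-1)])  # (128,)
    return torch.cat([handcrafted, learned_encoder(wave)])      # (640,)

print(hybrid_embedding(torch.randn(16000 * 5)).shape)  # torch.Size([640])
```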
- CHALLENGER: Training with Attribution Maps [63.736435657236505]
We show that utilizing attribution maps for training neural networks can improve regularization of models and thus increase performance.
In particular, we show that our generic domain-independent approach yields state-of-the-art results in vision, natural language processing and on time series tasks.
arXiv Detail & Related papers (2022-05-30T13:34:46Z)
- Self-supervised Audiovisual Representation Learning for Remote Sensing Data [96.23611272637943]
We propose a self-supervised approach for pre-training deep neural networks in remote sensing.
This is done in a completely label-free manner, exploiting the correspondence between geo-tagged audio recordings and remote sensing imagery.
We show that our approach outperforms existing pre-training strategies for remote sensing imagery.
arXiv Detail & Related papers (2021-08-02T07:50:50Z)
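A common label-free way to exploit such audio-visual correspondence is a symmetric contrastive (InfoNCE-style) objective over co-located pairs; the sketch below shows that generic form, not necessarily the paper's exact loss, and the temperature is an assumed value.
```python
# Generic symmetric contrastive objective over co-located audio/image
# pairs: matching pairs are pulled together, other pairs in the batch
# serve as negatives.
import torch
import torch.nn.functional as F

def audiovisual_nce(audio_emb, image_emb, tau=0.07):
    """audio_emb, image_emb: (batch, dim); row i of both sides comes
    from the same geographic location."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = a @ v.t() / tau                       # pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)
    # symmetric: audio-to-image and image-to-audio retrieval directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```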
- Do sound event representations generalize to other audio tasks? A case study in audio transfer learning [20.572846660950812]
In this paper, we investigate the transfer learning capacity of audio representations obtained from neural networks trained on a large-scale sound event detection dataset.
We show that such simple linear transfer is already powerful enough to achieve high performance on the downstream tasks.
arXiv Detail & Related papers (2021-06-21T18:04:59Z)
- Audio Transformers: Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions [6.370905925442655]
We propose applying Transformer based architectures without convolutional layers to raw audio signals.
Our model outperforms convolutional models, producing state-of-the-art results.
We further improve the performance of Transformer architectures by using techniques such as pooling, inspired by convolutional networks.
arXiv Detail & Related papers (2021-05-01T19:38:30Z)
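An illustrative sketch of such CNN-inspired token pooling between transformer blocks; the kernel size and tensor shapes are assumptions.
```python
# CNN-inspired token pooling: average adjacent tokens to halve the
# sequence length, reducing the cost of subsequent attention layers
# (analogous to progressive downsampling in a convolutional network).
import torch
import torch.nn.functional as F

def pool_tokens(x: torch.Tensor, kernel: int = 2) -> torch.Tensor:
    """x: (batch, n_tokens, dim) -> (batch, n_tokens // kernel, dim)."""
    return F.avg_pool1d(x.transpose(1, 2), kernel).transpose(1, 2)

x = torch.randn(4, 512, 192)    # e.g. 512 spectrogram-patch tokens
print(pool_tokens(x).shape)     # torch.Size([4, 256, 192])
```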
- Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence, and show that self-supervised models trained on our automatically constructed data achieve downstream performance similar to models trained on existing video datasets of similar scale.
arXiv Detail & Related papers (2021-01-26T14:27:47Z)
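A greatly simplified sketch of correspondence-based curation: score each video by the agreement of its audio and visual embeddings and keep the top-k. The paper formulates this as a subset optimization, so the greedy top-k selection here only illustrates the selection criterion.
```python
# Simplified curation: rank videos by the cosine agreement of their
# audio and visual embeddings and keep the k best-matching ones.
import torch
import torch.nn.functional as F

def curate(audio_emb: torch.Tensor, video_emb: torch.Tensor, k: int):
    """audio_emb, video_emb: (n_videos, dim) from pretrained encoders."""
    scores = F.cosine_similarity(audio_emb, video_emb, dim=-1)
    return torch.topk(scores, k).indices  # indices of the curated subset
```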
- High-Fidelity Audio Generation and Representation Learning with Guided Adversarial Autoencoder [2.6770746621108654]
We propose a new autoencoder-based model named "Guided Adversarial Autoencoder" (GAAE).
Our proposed model can generate audio of superior quality, indistinguishable from real audio samples.
arXiv Detail & Related papers (2020-06-01T12:19:32Z)
- Laplacian Denoising Autoencoder [114.21219514831343]
We propose to learn data representations with a novel type of denoising autoencoder.
The noisy input data is generated by corrupting latent clean data in the gradient domain.
Experiments on several visual benchmarks demonstrate that better representations can be learned with the proposed approach.
arXiv Detail & Related papers (2020-03-30T16:52:39Z)
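A rough sketch of the gradient-domain corruption idea, assuming noise is injected after a Laplacian filter; the paper's full pipeline (corrupting latent clean data and reconstructing from it) differs in detail.
```python
# Gradient-domain corruption for a denoising autoencoder: noise is
# injected into the Laplacian response of the image rather than into
# raw pixels. A simplification of the paper's pipeline.
import torch
import torch.nn.functional as F

LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def corrupt_in_gradient_domain(img: torch.Tensor, sigma: float = 0.1):
    """img: (batch, 1, H, W) grayscale; returns a noisy gradient map."""
    grad = F.conv2d(img, LAPLACIAN, padding=1)
    return grad + sigma * torch.randn_like(grad)

# a (placeholder) autoencoder would then be trained to recover the clean
# signal from the corrupted gradient map, e.g.:
#   loss = F.mse_loss(autoencoder(corrupt_in_gradient_domain(img)), img)
```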
- Curriculum By Smoothing [52.08553521577014]
Convolutional Neural Networks (CNNs) have shown impressive performance in computer vision tasks such as image classification, detection, and segmentation.
We propose an elegant curriculum-based scheme that smooths the feature embeddings of a CNN using anti-aliasing or low-pass filters.
As the amount of information in the feature maps increases during training, the network is able to progressively learn better representations of the data.
arXiv Detail & Related papers (2020-03-03T07:27:44Z)
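A minimal sketch of such a curriculum: feature maps pass through a depthwise Gaussian low-pass filter whose standard deviation is annealed towards zero during training, gradually admitting higher-frequency information. The kernel size and annealing schedule are assumptions.
```python
# Curriculum by smoothing: blur each feature-map channel with a Gaussian
# kernel whose std shrinks over training, so early epochs see heavily
# smoothed features and later epochs approach the unfiltered network.
import torch
import torch.nn.functional as F

def gaussian_kernel(sigma: float, size: int = 5) -> torch.Tensor:
    coords = torch.arange(size).float() - size // 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    return torch.outer(g, g)  # separable 2-D kernel, (size, size)

def smooth_features(x: torch.Tensor, sigma: float) -> torch.Tensor:
    """x: (batch, channels, H, W); blurs each channel independently."""
    c = x.size(1)
    k = gaussian_kernel(sigma).to(x).repeat(c, 1, 1, 1)  # (c, 1, 5, 5)
    return F.conv2d(x, k, padding=2, groups=c)           # depthwise blur

# assumed annealing schedule across epochs
for epoch in range(5):
    sigma = max(1.0 * 0.5 ** epoch, 1e-3)
    out = smooth_features(torch.randn(2, 8, 16, 16), sigma)
```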