Efficient Training of Audio Transformers with Patchout
- URL: http://arxiv.org/abs/2110.05069v1
- Date: Mon, 11 Oct 2021 08:07:50 GMT
- Title: Efficient Training of Audio Transformers with Patchout
- Authors: Khaled Koutini, Jan Schlüter, Hamid Eghbal-zadeh, Gerhard Widmer
- Abstract summary: We propose a novel method to optimize and regularize transformers on audio spectrograms.
The proposed models achieve a new state-of-the-art performance on Audioset and can be trained on a single consumer-grade GPU.
- Score: 7.073210405344709
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The great success of transformer-based models in natural language processing
(NLP) has led to various attempts at adapting these architectures to other
domains such as vision and audio. Recent work has shown that transformers can
outperform Convolutional Neural Networks (CNNs) on vision and audio tasks.
However, one of the main shortcomings of transformer models, compared to the
well-established CNNs, is the computational complexity. Compute and memory
complexity grow quadratically with the input length. Therefore, there has been
extensive work on optimizing transformers, but often at the cost of lower
predictive performance. In this work, we propose a novel method to optimize and
regularize transformers on audio spectrograms. The proposed models achieve a
new state-of-the-art performance on Audioset and can be trained on a single
consumer-grade GPU. Furthermore, we propose a transformer model that
outperforms CNNs in terms of both performance and training speed.
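As the title indicates, the proposed method is Patchout: during training, part of the sequence of spectrogram patches is dropped, so the transformer attends over a shorter sequence (reducing the quadratic cost) while the dropping itself acts as regularization. The sketch below is a minimal illustration assuming structured dropping of whole frequency rows and time columns; the shapes, drop counts, and function name are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def patchout(patches: torch.Tensor, f_drop: int = 4, t_drop: int = 40) -> torch.Tensor:
    """Structured Patchout sketch: randomly drop whole frequency rows and time
    columns from the grid of patch embeddings during training.

    patches: (B, F, T, D) patch embeddings (positional encodings already added).
    """
    B, Fr, T, D = patches.shape
    keep_f = torch.randperm(Fr)[: Fr - f_drop].sort().values  # frequency rows to keep
    keep_t = torch.randperm(T)[: T - t_drop].sort().values    # time columns to keep
    patches = patches[:, keep_f][:, :, keep_t]                 # (B, F', T', D)
    return patches.flatten(1, 2)                               # (B, F'*T', D): shorter sequence
```

At inference time the full patch sequence would typically be kept, so the saving applies to training.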
Related papers
- From Coarse to Fine: Efficient Training for Audio Spectrogram Transformers [16.90294414874585]
We introduce multi-phase training of audio spectrogram transformers by connecting the idea of coarse-to-fine processing with transformer models.
With this approach, the transformer learns from lower-resolution (coarse) data in the initial phases and is then fine-tuned on high-resolution data in later phases, following a curriculum learning strategy.
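A minimal sketch of such a coarse-to-fine curriculum, assuming the resolution change is done by simply down-sampling the mel spectrogram (the paper may instead vary mel bins or patch sizes); the phase count and scale factors are illustrative assumptions.

```python
import torch.nn.functional as F

def coarse_to_fine(spec, phase, num_phases=3):
    """Return the training input for a given curriculum phase (sketch).

    spec: (B, 1, F, T) mel spectrogram. Early phases use down-sampled (coarse)
    inputs; the final phase uses the full-resolution spectrogram.
    """
    scale = 0.5 ** (num_phases - 1 - phase)        # e.g. 0.25, 0.5, 1.0 for phases 0..2
    if scale < 1.0:
        spec = F.interpolate(spec, scale_factor=scale, mode="bilinear", align_corners=False)
    return spec                                     # fewer patches -> cheaper early phases
```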
arXiv Detail & Related papers (2024-01-16T14:59:37Z)
- Dynamic Convolutional Neural Networks as Efficient Pre-trained Audio Models [4.803510486360358]
Current popular Audio Spectrogram Transformers are demanding in terms of computational complexity compared to CNNs.
We introduce dynamic CNN blocks constructed of dynamic non-linearities, dynamic convolutions and attention mechanisms.
Our experiments indicate that the introduced dynamic CNNs achieve better performance on downstream tasks and scale up well.
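As a simplified sketch of one ingredient named above, a dynamic convolution mixes several candidate kernels with input-dependent attention weights; the channel counts, number of kernels, and class name are illustrative assumptions, not the paper's actual blocks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Dynamic convolution sketch: K candidate kernels mixed per example
    by attention weights computed from the input."""
    def __init__(self, in_ch, out_ch, kernel_size=3, num_kernels=4):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(num_kernels, out_ch, in_ch, kernel_size, kernel_size) * 0.02)
        self.attn = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(in_ch, num_kernels))
        self.kernel_size = kernel_size

    def forward(self, x):                                     # x: (B, C, F, T)
        b, c, f, t = x.shape
        o = self.weight.shape[1]
        a = self.attn(x).softmax(dim=-1)                      # (B, K) mixing weights
        w = torch.einsum("bk,koihw->boihw", a, self.weight)   # per-example kernels
        x = x.reshape(1, b * c, f, t)                         # grouped-conv trick
        w = w.reshape(b * o, c, self.kernel_size, self.kernel_size)
        y = F.conv2d(x, w, padding=self.kernel_size // 2, groups=b)
        return y.reshape(b, o, f, t)
```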
arXiv Detail & Related papers (2023-10-24T09:08:20Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- CHAPTER: Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models [62.60723685118747]
Self-supervised learning (SSL) is a powerful technique for learning representations from unlabeled data.
We propose an efficient tuning method specifically designed for SSL speech models, applying CNN adapters at the feature extractor.
We empirically find that adding CNN adapters to the feature extractor helps adaptation on emotion and speaker tasks.
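A minimal sketch of this kind of CNN adapter: a small bottleneck convolution with a residual connection, inserted after a layer of the (otherwise frozen) SSL feature extractor so that only the adapter is tuned. Channel sizes and the class name are illustrative assumptions.

```python
import torch.nn as nn

class ConvAdapter(nn.Module):
    """Lightweight 1-D CNN adapter with a residual connection (sketch)."""
    def __init__(self, channels, bottleneck=32, kernel_size=3):
        super().__init__()
        self.down = nn.Conv1d(channels, bottleneck, 1)         # project down
        self.conv = nn.Conv1d(bottleneck, bottleneck, kernel_size,
                              padding=kernel_size // 2)        # local temporal modelling
        self.up = nn.Conv1d(bottleneck, channels, 1)           # project back up
        self.act = nn.GELU()

    def forward(self, x):                                      # x: (B, C, T) extractor features
        return x + self.up(self.act(self.conv(self.act(self.down(x)))))
```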
arXiv Detail & Related papers (2022-12-01T08:50:12Z)
- Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation [6.617487928813374]
We propose a training procedure for efficient CNNs based on offline Knowledge Distillation (KD) from high-performing yet complex transformers.
We provide models of different complexity levels, scaling from low-complexity models up to a new state-of-the-art performance of 0.483 mAP on AudioSet.
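A rough sketch of an offline distillation objective of this kind, assuming a multi-label setup (as in AudioSet) where the teacher transformer's logits are precomputed and the student CNN mixes the label loss with a soft-target loss; the weighting and function name are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, alpha=0.5):
    """Offline knowledge-distillation sketch for multi-label audio tagging."""
    label_loss = F.binary_cross_entropy_with_logits(student_logits, targets)
    soft_loss = F.binary_cross_entropy_with_logits(
        student_logits, torch.sigmoid(teacher_logits))  # teacher probabilities as soft targets
    return alpha * label_loss + (1.0 - alpha) * soft_loss
```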
arXiv Detail & Related papers (2022-11-09T09:58:22Z)
- nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on empirical combination of self-attention and convolution.
nnFormer achieves tremendous improvements over previous transformer-based methods on two commonly used datasets, Synapse and ACDC.
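A toy sketch of the interleaving idea: convolution blocks alternate with self-attention blocks over the flattened 3-D volume. Dimensions, depth, and the specific block choices are illustrative assumptions rather than nnFormer's actual configuration.

```python
import torch.nn as nn

class InterleavedEncoder(nn.Module):
    """Alternating convolution and self-attention blocks for volumes (sketch)."""
    def __init__(self, dim=64, depth=4, heads=4):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Sequential(nn.Conv3d(dim, dim, 3, padding=1),
                          nn.InstanceNorm3d(dim), nn.GELU())
            for _ in range(depth)])
        self.attns = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            for _ in range(depth)])

    def forward(self, x):                            # x: (B, C, D, H, W)
        b, c, d, h, w = x.shape
        for conv, attn in zip(self.convs, self.attns):
            x = x + conv(x)                          # local modelling
            tokens = x.flatten(2).transpose(1, 2)    # (B, D*H*W, C)
            x = attn(tokens).transpose(1, 2).reshape(b, c, d, h, w)  # global modelling
        return x
```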
arXiv Detail & Related papers (2021-09-07T17:08:24Z)
- Audio Transformers: Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions [6.370905925442655]
We propose applying Transformer based architectures without convolutional layers to raw audio signals.
Our model outperforms convolutional models, producing state-of-the-art results.
We further improve the performance of Transformer architectures by using techniques such as pooling, inspired by convolutional networks.
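A compact sketch of the convolution-free setup and the pooling trick mentioned above: raw-audio frames are embedded linearly, and the token sequence is average-pooled between transformer stages as a convolution-style down-sampling. Frame sizes, widths, and the number of classes are illustrative assumptions.

```python
import torch.nn as nn

class PooledAudioTransformer(nn.Module):
    """Convolution-free transformer over raw-audio frames, with pooling between stages (sketch)."""
    def __init__(self, frame_size=400, dim=256, heads=4, stages=3, layers_per_stage=2,
                 num_classes=527):
        super().__init__()
        self.embed = nn.Linear(frame_size, dim)                # frame -> token, no convolutions
        self.stages = nn.ModuleList([
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True),
                num_layers=layers_per_stage)
            for _ in range(stages)])
        self.pool = nn.AvgPool1d(2)                            # halve sequence length per stage
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frames):                                 # frames: (B, T, frame_size)
        x = self.embed(frames)
        for stage in self.stages:
            x = stage(x)
            x = self.pool(x.transpose(1, 2)).transpose(1, 2)   # pool along time
        return self.head(x.mean(dim=1))                        # clip-level logits
```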
arXiv Detail & Related papers (2021-05-01T19:38:30Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
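A sketch of the mechanism that makes such a conversion possible: once softmax attention is replaced by a kernelized (linear) attention, the context can be carried in a fixed-size running state and evaluated step by step like an RNN. The feature map and shapes below are common choices assumed for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def linear_attention_recurrent(q, k, v):
    """Recurrent (RNN-style) evaluation of linear attention (sketch).

    q, k, v: (B, T, D). softmax(QK^T)V is replaced by phi(Q) (phi(K)^T V),
    which can be accumulated in a fixed-size state over time.
    """
    phi = lambda x: F.elu(x) + 1                     # a common positive feature map
    B, T, D = q.shape
    S = q.new_zeros(B, D, D)                         # running sum of phi(k_t) v_t^T
    z = q.new_zeros(B, D)                            # running normaliser
    outputs = []
    for t in range(T):
        kt, vt, qt = phi(k[:, t]), v[:, t], phi(q[:, t])
        S = S + kt.unsqueeze(-1) * vt.unsqueeze(1)   # (B, D, D) outer-product update
        z = z + kt
        num = torch.einsum("bd,bde->be", qt, S)
        den = (qt * z).sum(-1, keepdim=True) + 1e-6
        outputs.append(num / den)
    return torch.stack(outputs, dim=1)               # (B, T, D)
```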
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
- Parameter Efficient Multimodal Transformers for Video Representation Learning [108.8517364784009]
This work focuses on reducing the parameters of multimodal Transformers in the context of audio-visual video representation learning.
We show that our approach reduces parameters by up to 80%, allowing us to train our model end-to-end from scratch.
To demonstrate our approach, we pretrain our model on 30-second clips from Kinetics-700 and transfer it to audio-visual classification tasks.
arXiv Detail & Related papers (2020-12-08T00:16:13Z)
- Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently, Transformer and Convolutional Neural Network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR).
We propose the convolution-augmented transformer for speech recognition, named Conformer.
On the widely used LibriSpeech benchmark, our model achieves WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/test-other.
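As a sketch of the Conformer block structure (macaron feed-forward halves around self-attention and a convolution module), assuming typical dimensions; this is an illustrative re-implementation, not the reference code.

```python
import torch.nn as nn

class ConformerBlock(nn.Module):
    """Conformer block sketch: 1/2 FFN -> self-attention -> conv module -> 1/2 FFN."""
    def __init__(self, dim=256, heads=4, kernel_size=31, ff_mult=4):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim * ff_mult),
                                 nn.SiLU(), nn.Linear(dim * ff_mult, dim))
        self.ff1, self.ff2 = ffn(), ffn()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(dim)
        self.conv = nn.Sequential(                   # pointwise -> GLU -> depthwise -> pointwise
            nn.Conv1d(dim, 2 * dim, 1), nn.GLU(dim=1),
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),
            nn.BatchNorm1d(dim), nn.SiLU(), nn.Conv1d(dim, dim, 1))
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x):                            # x: (B, T, D)
        x = x + 0.5 * self.ff1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        c = self.conv_norm(x).transpose(1, 2)        # (B, D, T) for 1-D convs over time
        x = x + self.conv(c).transpose(1, 2)
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)
```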
arXiv Detail & Related papers (2020-05-16T20:56:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.