From Coarse to Fine: Efficient Training for Audio Spectrogram Transformers
- URL: http://arxiv.org/abs/2401.08415v1
- Date: Tue, 16 Jan 2024 14:59:37 GMT
- Title: From Coarse to Fine: Efficient Training for Audio Spectrogram Transformers
- Authors: Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak
- Abstract summary: We introduce multi-phase training of audio spectrogram transformers by connecting the idea of coarse-to-fine with transformer models.
By employing one of these methods, the transformer model learns from lower-resolution (coarse) data in the initial phases and is then fine-tuned on high-resolution data in a curriculum learning strategy.
- Score: 16.90294414874585
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers have become central to recent advances in audio classification.
However, training an audio spectrogram transformer, e.g. AST, from scratch can
be resource- and time-intensive. Furthermore, the complexity of transformers
heavily depends on the size of the input audio spectrogram. In this work, we
aim to optimize AST training by tying its cost to the resolution along the
time axis. We
introduce multi-phase training of audio spectrogram transformers by connecting
the seminal idea of coarse-to-fine with transformer models. To achieve this, we
propose a set of methods for temporal compression. By employing one of these
methods, the transformer model learns from lower-resolution (coarse) data in
the initial phases, and then is fine-tuned with high-resolution data later in a
curriculum learning strategy. Experimental results demonstrate that the
proposed training mechanism for AST leads to improved (or on-par) performance
with faster convergence, i.e. requiring fewer computational resources and less
time. This approach is also generalizable to other AST-based methods regardless
of their learning paradigms.
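As a concrete illustration, below is a minimal PyTorch sketch of the training mechanism described above. It assumes average pooling along the time axis as the temporal-compression operator; the paper proposes a set of such methods, and this operator, the function names (`compress_time`, `train_coarse_to_fine`), and the phase schedule are illustrative assumptions rather than the authors' exact recipe.

```python
# Minimal coarse-to-fine training sketch (illustrative, not the paper's
# exact method). Temporal compression here is plain average pooling over
# the time axis; the paper proposes several compression methods.
import torch
import torch.nn.functional as F


def compress_time(spec: torch.Tensor, factor: int) -> torch.Tensor:
    """Downsample a (batch, freq_bins, time_frames) spectrogram along
    the time axis by an integer factor via average pooling."""
    if factor == 1:
        return spec
    # avg_pool1d treats freq_bins as channels and pools over time_frames.
    return F.avg_pool1d(spec, kernel_size=factor, stride=factor)


def train_coarse_to_fine(model, loader, optimizer, loss_fn, phases):
    """Multi-phase curriculum: `phases` is a list of
    (compression_factor, num_epochs) pairs ordered coarse to fine,
    e.g. [(4, 3), (2, 3), (1, 4)], finishing at full resolution."""
    for factor, num_epochs in phases:
        for _ in range(num_epochs):
            for spec, target in loader:  # spec: (batch, freq, time)
                optimizer.zero_grad()
                logits = model(compress_time(spec, factor))
                loss_fn(logits, target).backward()
                optimizer.step()
```

Note that a standard AST holds fixed-size positional embeddings, so changing the input's time length between phases would also require resizing (e.g. interpolating) the time-axis positional embeddings; the abstract does not spell out how the paper handles this, so the sketch leaves it inside `model`.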
Related papers
- ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions [15.472819870523093]
Transformer-based models, such as the Audio Spectrogram Transformers (AST), inherit the fixed-size input paradigm from CNNs.
This paper introduces an approach that enables the use of variable-length audio inputs with AST models during both training and inference.
arXiv Detail & Related papers (2024-07-11T17:29:56Z)
- Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show, for the first time, that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
- A Survey on Efficient Training of Transformers [72.31868024970674]
This survey provides the first systematic overview of the efficient training of Transformers.
We analyze and compare methods that save computation and memory costs for intermediate tensors during training, together with techniques on hardware/algorithm co-design.
arXiv Detail & Related papers (2023-02-02T13:58:18Z)
- Transform Once: Efficient Operator Learning in Frequency Domain [69.74509540521397]
We study deep neural networks designed to harness the structure in frequency domain for efficient learning of long-range correlations in space or time.
This work introduces a blueprint for frequency domain learning through a single transform: transform once (T1).
arXiv Detail & Related papers (2022-11-26T01:56:05Z)
- MAST: Multiscale Audio Spectrogram Transformers [53.06337011259031]
We present Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST).
In practice, MAST significantly outperforms AST by an average accuracy of 3.4% across 8 speech and non-speech tasks from the LAPE Benchmark.
arXiv Detail & Related papers (2022-11-02T23:34:12Z)
- SSAST: Self-Supervised Audio Spectrogram Transformer [19.09439093130855]
We propose to pretrain the Audio Spectrogram Transformer (AST) model with joint discriminative and generative masked spectrogram patch modeling (MSPM) using unlabeled audio.
We evaluate our pretrained models on both audio and speech classification tasks including audio event classification, keyword spotting, emotion recognition, and speaker identification.
To the best of our knowledge, it is the first patch-based self-supervised learning framework in the audio and speech domain, and also the first self-supervised learning framework for AST.
arXiv Detail & Related papers (2021-10-19T07:58:28Z)
- Study of positional encoding approaches for Audio Spectrogram Transformers [16.829474982595837]
In this paper, we study one component of the Audio Spectrogram Transformer (AST) and propose several variants to improve its performance.
Our best model, which incorporates conditional positional encodings, significantly improves performance on AudioSet and ESC-50 compared to the original AST.
arXiv Detail & Related papers (2021-10-13T19:20:20Z)
- Efficient Training of Audio Transformers with Patchout [7.073210405344709]
We propose a novel method to optimize and regularize transformers on audio spectrograms.
The proposed models achieve a new state-of-the-art performance on AudioSet and can be trained on a single consumer-grade GPU.
arXiv Detail & Related papers (2021-10-11T08:07:50Z)
- Audiomer: A Convolutional Transformer for Keyword Spotting [0.0]
We introduce Audiomer, where we combine 1D Residual Networks with Performer Attention to achieve state-of-the-art performance in Keyword Spotting.
Audiomer allows for deployment in compute-constrained devices and training on smaller datasets.
arXiv Detail & Related papers (2021-09-21T15:28:41Z)
- Spatiotemporal Transformer for Video-based Person Re-identification [102.58619642363958]
We show that, despite its strong learning ability, the vanilla Transformer suffers from an increased risk of over-fitting.
We propose a novel pipeline where the model is pre-trained on a set of synthesized video data and then transferred to the downstream domains.
The derived algorithm achieves significant accuracy gain on three popular video-based person re-identification benchmarks.
arXiv Detail & Related papers (2021-03-30T16:19:27Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)