FlexiAST: Flexibility is What AST Needs
- URL: http://arxiv.org/abs/2307.09286v1
- Date: Tue, 18 Jul 2023 14:30:47 GMT
- Title: FlexiAST: Flexibility is What AST Needs
- Authors: Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak
- Abstract summary: The objective of this work is to give patch-size flexibility to Audio Spectrogram Transformers (AST).
Recent advancements in ASTs have shown superior performance in various audio-based tasks.
- Score: 21.07980558948832
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The objective of this work is to give patch-size flexibility to Audio
Spectrogram Transformers (AST). Recent advancements in ASTs have shown superior
performance in various audio-based tasks. However, the performance of standard
ASTs degrades drastically when they are evaluated with a patch size different
from the one used during training. As a result, AST models are typically re-trained to
accommodate changes in patch sizes. To overcome this limitation, this paper
proposes a training procedure to provide flexibility to standard AST models
without architectural changes, allowing them to work with various patch sizes
at the inference stage - FlexiAST. This proposed training approach simply
utilizes random patch size selection and resizing of patch and positional
embedding weights. Our experiments show that FlexiAST delivers performance
comparable to standard AST models while retaining the ability to be evaluated
at various patch sizes on different audio classification datasets.
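As a rough illustration of the procedure the abstract describes, the sketch below shows one way such a training step could look in PyTorch. It is not the authors' released code: the patch-size candidates, the bilinear/linear resizing, and all class and attribute names are illustrative assumptions. Each step samples a patch size at random, resizes the patch-embedding kernel and positional embeddings to match, and then runs an ordinary forward/backward pass.

```python
# Minimal, illustrative sketch of FlexiAST-style training (assumptions, not the paper's API).
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCH_SIZES = [8, 16, 32]            # illustrative candidate square patch sizes
BASE_PATCH, DIM, N_CLASSES = 16, 192, 50


class TinyFlexiAST(nn.Module):
    """Toy stand-in for an AST: learnable patch kernel + positional embeddings + encoder."""

    def __init__(self, freq_bins=128, time_frames=128):
        super().__init__()
        # Underlying parameters are stored at a base patch size / base token count.
        self.patch_weight = nn.Parameter(torch.randn(DIM, 1, BASE_PATCH, BASE_PATCH) * 0.02)
        base_tokens = (freq_bins // BASE_PATCH) * (time_frames // BASE_PATCH)
        self.pos_embed = nn.Parameter(torch.zeros(1, base_tokens, DIM))
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, N_CLASSES)

    def forward(self, spec, patch_size):
        # spec: (batch, 1, freq, time) log-mel spectrogram.
        # Resize the patch kernel and positional embeddings to the sampled patch size
        # (bilinear for the kernel, 1-D linear for the token sequence, as a simplification).
        w = F.interpolate(self.patch_weight, size=(patch_size, patch_size),
                          mode="bilinear", align_corners=False)
        n_tokens = (spec.shape[-2] // patch_size) * (spec.shape[-1] // patch_size)
        pos = F.interpolate(self.pos_embed.transpose(1, 2), size=n_tokens,
                            mode="linear", align_corners=False).transpose(1, 2)
        # Patchify with the resized kernel, then run the transformer and a mean-pooled head.
        tokens = F.conv2d(spec, w, stride=patch_size).flatten(2).transpose(1, 2)
        return self.head(self.encoder(tokens + pos).mean(dim=1))


def flexiast_training_step(model, spec, labels, optimizer, criterion):
    """One FlexiAST-style step: sample a random patch size, then a standard AST pass."""
    p = random.choice(PATCH_SIZES)
    loss = criterion(model(spec, patch_size=p), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Example usage (hypothetical shapes):
# model = TinyFlexiAST(); opt = torch.optim.AdamW(model.parameters(), 1e-4)
# spec = torch.randn(8, 1, 128, 128); labels = torch.randint(0, N_CLASSES, (8,))
# flexiast_training_step(model, spec, labels, opt, nn.CrossEntropyLoss())
```

At inference time, the same resizing step lets a single trained checkpoint be evaluated at whichever of the candidate patch sizes suits the compute budget.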
Related papers
- ElasticAST: An Audio Spectrogram Transformer for All Length and Resolutions [15.472819870523093]
Transformer-based models, such as the Audio Spectrogram Transformers (AST), inherit the fixed-size input paradigm from CNNs.
This paper introduces an approach that enables the use of variable-length audio inputs with AST models during both training and inference.
arXiv Detail & Related papers (2024-07-11T17:29:56Z)
- Flextron: Many-in-One Flexible Large Language Model [85.93260172698398]
We introduce Flextron, a network architecture and post-training model optimization framework supporting flexible model deployment.
We present a sample-efficient training method and associated routing algorithms for transforming an existing trained LLM into a Flextron model.
We demonstrate superior performance over multiple end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes a mere 7.63% of the tokens used in the original pretraining.
arXiv Detail & Related papers (2024-06-11T01:16:10Z)
- Test-Time Model Adaptation with Only Forward Passes [68.11784295706995]
Test-time adaptation has proven effective in adapting a given trained model to unseen test samples with potential distribution shifts.
We propose a test-time Forward-Optimization Adaptation (FOA) method.
FOA runs on a quantized 8-bit ViT, outperforms gradient-based TENT on a full-precision 32-bit ViT, and achieves up to a 24-fold memory reduction on ImageNet-C.
arXiv Detail & Related papers (2024-04-02T05:34:33Z)
- Efficient Stitchable Task Adaptation [47.94819192325723]
We present a novel framework, Efficient Stitchable Task Adaptation (ESTA), to efficiently produce a palette of fine-tuned models.
Specifically, we first tailor parameter-efficient fine-tuning to share low-rank updates among the stitches.
We streamline a simple yet effective one-stage deployment pipeline, which estimates the important stitches to deploy.
arXiv Detail & Related papers (2023-11-29T04:31:35Z)
- Free Lunch: Robust Cross-Lingual Transfer via Model Checkpoint Averaging [60.79382212029304]
Massively multilingual language models have displayed strong performance in zero-shot (ZS-XLT) and few-shot (FS-XLT) cross-lingual transfer setups.
We propose a simple and effective method that averages different checkpoints (i.e., model snapshots) during task fine-tuning.
arXiv Detail & Related papers (2023-05-26T11:24:32Z)
- TOAST: Transfer Learning via Attention Steering [77.83191769502763]
Current transfer learning methods often fail to focus on task-relevant features.
We introduce Top-Down Attention Steering (TOAST), a novel transfer learning algorithm that steers the attention to task-specific features.
TOAST substantially improves performance across a range of fine-grained visual classification datasets.
arXiv Detail & Related papers (2023-05-24T20:03:04Z)
- LAST: Scalable Lattice-Based Speech Modelling in JAX [11.682949982063477]
We introduce LAST, a LAttice-based Speech Transducer library in JAX.
LAST implements differentiable weighted finite-state automaton (WFSA) algorithms needed for training and inference that scale to large WFSAs.
We describe a suite of generally applicable techniques employed in LAST to address these challenges, and demonstrate their effectiveness with benchmarks on TPUv3 and V100 GPU.
arXiv Detail & Related papers (2023-04-25T20:25:37Z)
- FlexiViT: One Model for All Patch Sizes [100.52574011880571]
Vision Transformers convert images to sequences by slicing them into patches.
The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost.
We show that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes (a sketch of the underlying kernel resizing appears after this list).
arXiv Detail & Related papers (2022-12-15T18:18:38Z)
- SSAST: Self-Supervised Audio Spectrogram Transformer [19.09439093130855]
We propose to pretrain the Audio Spectrogram Transformer (AST) model with joint discriminative and generative masked spectrogram patch modeling (MSPM) using unlabeled audio.
We evaluate our pretrained models on both audio and speech classification tasks including audio event classification, keyword spotting, emotion recognition, and speaker identification.
To the best of our knowledge, it is the first patch-based self-supervised learning framework in the audio and speech domain, and also the first self-supervised learning framework for AST.
arXiv Detail & Related papers (2021-10-19T07:58:28Z)
- Study of positional encoding approaches for Audio Spectrogram Transformers [16.829474982595837]
In this paper, we study one component of the Audio Spectrogram Transformer (AST) and propose several variants to improve its performance.
Our best model, which incorporates conditional positional encodings, significantly improves performance on Audioset and ESC-50 compared to the original AST.
arXiv Detail & Related papers (2021-10-13T19:20:20Z)
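Returning to the FlexiViT entry above, whose recipe the FlexiAST abstract mirrors for audio: when the patch size changes, the patch-embedding kernel has to be resized, and FlexiViT proposes a pseudo-inverse ("PI") resize so that tokens computed from resized patches approximate the original ones. The snippet below is a minimal, hedged sketch of that idea in PyTorch, not the released implementation; the helper name and the choice of a bilinear base resize are assumptions.

```python
# Hedged sketch of a pseudo-inverse (PI) patch-kernel resize in the spirit of FlexiViT.
import torch
import torch.nn.functional as F

def pi_resize_patch_kernel(w: torch.Tensor, new_size: int) -> torch.Tensor:
    """Resize a (out_ch, in_ch, p, p) patch-embedding kernel to (out_ch, in_ch, new, new)."""
    out_ch, in_ch, p, _ = w.shape
    # Column i of B is the bilinear resize of the i-th basis image, flattened.
    eye = torch.eye(p * p, dtype=w.dtype).reshape(p * p, 1, p, p)
    B = F.interpolate(eye, size=(new_size, new_size), mode="bilinear",
                      align_corners=False).reshape(p * p, -1).T   # (new*new, p*p)
    # We want <x, w> ~= <B x, w_hat> for patches x, i.e. B.T @ w_hat = w.
    # The pseudo-inverse gives the least-squares / minimum-norm solution:
    # w_hat = pinv(B.T) @ w, applied to each flattened kernel.
    P = torch.linalg.pinv(B.T)                                    # (new*new, p*p)
    w_flat = w.reshape(out_ch * in_ch, p * p)
    return (w_flat @ P.T).reshape(out_ch, in_ch, new_size, new_size)

# Example (hypothetical shapes): resize a 16x16 kernel to 8x8 for coarser patches.
# w8 = pi_resize_patch_kernel(torch.randn(192, 1, 16, 16), 8)
```

A plain bilinear resize of the kernel (as in the training sketch after the abstract above) is a simpler alternative; the pseudo-inverse version better preserves the inner products between input patches and the kernel when the patch size changes.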
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.