Multiscale Audio Spectrogram Transformer for Efficient Audio
Classification
- URL: http://arxiv.org/abs/2303.10757v1
- Date: Sun, 19 Mar 2023 20:21:29 GMT
- Title: Multiscale Audio Spectrogram Transformer for Efficient Audio
Classification
- Authors: Wentao Zhu, Mohamed Omar
- Abstract summary: We develop a multiscale audio spectrogram Transformer (MAST) that employs hierarchical representation learning for efficient audio classification.
Specifically, MAST employs one-dimensional (and two-dimensional) pooling operators along the time (and frequency) domains in different stages, and progressively reduces the number of tokens while increasing the feature dimensions.
- Score: 1.797470734877199
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio events have a hierarchical structure in both time and
frequency and can be grouped together to construct more abstract semantic
audio classes. In this work, we develop a multiscale audio spectrogram
Transformer (MAST) that employs hierarchical representation learning for
efficient audio classification. Specifically, MAST employs one-dimensional
(and two-dimensional) pooling operators along the time (and frequency)
domains in different stages, and progressively reduces the number of tokens
while increasing the feature dimensions. MAST significantly outperforms AST
[Gong et al., 2021] by 22.2%, 4.4% and 4.7% top-1 accuracy on
Kinetics-Sounds, Epic-Kitchens-100 and VGGSound without external training
data. On the downloaded AudioSet dataset, which is missing over 20% of its
audio clips, MAST also achieves slightly better accuracy than AST. In
addition, MAST is 5x more efficient in terms of multiply-accumulate
operations (MACs), with a 42% reduction in the number of parameters compared
to AST. Through clustering metrics and visualizations, we demonstrate that
MAST learns semantically more separable feature representations from audio
signals.
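As an illustration of the pooling scheme the abstract describes, the sketch below implements one hierarchical stage in PyTorch: tokens are attended over, pooled along the time axis with a stride-2 operator, and projected to a wider channel dimension. All names, layer choices, and dimensions are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PoolingStage(nn.Module):
    """One hierarchical stage: self-attend over tokens, then pool them
    along time and widen the channels (illustrative sketch only)."""
    def __init__(self, dim_in, dim_out, pool_stride=2, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim_in)
        self.attn = nn.MultiheadAttention(dim_in, num_heads, batch_first=True)
        # Strided pooling reduces the number of time tokens per stage.
        self.pool = nn.AvgPool1d(kernel_size=pool_stride, stride=pool_stride)
        # Linear projection increases the feature dimension as tokens shrink.
        self.proj = nn.Linear(dim_in, dim_out)

    def forward(self, x):                 # x: (batch, tokens, dim_in)
        h = self.norm(x)
        h, _ = self.attn(h, h, h)
        x = x + h                          # residual connection
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)
        return self.proj(x)                # (batch, tokens // stride, dim_out)

# Example: 512 time tokens shrink to 256 while channels grow 96 -> 192.
stage = PoolingStage(dim_in=96, dim_out=192)
print(stage(torch.randn(2, 512, 96)).shape)  # torch.Size([2, 256, 192])
```

Stacking several such stages yields the progressive token-reduction and channel-widening hierarchy described above.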
Related papers
- Taming Data and Transformers for Audio Generation [49.54707963286065]
AutoCap is a high-quality and efficient automatic audio captioning model.
GenAu is a scalable transformer-based audio generation architecture.
We compile 57M ambient audio clips, forming AutoReCap-XL, the largest available audio-text dataset.
arXiv Detail & Related papers (2024-06-27T17:58:54Z)
- Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification [6.341420717393898]
We develop a novel multiscale audio Transformer (MAT) and a multiscale video Transformer (MMT).
The proposed MAT significantly outperforms AST [28] by 22.2%, 4.4% and 4.7% on three public benchmark datasets.
It is about 3% more efficient in terms of FLOPs and 9.8% more efficient in terms of GPU memory usage.
arXiv Detail & Related papers (2024-01-08T17:02:25Z)
- AudioFormer: Audio Transformer learns audio feature representations from discrete acoustic codes [6.375996974877916]
We propose a method named AudioFormer, which learns audio feature representations through the acquisition of discrete acoustic codes.
Our research outcomes demonstrate that AudioFormer attains significantly improved performance compared to prevailing monomodal audio classification models.
arXiv Detail & Related papers (2023-08-14T15:47:25Z)
- Transformer-based Sequence Labeling for Audio Classification based on MFCCs [0.0]
This paper proposes a Transformer-encoder-based model for audio classification using MFCCs.
The model was benchmarked against the ESC-50, Speech Commands v0.02 and UrbanSound8k datasets and has shown strong performance.
The model consists of a mere 127,544 total parameters, making it lightweight yet highly efficient at the audio classification task; a sketch of the pipeline follows below.
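A minimal sketch of the pipeline this summary describes, assuming torchaudio for MFCC extraction and a small PyTorch Transformer encoder; the hyperparameters (and hence the parameter count) are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torchaudio

class MFCCTransformerClassifier(nn.Module):
    """MFCC frames -> small Transformer encoder -> class logits.
    Hyperparameters are placeholders, not the paper's configuration."""
    def __init__(self, n_mfcc=40, d_model=64, num_classes=50):
        super().__init__()
        self.embed = nn.Linear(n_mfcc, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, feats):              # feats: (batch, frames, n_mfcc)
        h = self.encoder(self.embed(feats))
        return self.head(h.mean(dim=1))    # average-pool over time frames

mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=40)
waveform = torch.randn(1, 16000)            # one second of dummy audio
feats = mfcc(waveform).transpose(1, 2)      # (batch, frames, n_mfcc)
logits = MFCCTransformerClassifier()(feats)
print(logits.shape)                          # torch.Size([1, 50])
```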
arXiv Detail & Related papers (2023-04-30T07:25:43Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- MAST: Multiscale Audio Spectrogram Transformers [53.06337011259031]
We present Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST).
In practice, MAST significantly outperforms AST by an average accuracy of 3.4% across 8 speech and non-speech tasks from the LAPE Benchmark.
arXiv Detail & Related papers (2022-11-02T23:34:12Z)
- High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
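For intuition about the "quantized latent space", here is a minimal single-codebook vector-quantization sketch in PyTorch: a nearest-neighbour codebook lookup with a straight-through gradient. It is an illustration of the general technique, not the paper's residual quantizer, and all names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient;
    a single-codebook sketch, not the paper's residual quantizer."""
    def __init__(self, codebook_size=1024, dim=128):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                            # z: (batch, steps, dim)
        flat = z.reshape(-1, z.size(-1))              # (batch * steps, dim)
        dists = torch.cdist(flat, self.codebook.weight)
        idx = dists.argmin(dim=-1).view(z.shape[:-1]) # discrete code indices
        zq = self.codebook(idx)                       # quantized latents
        # Straight-through estimator: gradients bypass the argmin.
        return z + (zq - z).detach(), idx

vq = VectorQuantizer()
z = torch.randn(2, 75, 128)        # e.g., 75 latent steps per second
zq, codes = vq(z)
print(zq.shape, codes.shape)       # (2, 75, 128) and (2, 75)
```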
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
- Masked Autoencoders that Listen [79.99280830830854]
This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms.
Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.
The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram.
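A minimal sketch of the high-masking-ratio step this summary describes, assuming pre-patchified spectrogram tokens; the function name and the 80% ratio are illustrative, not the Audio-MAE code.

```python
import torch

def random_mask(tokens, mask_ratio=0.8):
    """Keep a random subset of patch tokens under a high masking ratio;
    a sketch of MAE-style masking, not the Audio-MAE code."""
    batch, num_tokens, dim = tokens.shape
    num_keep = int(num_tokens * (1 - mask_ratio))
    noise = torch.rand(batch, num_tokens)        # random score per token
    keep = noise.argsort(dim=1)[:, :num_keep]    # indices of kept tokens
    kept = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, dim))
    return kept, keep                            # encoder sees only `kept`

patches = torch.randn(4, 512, 768)   # e.g., 512 spectrogram patch tokens
kept, keep_idx = random_mask(patches)
print(kept.shape)                     # torch.Size([4, 102, 768])
```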
arXiv Detail & Related papers (2022-07-13T17:59:55Z)
- MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition [54.84624870942339]
MixSpeech is a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR).
We apply MixSpeech on two popular end-to-end speech recognition models including LAS (Listen, Attend and Spell) and Transformer.
Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation.
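For context, a minimal sketch of the general mixup recipe applied to acoustic features; MixSpeech itself combines the recognition losses of two label sequences, so this is an illustration of the technique, not the paper's implementation.

```python
import torch

def mixspeech_mix(x1, x2, alpha=0.5):
    """Interpolate two acoustic feature tensors; the ASR losses are then
    combined as lam * loss(x, y1) + (1 - lam) * loss(x, y2)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    return lam * x1 + (1 - lam) * x2, lam

feat_a = torch.randn(1, 100, 80)    # (batch, frames, mel bins)
feat_b = torch.randn(1, 100, 80)
mixed, lam = mixspeech_mix(feat_a, feat_b)
print(mixed.shape, round(lam, 2))   # torch.Size([1, 100, 80]), lam in (0, 1)
```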
arXiv Detail & Related papers (2021-02-25T03:40:43Z)
- Multiresolution and Multimodal Speech Recognition with Transformers [22.995102995029576]
This paper presents an audio visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture.
We focus on the scene context provided by the visual information, to ground the ASR.
Our results are comparable to state-of-the-art Listen, Attend and Spell-based architectures.
arXiv Detail & Related papers (2020-04-29T09:32:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.