MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
- URL: http://arxiv.org/abs/2203.16691v1
- Date: Wed, 30 Mar 2022 22:06:13 GMT
- Title: MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
- Authors: Alan Baade, Puyuan Peng, David Harwath
- Abstract summary: We propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification.
We leverage the insight that the SSAST uses a very high masking ratio (75%) during pretraining, meaning that the vast majority of self-attention compute is performed on mask tokens.
We find that MAE-like pretraining can provide a 3x speedup and 2x memory usage reduction over the vanilla SSAST.
- Score: 11.814012909512307
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a simple yet powerful improvement over the recent
Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and
audio classification. Specifically, we leverage the insight that the SSAST uses
a very high masking ratio (75%) during pretraining, meaning that the vast
majority of self-attention compute is performed on mask tokens. We address this
by integrating the encoder-decoder architecture from Masked Autoencoders are
Scalable Vision Learners (MAE) into the SSAST, where a deep encoder operates on
only unmasked input, and a shallow decoder operates on encoder outputs and mask
tokens. We find that MAE-like pretraining can provide a 3x speedup and 2x
memory usage reduction over the vanilla SSAST using current audio pretraining
strategies with ordinary model and input sizes. When fine-tuning on downstream
tasks, which only uses the encoder, we find that our approach outperforms the
SSAST on a variety of downstream tasks. We further conduct comprehensive
evaluations into different strategies of pretraining and explore differences in
MAE-style pretraining between the visual and audio domains.
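The architectural change described in the abstract is easy to state in code: a deep encoder runs only on the visible (unmasked) spectrogram patches, and a shallow decoder receives the encoder outputs together with learned mask tokens and reconstructs the masked patches. Below is a minimal PyTorch-style sketch of that idea, not the authors' implementation; the 75% masking ratio and the deep-encoder/shallow-decoder split come from the abstract, while all module names, sizes, and other hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MAEStyleSpectrogramPretrainer(nn.Module):
    """Minimal sketch of MAE-style pretraining on spectrogram patches.

    Illustrative only: a deep encoder sees just the visible patches, and a
    shallow decoder sees the encoder outputs plus learned mask tokens.
    """

    def __init__(self, patch_dim=256, dim=768, enc_layers=12, dec_layers=2,
                 num_patches=512, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        make_layer = lambda: nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(make_layer(), num_layers=enc_layers)
        self.decoder = nn.TransformerEncoder(make_layer(), num_layers=dec_layers)
        self.reconstruct = nn.Linear(dim, patch_dim)

    def forward(self, patches):
        # patches: (batch, num_patches, patch_dim) flattened spectrogram patches
        B, N, _ = patches.shape
        x = self.patch_embed(patches) + self.pos_embed[:, :N]

        # Keep a random (1 - mask_ratio) subset of patches for the encoder.
        num_keep = int(N * (1 - self.mask_ratio))
        perm = torch.rand(B, N, device=x.device).argsort(dim=1)
        keep_idx, mask_idx = perm[:, :num_keep], perm[:, num_keep:]
        gather = lambda t, idx: torch.gather(
            t, 1, idx.unsqueeze(-1).expand(-1, -1, t.size(-1)))

        # Deep encoder runs on visible tokens only (the source of the speedup).
        encoded = self.encoder(gather(x, keep_idx))

        # Shallow decoder sees encoder outputs plus positioned mask tokens.
        mask_tokens = self.mask_token.expand(B, N - num_keep, -1) + gather(
            self.pos_embed.expand(B, -1, -1), mask_idx)
        decoded = self.decoder(torch.cat([encoded, mask_tokens], dim=1))

        # Reconstruct only the masked patches; score with mean-squared error.
        pred = self.reconstruct(decoded[:, num_keep:])
        return F.mse_loss(pred, gather(patches, mask_idx))
```

For downstream fine-tuning, as the abstract notes, only the encoder (with the patch embedding and positional embeddings) is kept, typically with a pooling step and a task head on top; the shallow decoder and mask token are discarded after pretraining.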
Related papers
- Extending Video Masked Autoencoders to 128 frames [75.01251612160829]
Video understanding has witnessed significant progress with recent video foundation models demonstrating strong performance owing to self-supervised pre-training objectives; Masked Autoencoders (MAE) being the design of choice.
However, the majority of prior works that leverage MAE pre-training have focused on relatively short video representations (16 / 32 frames in length) largely due to hardware memory and compute limitations that scale poorly with video length due to the dense memory-intensive self-attention decoding.
We propose an effective strategy for prioritizing tokens which allows training on longer video sequences (128 frames) and achieves better performance than more typical random masking strategies.
arXiv Detail & Related papers (2024-11-20T20:00:38Z)
- Rethinking Patch Dependence for Masked Autoencoders [92.37365660775171]
We re-examine inter-patch dependencies in the decoding mechanism of masked autoencoders (MAE).
We propose a novel pretraining framework: Cross-Attention Masked Autoencoders (CrossMAE).
arXiv Detail & Related papers (2024-01-25T18:49:57Z)
- Drop your Decoder: Pre-training with Bag-of-Word Prediction for Dense Passage Retrieval [26.00149743478937]
Masked auto-encoder pre-training has emerged as a prevalent technique for initializing and enhancing dense retrieval systems.
We propose a modification to the traditional MAE by replacing the decoder of a masked auto-encoder with a completely simplified Bag-of-Word prediction task.
Our proposed method achieves state-of-the-art retrieval performance on several large-scale retrieval benchmarks without requiring any additional parameters.
arXiv Detail & Related papers (2024-01-20T15:02:33Z)
- Masked Audio Generation using a Single Non-Autoregressive Transformer [90.11646612273965]
MAGNeT is a masked generative sequence modeling method that operates directly over several streams of audio tokens.
We demonstrate the efficiency of MAGNeT for the task of text-to-music and text-to-audio generation.
We shed light on the importance of each of the components comprising MAGNeT and point to the trade-offs between autoregressive and non-autoregressive modeling.
arXiv Detail & Related papers (2024-01-09T14:29:39Z)
- Regress Before Construct: Regress Autoencoder for Point Cloud Self-supervised Learning [18.10704604275133]
Masked Autoencoders (MAE) have demonstrated promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Point Regress AutoEncoder (Point-RAE), a new scheme for regressive autoencoders for point cloud self-supervised learning.
Our approach is efficient during pre-training and generalizes well on various downstream tasks.
arXiv Detail & Related papers (2023-09-25T17:23:33Z)
- Masked Autoencoders that Listen [79.99280830830854]
This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms.
Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.
The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram.
arXiv Detail & Related papers (2022-07-13T17:59:55Z)
- RetroMAE: Pre-training Retrieval-oriented Transformers via Masked Auto-Encoder [15.24707645921207]
We propose a novel pre-training framework for dense retrieval based on the Masked Auto-Encoder, known as RetroMAE.
We pre-train a BERT-like encoder on English Wikipedia and BookCorpus, where it notably outperforms the existing pre-trained models on a wide range of dense retrieval benchmarks.
arXiv Detail & Related papers (2022-05-24T12:43:04Z)
- Masked Autoencoders As Spatiotemporal Learners [60.83955416682043]
This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to spatiotemporal representation learning from videos.
We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels.
We observe that the optimal masking ratio is as high as 90% (vs. 75% on images), supporting the hypothesis that this ratio is related to information redundancy of the data.
arXiv Detail & Related papers (2022-05-18T17:59:59Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling an asymmetric encoder-decoder design with a high masking ratio enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
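A recurring point across these entries, and the MAE-AST abstract above, is that very high masking ratios (around 75% for images and audio, up to 90% for video) mean most tokens never need to pass through the deep encoder. As a rough back-of-the-envelope illustration, not a measurement from any of the papers: self-attention cost grows roughly quadratically with the number of tokens the encoder actually processes, so encoding only the visible fraction cuts attention compute sharply, while feed-forward cost shrinks only linearly. End-to-end speedups such as the 3x reported for MAE-AST are smaller than the attention ratio alone because of these linear terms and the added decoder.

```python
def relative_encoder_cost(mask_ratio: float) -> dict:
    """Rough relative cost of running the deep encoder on visible tokens only,
    compared with running it on the full sequence including mask tokens.

    Illustrative arithmetic only: real speedups also depend on model width,
    sequence length, decoder depth, and implementation details.
    """
    visible = 1.0 - mask_ratio
    return {
        "self-attention (~quadratic in token count)": visible ** 2,
        "feed-forward (~linear in token count)": visible,
    }


for ratio in (0.75, 0.90):  # masking ratios cited above for audio/images and video
    print(ratio, relative_encoder_cost(ratio))
# 0.75 -> attention ~0.06x, feed-forward 0.25x of the full-sequence cost
# 0.90 -> attention ~0.01x, feed-forward 0.10x of the full-sequence cost
```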
This list is automatically generated from the titles and abstracts of the papers on this site.