EAT: Self-Supervised Pre-Training with Efficient Audio Transformer
- URL: http://arxiv.org/abs/2401.03497v1
- Date: Sun, 7 Jan 2024 14:31:27 GMT
- Title: EAT: Self-Supervised Pre-Training with Efficient Audio Transformer
- Authors: Wenxi Chen, Yuzhe Liang, Ziyang Ma, Zhisheng Zheng, Xie Chen
- Abstract summary: Efficient Audio Transformer (EAT) is inspired by the success of data2vec 2.0 in image modality and Audio-MAE in audio modality.
A novel Utterance-Frame Objective (UFO) is designed to enhance the modeling capability of acoustic events.
Experiment results demonstrate that EAT achieves state-of-the-art (SOTA) performance on a range of audio-related tasks.
- Score: 2.443213094810588
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio self-supervised learning (SSL) pre-training, which aims to learn good
representations from unlabeled audio, has made remarkable progress. However,
the extensive computational demands during pre-training pose a significant
barrier to the potential application and optimization of audio SSL models. In
this paper, inspired by the success of data2vec 2.0 in image modality and
Audio-MAE in audio modality, we introduce Efficient Audio Transformer (EAT) to
further improve the effectiveness and efficiency of audio SSL. The proposed EAT
adapts the bootstrap self-supervised training paradigm to the audio domain. A
novel Utterance-Frame Objective (UFO) is designed to enhance the modeling
capability of acoustic events. Furthermore, we reveal that the masking strategy
is critical in audio SSL pre-training, and superior audio representations can
be obtained with large inverse block masks. Experimental results demonstrate that
EAT achieves state-of-the-art (SOTA) performance on a range of audio-related
tasks, including AudioSet (AS-2M, AS-20K), ESC-50, and SPC-2, along with a
significant pre-training speedup of up to ~15x compared to existing audio SSL
models.
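The abstract points to two concrete mechanisms: large inverse block masking of spectrogram patches and an Utterance-Frame Objective (UFO) that combines a clip-level loss with a patch-level loss against targets from a bootstrapped (EMA) teacher, following data2vec 2.0. The paper's exact formulation is not reproduced here; the snippet below is only a minimal PyTorch sketch under assumed shapes, and the names make_inverse_block_mask, ufo_loss, and the weighting lam are illustrative rather than taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def make_inverse_block_mask(grid_h, grid_w, block=5, keep_ratio=0.2):
    """Inverse block masking (sketch): instead of masking random blocks,
    keep a few large rectangular blocks of patches visible and mask
    everything else, so roughly (1 - keep_ratio) of patches end up masked.
    Block size and keep ratio are placeholder values, not the paper's."""
    keep = torch.zeros(grid_h, grid_w, dtype=torch.bool)
    target = int(keep_ratio * grid_h * grid_w)
    while keep.sum() < target:
        top = torch.randint(0, grid_h - block + 1, (1,)).item()
        left = torch.randint(0, grid_w - block + 1, (1,)).item()
        keep[top:top + block, left:left + block] = True
    return ~keep.flatten()  # True = masked patch

def ufo_loss(student_patches, student_utterance, teacher_patches, mask, lam=1.0):
    """Utterance-Frame Objective (sketch): a frame-level regression loss on the
    masked patches plus an utterance-level loss on a pooled clip embedding, both
    regressing targets produced by an EMA teacher that sees the unmasked input.
    student_patches, teacher_patches: (num_patches, dim); student_utterance: (dim,)."""
    frame_loss = F.mse_loss(student_patches[mask], teacher_patches[mask])
    utterance_loss = F.mse_loss(student_utterance, teacher_patches.mean(dim=0))
    return frame_loss + lam * utterance_loss
```

In a data2vec 2.0-style setup the teacher is an exponential-moving-average copy of the student encoder; how EAT pools the utterance-level target and weights the two terms should be taken from the paper itself.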
Related papers
- Solution for Temporal Sound Localisation Task of ECCV Second Perception Test Challenge 2024 [3.4947857354806633]
This report proposes an improved method for the Temporal Sound Localisation task.
It localizes and classifies the sound events occurring in the video according to a predefined set of sound classes.
Our approach ranks first in the final test with a score of 0.4925.
arXiv Detail & Related papers (2024-09-29T07:28:21Z)
- SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model [12.399378490833818]
Self-Supervised Audio Mamba (SSAMBA) is the first self-supervised, attention-free, and SSM-based model for audio representation learning.
Our results demonstrate that SSAMBA outperforms the Self-Supervised Audio Spectrogram Transformer (SSAST) in most tasks.
arXiv Detail & Related papers (2024-05-20T06:58:47Z)
- AudioFormer: Audio Transformer learns audio feature representations from discrete acoustic codes [6.375996974877916]
We propose a method named AudioFormer, which learns audio feature representations through the acquisition of discrete acoustic codes.
Our research outcomes demonstrate that AudioFormer attains significantly improved performance compared to prevailing monomodal audio classification models.
arXiv Detail & Related papers (2023-08-14T15:47:25Z)
- MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training.
Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
arXiv Detail & Related papers (2023-05-31T18:27:43Z)
- BEATs: Audio Pre-Training with Acoustic Tokenizers [77.8510930885778]
Self-supervised learning (SSL) has seen massive growth in the language, vision, speech, and audio domains over the past few years.
We propose BEATs, an iterative audio pre-training framework to learn Bidirectional Encoder representations from Audio Transformers.
In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask and label prediction manner.
Then, we train an acoustic tokenizer for the next iteration by distilling the semantic knowledge from the pre-trained or fine-tuned audio SSL model.
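As a concrete illustration of that first-iteration tokenizer, the sketch below implements a random-projection quantizer in PyTorch: patch features are projected by a frozen random matrix and labelled with the nearest entry of a frozen random codebook, and the SSL model is then trained to predict these labels at masked positions. The class name, dimensions, and codebook size are assumptions for illustration, not the BEATs implementation; in later iterations this tokenizer is replaced by one distilled from the trained model, as described above.

```python
import torch

class RandomProjectionTokenizer:
    """Sketch of a first-iteration acoustic tokenizer: a frozen random projection
    followed by nearest-neighbour lookup in a frozen random codebook."""

    def __init__(self, feat_dim=128, code_dim=16, num_codes=1024, seed=0):
        g = torch.Generator().manual_seed(seed)
        self.proj = torch.randn(feat_dim, code_dim, generator=g)       # frozen, never trained
        self.codebook = torch.randn(num_codes, code_dim, generator=g)  # frozen, never trained

    def __call__(self, patch_features):
        # patch_features: (num_patches, feat_dim) spectrogram patch embeddings
        z = patch_features @ self.proj              # (num_patches, code_dim)
        dists = torch.cdist(z, self.codebook)       # (num_patches, num_codes)
        return dists.argmin(dim=-1)                 # discrete labels, (num_patches,)

# Example: discrete labels for a clip of 512 patches; a masked model would be
# trained with cross-entropy to predict labels[mask] from the visible context.
labels = RandomProjectionTokenizer()(torch.randn(512, 128))
```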
arXiv Detail & Related papers (2022-12-18T10:41:55Z)
- MAViL: Masked Audio-Video Learners [68.61844803682145]
We present Masked Audio-Video learners (MAViL) to train audio-visual representations.
Pre-training with MAViL enables the model to perform well in audio-visual classification and retrieval tasks.
For the first time, a self-supervised audio-visual model outperforms models that use external supervision on these benchmarks.
arXiv Detail & Related papers (2022-12-15T18:59:59Z)
- Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
We propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which combines contrastive learning and masked data modeling to learn a joint audio-visual representation.
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [20.316239155843963]
We propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks.
On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which is a new state-of-the-art on this dataset.
arXiv Detail & Related papers (2021-10-14T12:32:40Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT achieves performance competitive with much larger models on the downstream tasks.
In probing experiments, we find that the intermediate latent representations encode richer phoneme and speaker information than the last layer does.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)