BEATs: Audio Pre-Training with Acoustic Tokenizers
- URL: http://arxiv.org/abs/2212.09058v1
- Date: Sun, 18 Dec 2022 10:41:55 GMT
- Title: BEATs: Audio Pre-Training with Acoustic Tokenizers
- Authors: Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo
Chen, Furu Wei
- Abstract summary: Self-supervised learning (SSL) has seen massive growth in the language, vision, speech, and audio domains over the past few years.
We propose BEATs, an iterative audio pre-training framework to learn Bidirectional Encoder representation from Audio Transformers.
In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask and label prediction manner.
Then, we train an acoustic tokenizer for the next iteration by distilling the semantic knowledge from the pre-trained or fine-tuned audio SSL model.
- Score: 77.8510930885778
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The massive growth of self-supervised learning (SSL) has been witnessed in
language, vision, speech, and audio domains over the past few years. While
discrete label prediction is widely adopted for other modalities, the
state-of-the-art audio SSL models still employ reconstruction loss for
pre-training. Compared with reconstruction loss, semantic-rich discrete label
prediction encourages the SSL model to abstract the high-level audio semantics
and discard the redundant details as in human perception. However, a
semantic-rich acoustic tokenizer for general audio pre-training is usually not
straightforward to obtain, due to the continuous property of audio and
unavailable phoneme sequences like speech. To tackle this challenge, we propose
BEATs, an iterative audio pre-training framework to learn Bidirectional Encoder
representation from Audio Transformers, where an acoustic tokenizer and an
audio SSL model are optimized by iterations. In the first iteration, we use
random projection as the acoustic tokenizer to train an audio SSL model in a
mask and label prediction manner. Then, we train an acoustic tokenizer for the
next iteration by distilling the semantic knowledge from the pre-trained or
fine-tuned audio SSL model. The iteration is repeated with the hope of mutual
promotion of the acoustic tokenizer and audio SSL model. The experimental
results demonstrate our acoustic tokenizers can generate discrete labels with
rich audio semantics and our audio SSL models achieve state-of-the-art results
across various audio classification benchmarks, even outperforming previous
models that use significantly more training data and model parameters.
Specifically, we set a new state-of-the-art mAP 50.6% on AudioSet-2M for
audio-only models without using any external data, and 98.1% accuracy on
ESC-50. The code and pre-trained models are available at https://aka.ms/beats.
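To make the iterative recipe in the abstract concrete, here is a minimal sketch in PyTorch. All class and function names, shapes, and the SSL model's call signature are illustrative assumptions, not the API of the released code at https://aka.ms/beats.
```python
# Illustrative sketch of the iterative mask-and-label-prediction recipe from the
# abstract. Names and shapes are assumptions, not the released BEATs implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RandomProjectionTokenizer(nn.Module):
    """Iteration-1 tokenizer: a frozen random projection followed by
    nearest-codeword assignment, yielding a discrete label per audio patch."""

    def __init__(self, feat_dim=768, proj_dim=256, codebook_size=1024):
        super().__init__()
        self.register_buffer("proj", torch.randn(feat_dim, proj_dim))
        self.register_buffer("codebook", torch.randn(codebook_size, proj_dim))

    @torch.no_grad()
    def forward(self, patch_feats):
        # patch_feats: (batch, num_patches, feat_dim) -> labels: (batch, num_patches)
        z = patch_feats @ self.proj
        dists = torch.cdist(z, self.codebook.unsqueeze(0).expand(z.size(0), -1, -1))
        return dists.argmin(dim=-1)


def masked_label_prediction_loss(ssl_model, tokenizer, patch_feats, mask_ratio=0.75):
    """One training step: mask a subset of patches and ask the SSL model to
    predict the tokenizer's discrete labels at the masked positions."""
    labels = tokenizer(patch_feats)                        # (B, N) discrete targets
    mask = torch.rand(labels.shape, device=labels.device) < mask_ratio
    logits = ssl_model(patch_feats, mask)                  # (B, N, codebook_size), assumed signature
    return F.cross_entropy(logits[mask], labels[mask])


# Iterative schedule sketched from the abstract:
#   iteration 1: tokenizer = RandomProjectionTokenizer(); pre-train the SSL model
#                with masked_label_prediction_loss against its labels.
#   iteration k: train a new, learnable tokenizer by distilling semantic knowledge
#                from the pre-trained or fine-tuned SSL model, then repeat the
#                masked prediction step with the refreshed labels.
```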
Related papers
- How Should We Extract Discrete Audio Tokens from Self-Supervised Models? [15.03039528965825]
This paper explores the optimal configuration of semantic tokens across discriminative and generative tasks.
We propose a scalable solution to train a universal vocoder across multiple SSL layers.
arXiv Detail & Related papers (2024-06-15T20:43:07Z)
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models [66.11184017840688]
We introduce C3LLM, a novel framework combining three tasks of video-to-audio, audio-to-text, and text-to-audio together.
C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities.
Our method combines the previous tasks of audio understanding, video-to-audio generation, and text-to-audio generation together into one unified model.
arXiv Detail & Related papers (2024-05-25T09:10:12Z)
- EAT: Self-Supervised Pre-Training with Efficient Audio Transformer [2.443213094810588]
Efficient Audio Transformer (EAT) is inspired by the success of data2vec 2.0 in image modality and Audio-MAE in audio modality.
A novel Utterance-Frame Objective (UFO) is designed to enhance the modeling capability of acoustic events.
Experiment results demonstrate that EAT achieves state-of-the-art (SOTA) performance on a range of audio-related tasks.
arXiv Detail & Related papers (2024-01-07T14:31:27Z)
- AudioFormer: Audio Transformer learns audio feature representations from discrete acoustic codes [6.375996974877916]
We propose a method named AudioFormer, which learns audio feature representations through the acquisition of discrete acoustic codes.
Our research outcomes demonstrate that AudioFormer attains significantly improved performance compared to prevailing monomodal audio classification models.
arXiv Detail & Related papers (2023-08-14T15:47:25Z)
- AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining [46.22290575167155]
This paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation.
Our framework introduces a general representation of audio, called "language of audio" (LOA)
arXiv Detail & Related papers (2023-08-10T17:55:13Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training.
Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
arXiv Detail & Related papers (2023-05-31T18:27:43Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- Self-Supervised Learning for Speech Enhancement through Synthesis [5.924928860260821]
We propose a denoising vocoder (DeVo) approach, where a vocoder accepts noisy representations and learns to directly synthesize clean speech.
We demonstrate a causal version capable of running on streaming audio with 10ms latency and minimal performance degradation.
arXiv Detail & Related papers (2022-11-04T16:06:56Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT is capable of achieving competitive performance with those huge models in the downstream tasks.
In probing experiments, we find that the latent representations encode richer information about both phonemes and speakers than the last layer does.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)