AudioFormer: Audio Transformer learns audio feature representations from
discrete acoustic codes
- URL: http://arxiv.org/abs/2308.07221v6
- Date: Fri, 25 Aug 2023 12:33:22 GMT
- Title: AudioFormer: Audio Transformer learns audio feature representations from
discrete acoustic codes
- Authors: Zhaohui Li and Haitao Wang and Xinghua Jiang
- Abstract summary: We propose a method named AudioFormer, which learns audio feature representations through the acquisition of discrete acoustic codes.
Our research outcomes demonstrate that AudioFormer attains significantly improved performance compared to prevailing monomodal audio classification models.
- Score: 6.375996974877916
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We propose a method named AudioFormer, which learns audio feature
representations through the acquisition of discrete acoustic codes and
subsequently fine-tunes them for audio classification tasks. Initially, we
introduce a novel perspective by considering the audio classification task as a
form of natural language understanding (NLU). Leveraging an existing neural
audio codec model, we generate discrete acoustic codes and utilize them to
train a masked language model (MLM), thereby obtaining audio feature
representations. Furthermore, we pioneer the integration of a Multi-Positive
sample Contrastive (MPC) learning approach. This method enables the learning of
joint representations among multiple discrete acoustic codes within the same
audio input. In our experiments, we treat discrete acoustic codes as textual
data and train a masked language model using a cloze-like methodology,
ultimately deriving high-quality audio representations. Notably, the MPC
learning technique effectively captures collaborative representations among
distinct positive samples. Our research outcomes demonstrate that AudioFormer
attains significantly improved performance compared to prevailing monomodal
audio classification models across multiple datasets, and even outperforms
audio-visual multimodal classification models on select datasets. Specifically,
our approach achieves remarkable results on datasets including AudioSet (2M,
20K) and FSD50K, with performance scores of 53.9, 45.1, and 65.6, respectively.
We have openly shared both the code and models:
https://github.com/LZH-0225/AudioFormer.git.
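The pipeline outlined in the abstract can be illustrated with a short sketch. The code below is a minimal, hypothetical reconstruction rather than the authors' implementation: it assumes the discrete codes have already been produced by some neural audio codec (the paper relies on an existing codec model, whose API is not shown here), and all names such as `CodeMLM`, `mask_codes`, and `mpc_loss` are placeholders. It shows (1) cloze-style masking of discrete acoustic codes for masked-language-model training and (2) a simple multi-positive contrastive loss that treats the pooled representations of different codebooks of the same clip as positives, in the spirit of the MPC objective.

```python
# Minimal sketch of the pipeline described above, assuming the discrete codes
# have already been produced by a neural audio codec (batch x codebooks x time).
# All names below are illustrative placeholders, not the authors' code or the
# codec's real API.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 1024        # assumed size of each codec codebook
MASK_ID = VOCAB_SIZE     # extra token id reserved for the cloze-style [MASK]

class CodeMLM(nn.Module):
    """Masked language model over discrete acoustic codes."""
    def __init__(self, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE + 1, d_model)   # +1 for [MASK]
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, codes):                  # codes: (B, T) int64
        h = self.encoder(self.embed(codes))    # (B, T, d_model)
        return self.lm_head(h), h              # token logits, hidden states

def mask_codes(codes, p=0.15):
    """Cloze-style corruption: replace a random subset of codes with [MASK]."""
    mask = torch.rand(codes.shape) < p
    corrupted = codes.clone()
    corrupted[mask] = MASK_ID
    return corrupted, mask

def mpc_loss(views, temperature=0.1):
    """Multi-positive contrastive loss: pooled embeddings of the different
    codebooks of the same clip are treated as positives for each other."""
    B, K, d = views.shape
    z = F.normalize(views.reshape(B * K, d), dim=-1)
    sim = z @ z.t() / temperature                       # (B*K, B*K)
    sim.fill_diagonal_(float("-inf"))                   # drop self-similarity
    clip_id = torch.arange(B).repeat_interleave(K)      # positives share a clip
    pos = clip_id.unsqueeze(0) == clip_id.unsqueeze(1)
    pos.fill_diagonal_(False)
    log_prob = sim - sim.logsumexp(dim=-1, keepdim=True)
    return -log_prob[pos].mean()

# Usage sketch with random stand-in codes (batch of 8 clips, 4 codebooks).
codes = torch.randint(0, VOCAB_SIZE, (8, 4, 250))
model = CodeMLM()
inp, mask = mask_codes(codes[:, 0])                     # MLM on one codebook
logits, _ = model(inp)
mlm = F.cross_entropy(logits[mask], codes[:, 0][mask])
pooled = torch.stack([model(codes[:, k])[1].mean(dim=1) for k in range(4)], dim=1)
total_loss = mlm + mpc_loss(pooled)                     # joint pre-training loss
```

In practice the codes would come from the codec's quantizer rather than `torch.randint`, the two losses would be combined during pre-training, and the resulting encoder would then be fine-tuned for audio classification, as the abstract describes.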
Related papers
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models [66.11184017840688]
We introduce C3LLM, a novel framework that combines three tasks: video-to-audio, audio-to-text, and text-to-audio.
C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities.
Our method unifies the previously separate tasks of audio understanding, video-to-audio generation, and text-to-audio generation in a single model.
arXiv Detail & Related papers (2024-05-25T09:10:12Z)
- AudioSetMix: Enhancing Audio-Language Datasets with LLM-Assisted Augmentations [1.2101820447447276]
Multi-modal learning in the audio-language domain has seen significant advancements in recent years.
However, audio-language learning faces challenges due to limited and lower-quality data compared to image-language tasks.
Our method systematically generates audio-caption pairs by augmenting audio clips with natural language labels and corresponding audio signal processing operations.
This scalable method produces AudioSetMix, a high-quality training dataset for text-and-audio related models.
arXiv Detail & Related papers (2024-05-17T21:08:58Z)
- Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining [46.22290575167155]
This paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation.
Our framework introduces a general representation of audio, called "language of audio" (LOA).
arXiv Detail & Related papers (2023-08-10T17:55:13Z)
- BEATs: Audio Pre-Training with Acoustic Tokenizers [77.8510930885778]
Self-supervised learning (SSL) has seen massive growth in the language, vision, speech, and audio domains over the past few years.
We propose BEATs, an iterative audio pre-training framework to learn Bidirectional representation from Audio Transformers.
In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask and label prediction manner.
Then, we train an acoustic tokenizer for the next iteration by distilling the semantic knowledge from the pre-trained or fine-tuned audio SSL model.
arXiv Detail & Related papers (2022-12-18T10:41:55Z)
- Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
This work proposes the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which combines contrastive learning with masked data modeling.
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z)
- Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data [9.072124914105325]
We present an audiovisual fusion model that learns to recognize sounds from weakly labeled video recordings.
Experiments on the large scale sound events dataset, AudioSet, demonstrate the efficacy of the proposed model.
arXiv Detail & Related papers (2020-05-29T01:30:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.