PSLA: Improving Audio Event Classification with Pretraining, Sampling,
Labeling, and Aggregation
- URL: http://arxiv.org/abs/2102.01243v1
- Date: Tue, 2 Feb 2021 01:00:38 GMT
- Title: PSLA: Improving Audio Event Classification with Pretraining, Sampling,
Labeling, and Aggregation
- Authors: Yuan Gong, Yu-An Chung, and James Glass
- Abstract summary: We present PSLA, a collection of training techniques that can noticeably boost model accuracy.
We obtain a model that achieves a new state-of-the-art mean average precision (mAP) of 0.474 on AudioSet, outperforming the previous best system's 0.439.
- Score: 19.09439093130855
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio event classification is an active research area and has a wide range of
applications. Since the release of AudioSet, great progress has been made in
advancing the classification accuracy, which mostly comes from the development
of novel model architectures and attention modules. However, we find that
appropriate training techniques are equally important for building audio event
classification models with AudioSet, but have not received the attention they
deserve. To fill this gap, we present PSLA, a collection of training
techniques that can noticeably boost model accuracy, including ImageNet
pretraining, balanced sampling, data augmentation, label enhancement, and
model aggregation, together with their design choices. By training an
EfficientNet with these techniques, we obtain a model that achieves a new
state-of-the-art mean average precision (mAP) of 0.474 on AudioSet,
outperforming the previous best system's 0.439.
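The abstract names the techniques but not their recipes. As a rough,
non-authoritative illustration, the following minimal PyTorch sketch shows
plausible forms of three of them: balanced sampling, mixup-style data
augmentation, and weight-averaging model aggregation. Every function name,
shape, and hyperparameter here is an assumption for illustration, not the
authors' code.

import copy
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

def balanced_sampler(labels: np.ndarray) -> WeightedRandomSampler:
    # Balanced sampling: weight each clip by the rarity of its classes so
    # tail classes are drawn about as often as head classes.
    # `labels` is an (N, C) multi-hot 0/1 matrix (an assumed data layout).
    class_counts = labels.sum(axis=0)                      # (C,)
    class_weights = 1.0 / np.maximum(class_counts, 1.0)    # rarer -> heavier
    sample_weights = (labels * class_weights).sum(axis=1)  # (N,)
    return WeightedRandomSampler(
        torch.as_tensor(sample_weights, dtype=torch.double),
        num_samples=len(sample_weights),
        replacement=True,
    )

def mixup(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.5):
    # Data augmentation via mixup: blend random pairs of spectrograms and
    # their multi-hot label vectors with a Beta-distributed coefficient.
    lam = float(np.random.beta(alpha, alpha))
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]

def average_checkpoints(models):
    # Model aggregation by weight averaging: average the floating-point
    # parameters of several checkpoints of one architecture.
    avg = copy.deepcopy(models[0])
    state = avg.state_dict()
    for key, value in state.items():
        if value.is_floating_point():
            state[key] = torch.stack(
                [m.state_dict()[key] for m in models]).mean(dim=0)
    avg.load_state_dict(state)
    return avg

The sampler would replace shuffling in a DataLoader; PSLA's aggregation also
includes ensembling model predictions, which this sketch omits.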
Related papers
- Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models [53.48409081555687]
In this work, we explore large pre-trained models to obtain features: CLIP for visual features and CLAP for audio features.
We propose a simple yet effective model that only relies on feed-forward neural networks.
Our framework achieves state-of-the-art performance on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL.
arXiv Detail & Related papers (2024-04-09T13:39:37Z)
- EAT: Self-Supervised Pre-Training with Efficient Audio Transformer [2.443213094810588]
The Efficient Audio Transformer (EAT) is inspired by the success of data2vec 2.0 in the image modality and Audio-MAE in the audio modality.
A novel Utterance-Frame Objective (UFO) is designed to enhance the modeling of acoustic events.
Experimental results demonstrate that EAT achieves state-of-the-art (SOTA) performance on a range of audio-related tasks.
arXiv Detail & Related papers (2024-01-07T14:31:27Z)
- MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels for masked language modelling (MLM)-style acoustic pre-training.
Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
arXiv Detail & Related papers (2023-05-31T18:27:43Z)
- LEAN: Light and Efficient Audio Classification Network [1.5070398746522742]
We propose a lightweight on-device deep learning-based model for audio classification, LEAN.
LEAN consists of a raw-waveform-based temporal feature extractor called Wave realignment and a log-mel-based pretrained YAMNet.
We show that combining a trainable wave encoder and a pretrained YAMNet with cross-attention-based temporal realignment yields competitive performance on downstream audio classification tasks with a smaller memory footprint.
arXiv Detail & Related papers (2023-05-22T04:45:04Z)
- BEATs: Audio Pre-Training with Acoustic Tokenizers [77.8510930885778]
Self-supervised learning (SSL) has witnessed great success in the language, vision, speech, and audio domains over the past few years.
We propose BEATs, an iterative audio pre-training framework to learn Bidirectional Encoder representations from Audio Transformers.
In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask-and-label prediction manner (see the sketch after this entry).
Then, we train an acoustic tokenizer for the next iteration by distilling the semantic knowledge from the pre-trained or fine-tuned audio SSL model.
arXiv Detail & Related papers (2022-12-18T10:41:55Z)
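The first-iteration tokenizer described above lends itself to a compact
illustration. The following self-contained Python sketch (an assumption-laden
toy, not the BEATs code) quantizes spectrogram frames with a fixed random
projection and builds mask-and-label prediction targets; the distillation
step of later iterations is omitted.

import numpy as np

class RandomProjectionTokenizer:
    # Iteration-1 acoustic tokenizer: project each frame with a fixed random
    # matrix, then label it with the index of the nearest random code vector.
    def __init__(self, feat_dim=128, proj_dim=16, vocab_size=1024, seed=0):
        rng = np.random.default_rng(seed)
        self.proj = rng.normal(size=(feat_dim, proj_dim))        # fixed, untrained
        self.codebook = rng.normal(size=(vocab_size, proj_dim))  # random codes

    def encode(self, frames: np.ndarray) -> np.ndarray:
        # frames: (T, feat_dim) -> (T,) discrete frame labels.
        z = frames @ self.proj
        # argmin over codes c of ||z - c||^2; the per-frame constant ||z||^2
        # does not affect the argmin, so it is dropped.
        d2 = (self.codebook ** 2).sum(axis=1) - 2.0 * z @ self.codebook.T
        return d2.argmin(axis=1)

# Mask-and-label prediction targets: hide a fraction of frames; the audio SSL
# model is trained to predict the tokenizer's labels at the masked positions.
frames = np.random.randn(1000, 128)        # stand-in log-mel spectrogram
labels = RandomProjectionTokenizer().encode(frames)
mask = np.random.rand(len(labels)) < 0.75  # assumed masking ratio
targets = labels[mask]                     # targets at masked frames

Later iterations replace this random tokenizer with one distilled from the
trained model, which is how semantic knowledge enters the labels.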
- Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
We propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE), which combines contrastive learning and masked data modeling to learn a joint audio-visual representation.
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z)
- Segment-level Metric Learning for Few-shot Bioacoustic Event Detection [56.59107110017436]
We propose a segment-level few-shot learning framework that utilizes both the positive and negative events during model optimization.
Our system achieves an F-measure of 62.73 on the DCASE 2022 challenge task 5 (DCASE2022-T5) validation set, outperforming the baseline prototypical network's 34.02 by a large margin.
arXiv Detail & Related papers (2022-07-15T22:41:30Z)
- A study on joint modeling and data augmentation of multi-modalities for
audio-visual scene classification [64.59834310846516]
We propose two techniques, namely joint modeling and data augmentation, to improve system performance for audio-visual scene classification (AVSC).
Our final system can achieve the best accuracy of 94.2% among all single AVSC systems submitted to DCASE 2021 Task 1b.
arXiv Detail & Related papers (2022-03-07T07:29:55Z)
- AudioCLIP: Extending CLIP to Image, Text and Audio [6.585049648605185]
We present an extension of the CLIP model that handles audio in addition to text and images.
Our proposed model incorporates the ESResNeXt audio model into the CLIP framework using the AudioSet dataset (a rough sketch of this tri-modal setup follows after this entry).
AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task.
arXiv Detail & Related papers (2021-06-24T14:16:38Z)
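To make the extension concrete, here is a minimal, runnable sketch of a
CLIP-style symmetric contrastive objective applied to all three modality
pairs. It illustrates the general tri-modal idea under assumed embedding
shapes and a made-up batch; it is not AudioCLIP's actual implementation.

import torch
import torch.nn.functional as F

def clip_pair_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07):
    # Symmetric CLIP-style contrastive loss for one modality pair: matching
    # items lie on the diagonal of the batch similarity matrix.
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Stand-in embeddings from image, text, and audio encoders projected into a
# shared 512-d space (random here purely for illustration).
img, txt, aud = (torch.randn(8, 512) for _ in range(3))
loss = (clip_pair_loss(img, txt)
        + clip_pair_loss(img, aud)
        + clip_pair_loss(txt, aud))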