SoundCLR: Contrastive Learning of Representations For Improved
Environmental Sound Classification
- URL: http://arxiv.org/abs/2103.01929v1
- Date: Tue, 2 Mar 2021 18:42:45 GMT
- Title: SoundCLR: Contrastive Learning of Representations For Improved
Environmental Sound Classification
- Authors: Alireza Nasiri and Jianjun Hu
- Abstract summary: SoundCLR is a supervised contrastive learning method for effective environmental sound classification with state-of-the-art performance.
Due to the comparatively small sizes of the available environmental sound datasets, we propose and exploit a transfer learning and strong data augmentation pipeline.
Our experiments show that our masking-based augmentation technique on the log-mel spectrograms can significantly improve the recognition performance.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Environmental Sound Classification (ESC) is a challenging field of research
in non-speech audio processing. Most current research in ESC focuses on
designing deep models with special architectures tailored for specific audio
datasets, which usually cannot exploit the intrinsic patterns in the data.
However, recent studies have surprisingly shown that transfer learning from
models trained on ImageNet is a very effective technique in ESC. Herein, we
propose SoundCLR, a supervised contrastive learning method for effective
environmental sound classification with state-of-the-art performance, which works
by learning representations that disentangle the samples of each class from
those of other classes. Our deep network models are trained by combining a
contrastive loss, which contributes to a better probability output from the
classification layer, with a cross-entropy loss on the classifier output that
maps each sample to its one-hot encoded label. Due to the
comparatively small sizes of the available environmental sound datasets, we
propose and exploit a transfer learning and strong data augmentation pipeline
and apply the augmentations on both the sound signals and their log-mel
spectrograms before inputting them to the model. Our experiments show that our
masking-based augmentation technique on the log-mel spectrograms can
significantly improve the recognition performance. Our extensive benchmark
experiments show that our hybrid deep network models trained with combined
contrastive and cross-entropy losses achieved state-of-the-art performance on
three benchmark datasets, ESC-10, ESC-50, and US8K, with validation accuracies of
99.75%, 93.4%, and 86.49%, respectively. The ensemble version of our models
also outperforms other top ensemble methods. The code is available at
https://github.com/alireza-nasiri/SoundCLR.
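
As a rough illustration of the hybrid objective described in the abstract, the sketch below combines a supervised contrastive term over L2-normalized embeddings with a standard cross-entropy term. The temperature and the mixing weight `alpha` are illustrative assumptions, not the authors' settings; the linked repository contains the actual implementation.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over a batch.

    embeddings: (B, D) L2-normalized representations
    labels:     (B,) integer class labels
    """
    sim = embeddings @ embeddings.T / temperature           # (B, B) similarities
    eye = torch.eye(len(labels), dtype=torch.bool, device=sim.device)
    sim.masked_fill_(eye, float('-inf'))                    # drop self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)     # row-wise log-softmax
    pos_mask = (labels[:, None] == labels[None, :]) & ~eye  # same-class pairs
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss_per_anchor = -(log_prob * pos_mask).sum(dim=1) / pos_counts
    return loss_per_anchor[pos_mask.any(dim=1)].mean()      # anchors with positives

def hybrid_loss(logits, embeddings, labels, alpha=0.5):
    """Weighted sum of cross-entropy and supervised contrastive terms.

    `alpha` is a hypothetical mixing weight, not a value from the paper.
    """
    ce = F.cross_entropy(logits, labels)
    scl = supervised_contrastive_loss(F.normalize(embeddings, dim=1), labels)
    return alpha * ce + (1 - alpha) * scl
```

The contrastive term pulls same-class samples together and pushes other classes apart in the embedding space, while the cross-entropy term ties the classifier output to the one-hot targets.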
Related papers
- Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) that applies an affine transformation to the feature space to mitigate the malignant effect of the noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- Exploring Self-Supervised Contrastive Learning of Spatial Sound Event Representation [21.896817015593122]
MC-SimCLR learns joint spectral and spatial representations from unlabeled spatial audio.
We propose a multi-level data augmentation pipeline that augments different levels of audio features.
We find that linear layers on top of the learned representation significantly outperform supervised models in terms of both event classification accuracy and localization error.
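A minimal sketch of the linear-evaluation protocol referenced above, assuming a frozen pretrained encoder (a stand-in here, not the MC-SimCLR architecture) and a single trainable linear layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for a pretrained spectrogram encoder, frozen for linear evaluation.
encoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 512),
)
for p in encoder.parameters():
    p.requires_grad = False
encoder.eval()

probe = nn.Linear(512, 50)  # one linear layer on top of the representation
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_step(spectrograms, labels):
    with torch.no_grad():            # the encoder is never updated
        feats = encoder(spectrograms)
    loss = F.cross_entropy(probe(feats), labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```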
arXiv Detail & Related papers (2023-09-27T18:23:03Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using audio and visual modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- Improved Zero-Shot Audio Tagging & Classification with Patchout Spectrogram Transformers [7.817685358710508]
Zero-Shot (ZS) learning overcomes the restriction to a fixed set of training classes by predicting classes based on adaptable class descriptions.
This study sets out to investigate the effectiveness of self-attention-based audio embedding architectures for ZS learning.
arXiv Detail & Related papers (2022-08-24T09:48:22Z)
- Audio-Visual Scene Classification Using A Transfer Learning Based Joint Optimization Strategy [26.975596225131824]
We propose a joint training framework, using acoustic features and raw images directly as inputs for the audio-visual scene classification (AVSC) task.
Specifically, we retrieve the bottom layers of pre-trained image models as the visual encoder, and jointly optimize the scene classifier and the 1D-CNN-based acoustic encoder during training.
arXiv Detail & Related papers (2022-04-25T03:37:02Z)
- Learning with Neighbor Consistency for Noisy Labels [69.83857578836769]
We present a method for learning from noisy labels that leverages similarities between training examples in feature space.
We evaluate our method on datasets exhibiting both synthetic (CIFAR-10, CIFAR-100) and realistic (mini-WebVision, Clothing1M, mini-ImageNet-Red) noise.
arXiv Detail & Related papers (2022-02-04T15:46:27Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
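Spectrogram augmentation of this kind, like the masking-based augmentation in SoundCLR above, is often realized SpecAugment-style by zeroing a random frequency band and a random time span. A generic sketch, with mask widths as illustrative defaults rather than values from either paper:

```python
import numpy as np

def mask_spectrogram(spec, max_freq_width=16, max_time_width=32, rng=None):
    """Zero out one random frequency band and one random time span
    of an (n_mels, n_frames) log-mel spectrogram."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_mels, n_frames = out.shape
    f = rng.integers(0, max_freq_width + 1)      # frequency-mask width
    f0 = rng.integers(0, max(n_mels - f, 1))     # band start
    out[f0:f0 + f, :] = 0.0
    t = rng.integers(0, max_time_width + 1)      # time-mask width
    t0 = rng.integers(0, max(n_frames - t, 1))   # span start
    out[:, t0:t0 + t] = 0.0
    return out
```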
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- No Fear of Heterogeneity: Classifier Calibration for Federated Learning with Non-IID Data [78.69828864672978]
A central challenge in training classification models in real-world federated systems is learning with non-IID data.
We propose a novel and simple algorithm called Classifier Calibration with Virtual Representations (CCVR), which adjusts the classifier using virtual representations sampled from an approximated Gaussian mixture model.
Experimental results demonstrate that CCVR achieves state-of-the-art performance on popular federated learning benchmarks, including CIFAR-10, CIFAR-100, and CINIC-10.
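A loose sketch of the CCVR idea: estimate per-class feature statistics, sample virtual representations from the resulting Gaussians, and tune only the classifier on them. The diagonal covariance and the hyperparameters below are simplifying assumptions, not the paper's exact procedure:

```python
import torch
import torch.nn.functional as F

def calibrate_classifier(classifier, feats, labels, n_classes,
                         samples_per_class=100, steps=100, lr=1e-3):
    """Tune a classifier on virtual representations drawn from
    per-class Gaussians fitted to the observed features."""
    virtual, targets = [], []
    for c in range(n_classes):
        fc = feats[labels == c]
        if len(fc) < 2:                  # need >= 2 samples for a std estimate
            continue
        mu, std = fc.mean(dim=0), fc.std(dim=0) + 1e-6
        virtual.append(mu + std * torch.randn(samples_per_class, fc.shape[1]))
        targets.append(torch.full((samples_per_class,), c, dtype=torch.long))
    virtual, targets = torch.cat(virtual), torch.cat(targets)
    opt = torch.optim.SGD(classifier.parameters(), lr=lr)
    for _ in range(steps):
        loss = F.cross_entropy(classifier(virtual), targets)
        opt.zero_grad(); loss.backward(); opt.step()
    return classifier
```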
arXiv Detail & Related papers (2021-06-09T12:02:29Z)
- Augmentation Strategies for Learning with Noisy Labels [3.698228929379249]
We evaluate different augmentation strategies for algorithms tackling the "learning with noisy labels" problem.
We find that using one set of augmentations for loss modeling tasks and another set for learning is the most effective.
We apply this augmentation strategy to a state-of-the-art technique and demonstrate that it improves performance across all evaluated noise levels.
arXiv Detail & Related papers (2021-03-03T02:19:35Z)
- An Ensemble of Convolutional Neural Networks for Audio Classification [9.174145063580882]
Ensembles of CNNs for audio classification are presented and tested on three freely available audio classification datasets.
To the best of our knowledge, this is the most extensive study investigating ensembles of CNNs for audio classification.
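A minimal sketch of soft-voting over such an ensemble; averaging softmax outputs is one common fusion rule and an assumption here, not necessarily the combination used in the paper:

```python
import torch

def ensemble_predict(models, batch):
    """Average class probabilities across models and take the argmax."""
    with torch.no_grad():
        probs = [m(batch).softmax(dim=1) for m in models]
    return torch.stack(probs).mean(dim=0).argmax(dim=1)
```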
arXiv Detail & Related papers (2020-07-15T19:41:15Z)