Exploring Self-Supervised Contrastive Learning of Spatial Sound Event
Representation
- URL: http://arxiv.org/abs/2309.15938v1
- Date: Wed, 27 Sep 2023 18:23:03 GMT
- Authors: Xilin Jiang, Cong Han, Yinghao Aaron Li, Nima Mesgarani
- Abstract summary: MC-SimCLR learns joint spectral and spatial representations from unlabeled spatial audios.
We propose a multi-level data augmentation pipeline that augments different levels of audio features.
We find that linear layers on top of the learned representation significantly outperform supervised models in terms of both event classification accuracy and localization error.
- Score: 21.896817015593122
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this study, we present a simple multi-channel framework for contrastive
learning (MC-SimCLR) to encode 'what' and 'where' of spatial audios. MC-SimCLR
learns joint spectral and spatial representations from unlabeled spatial
audios, thereby enhancing both event classification and sound localization in
downstream tasks. At its core, we propose a multi-level data augmentation
pipeline that augments different levels of audio features, including waveforms,
Mel spectrograms, and generalized cross-correlation (GCC) features. In
addition, we introduce simple yet effective channel-wise augmentation methods
to randomly swap the order of the microphones and mask Mel and GCC channels. By
using these augmentations, we find that linear layers on top of the learned
representation significantly outperform supervised models in terms of both
event classification accuracy and localization error. We also perform a
comprehensive analysis of the effect of each augmentation method and a
comparison of the fine-tuning performance using different amounts of labeled
data.
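The two channel-wise augmentations named in the abstract (randomly swapping the microphone order and masking Mel and GCC channels) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, array shapes, and the masking probability `p` are assumptions made for illustration, and the abstract does not specify how the masking is parameterized.

```python
import numpy as np


def swap_channels(wave, rng):
    """Randomly permute the microphone order of a (channels, samples) waveform.

    Hypothetical helper: returns the permuted waveform and the permutation used.
    """
    perm = rng.permutation(wave.shape[0])
    return wave[perm], perm


def mask_channels(features, rng, p=0.25):
    """Zero out whole channels of a (channels, freq, time) feature stack.

    Each channel is dropped independently with probability p (an assumed
    hyperparameter); at least one channel is always kept.
    """
    keep = rng.random(features.shape[0]) >= p
    if not keep.any():
        # Avoid an all-zero input by re-enabling one random channel.
        keep[rng.integers(features.shape[0])] = True
    return features * keep[:, None, None], keep


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    wave = rng.standard_normal((4, 16000))   # 4 microphones, 1 s at 16 kHz (assumed)
    mel = rng.standard_normal((4, 64, 100))  # one Mel spectrogram per microphone (assumed)

    # Two independently augmented "views" of the same clip, as in SimCLR-style training.
    view_a, perm_a = swap_channels(wave, rng)
    view_b, perm_b = swap_channels(wave, rng)
    masked_mel, kept = mask_channels(mel, rng, p=0.25)
    print(perm_a, perm_b, kept)
```

In a contrastive setup, two such augmented views of the same recording would form a positive pair, while views of other recordings in the batch would act as negatives.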
Related papers
- Hybrid Convolutional and Attention Network for Hyperspectral Image Denoising [54.110544509099526]
Hyperspectral image (HSI) denoising is critical for the effective analysis and interpretation of hyperspectral data.
We propose a hybrid convolution and attention network (HCANet) to enhance HSI denoising.
Experimental results on mainstream HSI datasets demonstrate the rationality and effectiveness of the proposed HCANet.
arXiv Detail & Related papers (2024-03-15T07:18:43Z)
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- Decision Forest Based EMG Signal Classification with Low Volume Dataset Augmented with Random Variance Gaussian Noise [51.76329821186873]
We produce a model that can classify six different hand gestures with a limited number of samples that generalizes well to a wider audience.
We appeal to a set of more elementary methods such as the use of random bounds on a signal, but desire to show the power these methods can carry in an online setting.
arXiv Detail & Related papers (2022-06-29T23:22:18Z)
- Spatial mixup: Directional loudness modification as data augmentation for sound event localization and detection [9.0259157539478]
We propose Spatial Mixup as an application of parametric spatial audio effects for data augmentation.
Similar to beamforming, the modifications enhance or suppress signals arriving from certain directions, although the effect is less pronounced.
The method is evaluated with experiments in the DCASE 2021 Task 3 dataset, where spatial mixup increases performance over a non-augmented baseline.
arXiv Detail & Related papers (2021-10-12T16:16:58Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
We then use the pseudo-labels to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
arXiv Detail & Related papers (2021-04-01T07:48:29Z)
- SoundCLR: Contrastive Learning of Representations For Improved Environmental Sound Classification [0.6767885381740952]
SoundCLR is a supervised contrastive learning method for effective environment sound classification with state-of-the-art performance.
Due to the comparatively small sizes of the available environmental sound datasets, we propose and exploit a transfer learning and strong data augmentation pipeline.
Our experiments show that our masking based augmentation technique on the log-mel spectrograms can significantly improve the recognition performance.
arXiv Detail & Related papers (2021-03-02T18:42:45Z)
- Improved MVDR Beamforming Using LSTM Speech Models to Clean Spatial Clustering Masks [14.942060304734497]
Spatial clustering techniques can achieve significant multi-channel noise reduction across relatively arbitrary microphone configurations.
LSTM neural networks have successfully been trained to recognize speech from noise on single-channel inputs, but have difficulty taking full advantage of the information in multi-channel recordings.
This paper integrates these two approaches, training LSTM speech models to clean the masks generated by the Model-based EM Source Separation and Localization (MESSL) spatial clustering method.
arXiv Detail & Related papers (2020-12-02T22:35:00Z)
- Enhancement of Spatial Clustering-Based Time-Frequency Masks using LSTM Neural Networks [3.730592618611028]
We use LSTMs to enhance spatial clustering based time-frequency masks.
We achieve both the signal modeling performance of multiple single-channel LSTM-DNN speech enhancers and the signal separation performance of multi-channel spatial clustering.
We evaluate the intelligibility of the output of each system using word error rate from a Kaldi automatic speech recognizer.
arXiv Detail & Related papers (2020-12-02T22:29:29Z)
- Utterance Clustering Using Stereo Audio Channels [0.3656826837859034]
This study aims to improve the performance of utterance clustering by processing multichannel (stereo) audio signals.
Experiments with real audio recordings of multi-person discussion sessions showed that the proposed method achieved significantly better performance than a conventional method.
arXiv Detail & Related papers (2020-09-10T18:25:33Z)
- Improving Stability of LS-GANs for Audio and Speech Signals [70.15099665710336]
We show that encoding departure from normality computed in this vector space into the generator optimization formulation helps to craft more comprehensive spectrograms.
We demonstrate the effectiveness of binding this metric for enhancing stability in training with less mode collapse compared to baseline GANs.
arXiv Detail & Related papers (2020-08-12T17:41:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.