Contrastive Environmental Sound Representation Learning
- URL: http://arxiv.org/abs/2207.08825v1
- Date: Mon, 18 Jul 2022 16:56:30 GMT
- Title: Contrastive Environmental Sound Representation Learning
- Authors: Peter Ochieng, Dennis Kaburu
- Abstract summary: We exploit a self-supervised contrastive technique and a shallow 1D CNN to extract distinctive audio features (audio representations) without using any explicit annotations. We generate representations of a given audio clip from both its raw waveform and its spectrogram and evaluate whether the proposed learner is agnostic to the type of audio input.
- Score: 6.85316573653194
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine hearing of environmental sound is an important problem in the audio recognition domain. It gives a machine the ability to discriminate between different input sounds, which guides its decision making. In this work we exploit a self-supervised contrastive technique and a shallow 1D CNN to extract distinctive audio features (audio representations) without using any explicit annotations. We generate representations of a given audio clip from both its raw waveform and its spectrogram and evaluate whether the proposed learner is agnostic to the type of audio input. We further use canonical correlation analysis (CCA) to fuse the representations from the two input types and demonstrate that the fused global feature yields a more robust representation of the audio signal than either individual representation. The proposed technique is evaluated on both ESC-50 and UrbanSound8K. The results show that it extracts most features of the environmental audio and yields improvements of 12.8% and 0.9% on the ESC-50 and UrbanSound8K datasets, respectively.
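The abstract does not give implementation details, but the recipe it names (contrastive self-supervision over a shallow 1D CNN encoder on raw waveforms) is well established. Below is a minimal PyTorch sketch of that recipe, assuming a SimCLR-style NT-Xent loss; the layer sizes, augmentations, and temperature are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of SimCLR-style contrastive pretraining with a shallow
# 1D CNN encoder on raw waveforms. Architecture, augmentations, and
# hyperparameters are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Shallow1DCNN(nn.Module):
    """A small 1D CNN that maps a raw waveform to a normalized embedding."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=64, stride=16), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # global average pooling over time
        )
        self.proj = nn.Linear(64, emb_dim)  # projection head

    def forward(self, x):                      # x: (batch, 1, samples)
        h = self.features(x).squeeze(-1)       # (batch, 64)
        return F.normalize(self.proj(h), dim=1)

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss: views of the same clip are positives, rest negatives."""
    n = z1.size(0)
    z = torch.cat([z1, z2], dim=0)             # (2n, d)
    sim = z @ z.t() / temperature              # cosine sims (z is normalized)
    sim.fill_diagonal_(float('-inf'))          # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Usage: two stochastic "views" of each clip (toy augmentations here).
encoder = Shallow1DCNN()
wave = torch.randn(16, 1, 16000)               # 1-second clips at 16 kHz
view1 = wave + 0.01 * torch.randn_like(wave)   # additive noise
view2 = torch.roll(wave, shifts=800, dims=-1)  # circular time shift
loss = nt_xent_loss(encoder(view1), encoder(view2))
loss.backward()
```

Each clip contributes two augmented views; the loss pulls the two views of the same clip together and pushes apart views of different clips in the batch, so no labels are needed.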
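For the CCA fusion step, one common realization is to project the waveform and spectrogram embeddings onto their maximally correlated directions and concatenate the projections as the fused global feature. The sketch below uses scikit-learn's CCA on random placeholder embeddings; treating concatenation as the fusion operator is an assumption, since the abstract only states that CCA is used to fuse the two representations.

```python
# Minimal sketch of fusing waveform- and spectrogram-based embeddings with
# canonical correlation analysis (CCA). The concatenation of the projected
# views is an assumption about the fusion step, not a confirmed detail.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
wave_emb = rng.normal(size=(2000, 128))  # embeddings from the waveform encoder
spec_emb = rng.normal(size=(2000, 128))  # embeddings from the spectrogram encoder

cca = CCA(n_components=64)
cca.fit(wave_emb, spec_emb)              # learn maximally correlated projections
wave_c, spec_c = cca.transform(wave_emb, spec_emb)

# Fused global feature: concatenate the two projected, correlated views.
fused = np.concatenate([wave_c, spec_c], axis=1)   # shape (2000, 128)
```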
Related papers
- AdVerb: Visually Guided Audio Dereverberation (2023-08-23)
  We present AdVerb, a novel audio-visual dereverberation framework that uses visual cues in addition to the reverberant sound to estimate clean audio.
- Self-Supervised Visual Acoustic Matching (2023-07-27)
  Acoustic matching aims to re-synthesize an audio clip so that it sounds as if it were recorded in a target acoustic environment. We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio. Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment via a conditional GAN framework and a novel metric.
- XAI-based Comparison of Input Representations for Audio Event Classification (2023-04-27)
  We leverage eXplainable AI (XAI) to understand the underlying classification strategies of models trained on different input representations. Specifically, we compare two model architectures with regard to the input features they use for audio event detection.
- Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining (2023-04-07)
  This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample. The proposed two-stage method uses contrastive learning to pretrain the audio representation model. Experiments show that our method outperforms state-of-the-art methods based on contrastive learning or self-supervised classification.
- Joint Speech Recognition and Audio Captioning (2022-02-03)
  Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources. We aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied task of automatic speech recognition (ASR). We propose several approaches for end-to-end joint modeling of the ASR and AAC tasks.
- Learning Audio-Visual Dereverberation (2021-06-14)
  Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception but also severely impacts the accuracy of automatic speech recognition. Our idea is to learn to dereverberate speech from audio-visual observations. We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and the visual scene.
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation (2020-08-21)
  Speech enhancement and speech separation are two related tasks. Traditionally, they have been tackled with signal processing and machine learning techniques; more recently, deep learning has been exploited to achieve strong performance.
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision (2020-07-08)
  We propose a method to learn self-supervised speech representations from the raw audio waveform. We train a raw audio encoder by combining audio-only self-supervision (predicting informative audio attributes) with visual self-supervision (generating talking faces from audio). Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
- Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data (2020-05-29)
  We present an audiovisual fusion model that learns to recognize sounds from weakly labeled video recordings. Experiments on the large-scale sound events dataset AudioSet demonstrate the efficacy of the proposed model.