Learning to Detect Novel and Fine-Grained Acoustic Sequences Using
Pretrained Audio Representations
- URL: http://arxiv.org/abs/2305.02382v1
- Date: Wed, 3 May 2023 18:41:24 GMT
- Title: Learning to Detect Novel and Fine-Grained Acoustic Sequences Using
Pretrained Audio Representations
- Authors: Vasudha Kowtha, Miquel Espi Marques, Jonathan Huang, Yichi Zhang,
Carlos Avendano
- Abstract summary: We develop procedures for pretraining suitable representations, and methods which transfer them to our few shot learning scenario.
Our experiments evaluate the general purpose utility of our pretrained representations on AudioSet.
Our pretrained embeddings are suitable to the proposed task, and enable multiple aspects of our few shot framework.
- Score: 17.043435238200605
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work investigates pretrained audio representations for few shot Sound
Event Detection. We specifically address the task of few shot detection of
novel acoustic sequences, or sound events with semantically meaningful temporal
structure, without assuming access to non-target audio. We develop procedures
for pretraining suitable representations, and methods which transfer them to
our few shot learning scenario. Our experiments evaluate the general purpose
utility of our pretrained representations on AudioSet, and the utility of
proposed few shot methods via tasks constructed from real-world acoustic
sequences. Our pretrained embeddings are suitable to the proposed task, and
enable multiple aspects of our few shot framework.
Related papers
- Learning Audio Concepts from Counterfactual Natural Language [34.118579918018725]
This study introduces causal reasoning and counterfactual analysis in the audio domain.
Our model considers acoustic characteristics and sound source information from human-annotated reference texts.
Specifically, the top-1 accuracy in the open-ended language-based audio retrieval task increased by more than 43%.
arXiv Detail & Related papers (2024-01-10T05:15:09Z)
- Pretraining Representations for Bioacoustic Few-shot Detection using Supervised Contrastive Learning [10.395255631261458]
In bioacoustic applications, most tasks come with few labelled training data, because annotating long recordings is time consuming and costly.
We show that learning a rich feature extractor from scratch can be achieved by leveraging data augmentation using a supervised contrastive learning framework.
We obtain an F-score of 63.46% on the validation set and 42.7% on the test set, ranking second in the DCASE challenge.
arXiv Detail & Related papers (2023-09-02T09:38:55Z)
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the groundtruth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- Audiovisual Masked Autoencoders [93.22646144125457]
We show that we can achieve significant improvements on audiovisual downstream classification tasks.
We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens.
arXiv Detail & Related papers (2022-12-09T17:34:53Z)
- Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds [118.54908665440826]
Humans can robustly recognize and localize objects by using visual and/or auditory cues.
This work develops an approach for scene understanding purely based on sounds.
The co-existence of visual and audio cues is leveraged for supervision transfer.
arXiv Detail & Related papers (2021-09-06T22:24:00Z)
- Audiovisual transfer learning for audio tagging and sound event detection [21.574781022415372]
We study the merit of transfer learning for two sound recognition problems, i.e., audio tagging and sound event detection.
We adapt a baseline system utilizing only spectral acoustic inputs to make use of pretrained auditory and visual features.
We perform experiments with these modified models on an audiovisual multi-label data set.
arXiv Detail & Related papers (2021-06-09T21:55:05Z)
- Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be utilized as free supervised information to train a neural network.
We propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos.
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z)
- Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
- Foreground-Background Ambient Sound Scene Separation [0.0]
We propose a deep learning-based separation framework with a suitable feature normalization scheme and an optional auxiliary network capturing the background statistics.
We conduct extensive experiments with mixtures of seen or unseen sound classes at various signal-to-noise ratios.
arXiv Detail & Related papers (2020-05-11T06:59:46Z)
- Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds [106.87299276189458]
Humans can robustly recognize and localize objects by integrating visual and auditory cues.
This work develops an approach for dense semantic labelling of sound-making objects, purely based on sounds.
arXiv Detail & Related papers (2020-03-09T15:49:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.