Multi-label Sound Event Retrieval Using a Deep Learning-based Siamese
Structure with a Pairwise Presence Matrix
- URL: http://arxiv.org/abs/2002.09026v1
- Date: Thu, 20 Feb 2020 21:33:07 GMT
- Title: Multi-label Sound Event Retrieval Using a Deep Learning-based Siamese
Structure with a Pairwise Presence Matrix
- Authors: Jianyu Fan, Eric Nichols, Daniel Tompkins, Ana Elisa Mendez Mendez,
Benjamin Elizalde, and Philippe Pasquier
- Abstract summary: State-of-the-art sound event retrieval models have focused on single-label audio recordings.
We propose different Deep Learning architectures with a Siamese structure and a Pairwise Presence Matrix.
The networks are trained and evaluated using the SONYC-UST dataset containing both single- and multi-label soundscape recordings.
- Score: 11.54047475139282
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Realistic recordings of soundscapes often have multiple sound events
co-occurring, such as car horns, engines, and human voices. Sound event retrieval
is a type of content-based search that aims to find audio samples similar to an
audio query based on their acoustic or semantic content. State-of-the-art sound
event retrieval models have focused on single-label audio recordings, in which
only one sound event occurs, rather than on multi-label audio recordings (i.e.,
recordings in which multiple sound events occur). To address this latter
problem, we propose different Deep Learning architectures with a Siamese
structure and a Pairwise Presence Matrix. The networks are trained and
evaluated using the SONYC-UST dataset, which contains both single- and
multi-label soundscape recordings. The performance results show the
effectiveness of our proposed model.
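To make the setup concrete, here is a minimal PyTorch sketch of a Siamese pair trained against a pairwise presence matrix. The toy encoder, the class count, and the assumption that the matrix is the outer product of the two recordings' binary label vectors are illustrative guesses, not the paper's exact design.

```python
# A minimal sketch, NOT the paper's exact architecture: two inputs share one
# encoder (Siamese structure), and the training target is a pairwise presence
# matrix, assumed here to be the outer product of the pair's binary label
# vectors (entry (i, j) = 1 iff class i is in clip A and class j is in clip B).
import torch
import torch.nn as nn

NUM_CLASSES = 8  # e.g. a coarse urban-sound tag set such as SONYC-UST's

class Encoder(nn.Module):
    """Toy CNN mapping a (1, mels, frames) log-mel spectrogram to class scores."""
    def __init__(self, num_classes: int = NUM_CLASSES):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, num_classes),
        )

    def forward(self, x):
        return self.net(x)

def presence_matrix(y_a, y_b):
    """(batch, C) x (batch, C) -> (batch, C, C) outer product per pair."""
    return y_a.unsqueeze(2) * y_b.unsqueeze(1)

encoder = Encoder()  # the SAME module scores both sides of the pair
x_a = torch.randn(4, 1, 64, 100)
x_b = torch.randn(4, 1, 64, 100)
y_a = torch.randint(0, 2, (4, NUM_CLASSES)).float()  # multi-label ground truth
y_b = torch.randint(0, 2, (4, NUM_CLASSES)).float()

pred = presence_matrix(torch.sigmoid(encoder(x_a)), torch.sigmoid(encoder(x_b)))
target = presence_matrix(y_a, y_b)  # the pairwise presence matrix
loss = nn.functional.binary_cross_entropy(pred, target)
loss.backward()
```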
Related papers
- AudioFormer: Audio Transformer learns audio feature representations from
discrete acoustic codes [6.375996974877916]
We propose a method named AudioFormer, which learns audio feature representations through the acquisition of discrete acoustic codes.
Our results demonstrate that AudioFormer attains significantly improved performance compared to prevailing monomodal audio classification models (a generic sketch of the discrete-code idea follows this entry).
arXiv Detail & Related papers (2023-08-14T15:47:25Z)
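A generic sketch of the discrete-code idea mentioned above: quantize audio into integer code tokens, embed them, and classify with a Transformer encoder. This is not AudioFormer's actual architecture; the codebook size, sequence length, and pooling head are assumptions for illustration.

```python
# A generic sketch of classifying audio via discrete acoustic codes, in the
# spirit of AudioFormer but NOT its actual architecture; the codebook size,
# sequence length, and mean-pooling head are assumptions for illustration.
import torch
import torch.nn as nn

CODEBOOK, SEQ_LEN, DIM, CLASSES = 1024, 256, 128, 50

class CodeClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(CODEBOOK, DIM)  # discrete codes -> vectors
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, CLASSES)

    def forward(self, codes):                     # codes: (batch, seq) int64
        h = self.encoder(self.embed(codes))
        return self.head(h.mean(dim=1))           # mean-pool over time

# In practice the codes would come from a neural audio codec; here they are random.
codes = torch.randint(0, CODEBOOK, (8, SEQ_LEN))
logits = CodeClassifier()(codes)
print(logits.shape)  # torch.Size([8, 50])
```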
- Separate Anything You Describe [55.0784713558149]
Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA).
AudioSep is a foundation model for open-domain audio source separation with natural language queries.
arXiv Detail & Related papers (2023-08-09T16:09:44Z)
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- Epic-Sounds: A Large-scale Dataset of Actions That Sound [64.24297230981168]
Epic-Sounds is a large-scale dataset of audio annotations capturing temporal extents and class labels.
We identify actions that can be discriminated purely from audio by grouping free-form descriptions of audio into classes.
Overall, Epic-Sounds includes 78.4k categorised segments of audible events and actions distributed across 44 classes, as well as 39.2k non-categorised segments.
arXiv Detail & Related papers (2023-02-01T18:19:37Z)
- Binaural Signal Representations for Joint Sound Event Detection and Acoustic Scene Classification [3.300149824239397]
Sound event detection (SED) and acoustic scene classification (ASC) are two widely researched audio tasks that constitute an important part of research on acoustic scene analysis.
Considering the shared information between sound events and acoustic scenes, performing both tasks jointly is a natural part of a complex machine listening system.
In this paper, we investigate the usefulness of several spatial audio features in training a joint deep neural network (DNN) model performing SED and ASC (a minimal multi-task sketch follows this entry).
arXiv Detail & Related papers (2022-09-13T11:29:00Z)
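A minimal multi-task sketch of the joint SED and ASC setup described above: a shared recurrent encoder feeds a frame-level event head and a clip-level scene head, trained with a summed loss. This is a generic illustration, not the paper's model; in the paper, the input features would be spatial/binaural rather than the plain log-mel frames assumed here.

```python
# A generic multi-task sketch of joint SED + ASC (shared encoder, two heads).
# Feature shapes and class counts are assumptions for illustration.
import torch
import torch.nn as nn

EVENTS, SCENES = 10, 5

class JointSedAsc(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = nn.GRU(input_size=64, hidden_size=64, batch_first=True)
        self.sed_head = nn.Linear(64, EVENTS)   # frame-level event activity
        self.asc_head = nn.Linear(64, SCENES)   # clip-level scene logits

    def forward(self, feats):                   # feats: (batch, frames, mels)
        h, _ = self.shared(feats)
        sed = self.sed_head(h)                  # (batch, frames, EVENTS)
        asc = self.asc_head(h.mean(dim=1))      # (batch, SCENES)
        return sed, asc

model = JointSedAsc()
feats = torch.randn(4, 100, 64)                             # e.g. log-mel frames
sed_target = torch.randint(0, 2, (4, 100, EVENTS)).float()  # event on/off per frame
asc_target = torch.randint(0, SCENES, (4,))                 # one scene per clip

sed, asc = model(feats)
loss = (nn.functional.binary_cross_entropy_with_logits(sed, sed_target)
        + nn.functional.cross_entropy(asc, asc_target))     # summed multi-task loss
loss.backward()
```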
- Visual Acoustic Matching [92.91522122739845]
We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment.
Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials.
arXiv Detail & Related papers (2022-02-14T17:05:22Z)
- Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural (two-channel) audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z)
- Exploiting Audio-Visual Consistency with Partial Supervision for Spatial Audio Generation [45.526051369551915]
We propose an audio spatialization framework to convert a monaural video into a binaural one by exploiting the relationship between the audio and visual components.
Experiments on benchmark datasets confirm the effectiveness of our proposed framework in both semi-supervised and fully supervised scenarios.
arXiv Detail & Related papers (2021-05-03T09:34:11Z)
- Visually Informed Binaural Audio Generation without Binaural Audios [130.80178993441413]
We propose PseudoBinaural, an effective pipeline that is free of binaural recordings.
We leverage spherical harmonic decomposition and the head-related impulse response (HRIR) to identify the relationship between spatial locations and received binaural audio.
Our recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance under subjective preference (a minimal HRIR-convolution sketch follows this entry).
arXiv Detail & Related papers (2021-04-13T13:07:33Z)
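For readers unfamiliar with HRIRs, the sketch below shows the generic mechanism the PseudoBinaural summary alludes to: convolving a mono signal with a left/right head-related impulse response pair yields binaural audio for that source direction. This illustrates HRIR convolution only, not the paper's full pipeline (which also uses spherical-harmonic decomposition), and the HRIRs here are made up.

```python
# A minimal, generic sketch of binaural rendering via HRIR convolution — the
# general mechanism the summary alludes to, NOT the PseudoBinaural pipeline.
import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    """Convolve a mono signal with a left/right HRIR pair to place the source
    at the HRIRs' direction. Returns a (2, n) binaural array."""
    left = np.convolve(mono, hrir_left)[: len(mono)]
    right = np.convolve(mono, hrir_right)[: len(mono)]
    return np.stack([left, right])

# Toy example: white-noise source and made-up 64-tap HRIRs.
rng = np.random.default_rng(0)
mono = rng.standard_normal(16000)
binaural = render_binaural(mono, rng.standard_normal(64), rng.standard_normal(64))
print(binaural.shape)  # (2, 16000)
```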
- COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations [32.456824945999465]
We propose a method for learning audio representations by aligning the learned latent representations of audio and associated tags.
We evaluate the quality of our embedding model, measuring its performance as a feature extractor on three different tasks (a minimal alignment sketch follows this entry).
arXiv Detail & Related papers (2020-06-15T13:17:18Z)
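A minimal sketch of the co-alignment idea in COALA's summary: two autoencoders, one for audio features and one for tag vectors, with an extra loss term pulling their latent codes together. The MSE alignment term and all dimensions are assumptions; the paper's actual objective (e.g., a contrastive loss) may differ.

```python
# A minimal sketch of co-aligned autoencoders in the spirit of COALA, assuming
# an MSE alignment term between the audio and tag latents on top of the two
# reconstruction losses; COALA's actual objective may differ.
import torch
import torch.nn as nn

LATENT = 32

class AE(nn.Module):
    """Tiny autoencoder exposing its latent code."""
    def __init__(self, dim):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, LATENT))
        self.dec = nn.Sequential(nn.Linear(LATENT, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

audio_ae, tag_ae = AE(dim=128), AE(dim=50)   # e.g. mel statistics and tag vectors
audio, tags = torch.randn(16, 128), torch.rand(16, 50)

audio_rec, z_audio = audio_ae(audio)
tag_rec, z_tag = tag_ae(tags)

mse = nn.functional.mse_loss
loss = (mse(audio_rec, audio)   # audio reconstruction
        + mse(tag_rec, tags)    # tag reconstruction
        + mse(z_audio, z_tag))  # co-alignment of the two latent spaces
loss.backward()
```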