Foreground-Background Ambient Sound Scene Separation
- URL: http://arxiv.org/abs/2005.07006v2
- Date: Mon, 27 Jul 2020 14:00:00 GMT
- Title: Foreground-Background Ambient Sound Scene Separation
- Authors: Michel Olvera (MULTISPEECH), Emmanuel Vincent (MULTISPEECH), Romain
Serizel (MULTISPEECH), Gilles Gasso (LITIS)
- Abstract summary: We propose a deep learning-based separation framework with a suitable feature normalization scheme and an optional auxiliary network capturing the background statistics.
We conduct extensive experiments with mixtures of seen or unseen sound classes at various signal-to-noise ratios.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ambient sound scenes typically comprise multiple short events occurring on
top of a somewhat stationary background. We consider the task of separating
these events from the background, which we call foreground-background ambient
sound scene separation. We propose a deep learning-based separation framework
with a suitable feature normalization scheme and an optional auxiliary network
capturing the background statistics, and we investigate its ability to handle
the great variety of sound classes encountered in ambient sound scenes, which
have often not been seen in training. To do so, we create single-channel
foreground-background mixtures using isolated sounds from the DESED and
Audioset datasets, and we conduct extensive experiments with mixtures of seen
or unseen sound classes at various signal-to-noise ratios. Our experimental
findings demonstrate the generalization ability of the proposed approach.
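To make the setup above concrete, here is a minimal Python sketch (not the authors' implementation) of two ingredients the abstract mentions: mixing an isolated foreground event with a stationary background at a chosen signal-to-noise ratio, and passing per-band-normalized log-spectrogram features to a small mask-predicting network. The framing, the normalization choice, the BLSTM architecture, and all function and parameter names are illustrative assumptions; the optional auxiliary network capturing background statistics is omitted for brevity.

```python
# Illustrative sketch only: a stand-in for a foreground/background separation
# front-end, not the code from the paper.
import numpy as np
import torch
import torch.nn as nn


def mix_at_snr(foreground, background, snr_db):
    """Scale the foreground so the foreground-to-background SNR equals snr_db."""
    fg_power = np.mean(foreground ** 2) + 1e-12
    bg_power = np.mean(background ** 2) + 1e-12
    gain = np.sqrt(bg_power / fg_power * 10.0 ** (snr_db / 10.0))
    return gain * foreground + background


def per_band_normalize(log_spec, eps=1e-8):
    """Normalize each frequency band to zero mean / unit variance over time."""
    mean = log_spec.mean(axis=1, keepdims=True)
    std = log_spec.std(axis=1, keepdims=True)
    return (log_spec - mean) / (std + eps)


class MaskNet(nn.Module):
    """Tiny BLSTM that predicts a time-frequency mask for the foreground."""

    def __init__(self, n_freq=257, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(n_freq, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_freq)

    def forward(self, feats):  # feats: (batch, time, freq)
        h, _ = self.rnn(feats)
        return torch.sigmoid(self.out(h))  # mask values in [0, 1]


if __name__ == "__main__":
    sr = 16000
    rng = np.random.default_rng(0)
    foreground = rng.standard_normal(sr)        # stand-in for an isolated sound event
    background = 0.1 * rng.standard_normal(sr)  # stand-in for a stationary background
    mixture = mix_at_snr(foreground, background, snr_db=0.0)

    # Frame the mixture, take a magnitude spectrum, and normalize per band.
    frames = np.reshape(mixture[: sr // 512 * 512], (-1, 512))
    log_spec = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8).T  # (freq, time)
    feats = per_band_normalize(log_spec)

    net = MaskNet(n_freq=feats.shape[0])
    mask = net(torch.from_numpy(feats.T[None]).float())  # (1, time, freq)
    print("predicted mask shape:", tuple(mask.shape))
```

In the paper's setting, such a mask would be applied to the mixture spectrogram to recover the foreground events, with training pairs generated from isolated DESED and Audioset sounds mixed at various signal-to-noise ratios.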
Related papers
- Sound event localization and classification using WASN in Outdoor Environment [2.234738672139924]
Methods for sound event localization and classification typically rely on a single microphone array.
We propose a deep learning-based method that employs multiple features and attention mechanisms to estimate the location and class of sound sources.
arXiv Detail & Related papers (2024-03-29T11:44:14Z) - Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z) - Visual Sound Localization in the Wild by Cross-Modal Interference
Erasing [90.21476231683008]
In real-world scenarios, audio is usually contaminated by off-screen sound and background noise.
We propose the Interference Eraser (IEr) framework, which tackles the problem of audio-visual sound source localization in the wild.
arXiv Detail & Related papers (2022-02-13T21:06:19Z) - Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z) - Self-Supervised Learning from Automatically Separated Sound Scenes [38.71803524843168]
This paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into semantically-linked views.
We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches.
arXiv Detail & Related papers (2021-05-05T15:37:17Z) - Visually Informed Binaural Audio Generation without Binaural Audios [130.80178993441413]
We propose PseudoBinaural, an effective pipeline that is free of binaural recordings.
We leverage spherical harmonic decomposition and head-related impulse response (HRIR) to identify the relationship between spatial locations and received audio.
Our recording-free pipeline shows great stability in cross-dataset evaluation and achieves comparable performance under subjective preference.
arXiv Detail & Related papers (2021-04-13T13:07:33Z) - Cyclic Co-Learning of Sounding Object Visual Grounding and Sound
Separation [52.550684208734324]
We propose a cyclic co-learning paradigm that can jointly learn sounding object visual grounding and audio-visual sound separation.
In this paper, we show that the proposed framework outperforms the compared recent approaches on both tasks.
arXiv Detail & Related papers (2021-04-05T17:30:41Z) - Weakly-supervised Audio-visual Sound Source Detection and Separation [38.52168086518221]
We propose an audio-visual co-segmentation approach, where the network learns both what individual objects look like and how they sound.
We introduce weakly-supervised object segmentation in the context of sound separation.
Our architecture can be learned in an end-to-end manner and requires no additional supervision or bounding box proposals.
arXiv Detail & Related papers (2021-03-25T10:17:55Z) - Data Fusion for Audiovisual Speaker Localization: Extending Dynamic
Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z) - Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of
On-Screen Sounds [33.4237979175049]
We present AudioScope, a novel audio-visual sound separation framework.
It can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos.
We demonstrate the effectiveness of our approach using a dataset of video clips extracted from open-domain YFCC100m video data.
arXiv Detail & Related papers (2020-11-02T17:36:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences of its use.