Spatial mixup: Directional loudness modification as data augmentation
for sound event localization and detection
- URL: http://arxiv.org/abs/2110.06126v1
- Date: Tue, 12 Oct 2021 16:16:58 GMT
- Authors: Ricardo Falcon-Perez, Kazuki Shimada, Yuichiro Koyama, Shusuke
Takahashi, Yuki Mitsufuji
- Abstract summary: We propose Spatial Mixup as an application of parametric spatial audio effects for data augmentation.
Similar to beamforming, these modifications enhance or suppress signals arriving from certain directions, although the effect is less pronounced.
The method is evaluated on the DCASE 2021 Task 3 dataset, where Spatial Mixup increases performance over a non-augmented baseline.
- Score: 9.0259157539478
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data augmentation methods have shown great importance in diverse supervised
learning problems where labeled data is scarce or costly to obtain. For sound
event localization and detection (SELD) tasks several augmentation methods have
been proposed, with most borrowing ideas from other domains such as images,
speech, or monophonic audio. However, only a few exploit the spatial properties
of a full 3D audio scene. We propose Spatial Mixup, an application of
parametric spatial audio effects for data augmentation, which modifies the
directional properties of a multi-channel spatial audio signal encoded in the
ambisonics domain. Similarly to beamforming, these modifications enhance or
suppress signals arriving from certain directions, although the effect is less
pronounced, thereby enabling deep learning models to achieve invariance to
small spatial perturbations. The method is evaluated on the DCASE 2021 Task 3
dataset, where Spatial Mixup increases performance over a non-augmented
baseline and performs comparably to other well-known augmentation methods.
Furthermore, combining Spatial Mixup with other methods greatly improves
performance.
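The core operation the abstract describes (decode the ambisonic signal to a grid of directions, apply a per-direction loudness gain, then re-encode) can be sketched for first-order ambisonics as follows. This is a minimal illustration rather than the authors' implementation; the gain pattern, the random spherical grid, and the ACN/N3D channel convention are assumptions made for the example:

```python
import numpy as np

def sh_matrix(dirs):
    # First-order real spherical harmonics for unit directions (N, 3),
    # ACN channel order (W, Y, Z, X) with N3D normalization.
    x, y, z = dirs[:, 0], dirs[:, 1], dirs[:, 2]
    return np.stack(
        [np.ones_like(x), np.sqrt(3) * y, np.sqrt(3) * z, np.sqrt(3) * x],
        axis=1,
    )

def spatial_mixup_transform(dirs, gains):
    # Decode to plane waves at `dirs`, scale each direction by its gain,
    # and re-encode: T = Y^T diag(g) Y / N. With uniform gains this
    # approximates the identity, i.e. no spatial modification.
    Y = sh_matrix(dirs)                      # (n_dirs, 4)
    return (Y.T * gains) @ Y / len(dirs)     # (4, 4)

# Quasi-uniform sampling of the sphere via normalized Gaussian vectors.
rng = np.random.default_rng(0)
v = rng.normal(size=(5000, 3))
dirs = v / np.linalg.norm(v, axis=1, keepdims=True)

# A mild directional loudness pattern: boost the front (+x), attenuate
# the back, as a small perturbation rather than a sharp beamformer.
g = 1.0 + 0.3 * dirs[:, 0]
T = spatial_mixup_transform(dirs, g)

foa = rng.normal(size=(4, 48000))  # placeholder 4-channel FOA signal
augmented = T @ foa                # augmented copy for training
```

Because the transform is a single 4x4 matrix applied to the channels, it is cheap enough to draw a fresh random gain pattern per training example.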
Related papers
- Low-light Stereo Image Enhancement and De-noising in the Low-frequency
Information Enhanced Image Space [5.1569866461097185]
Methods are proposed to perform enhancement and de-noising simultaneously.
Low-frequency information enhanced module (IEM) is proposed to suppress noise and produce a new image space.
Cross-channel and spatial context information mining module (CSM) is proposed to encode long-range spatial dependencies.
An encoder-decoder structure is constructed, incorporating cross-view and cross-scale feature interactions.
arXiv Detail & Related papers (2024-01-15T15:03:32Z)
- Attention-Driven Multichannel Speech Enhancement in Moving Sound Source
Scenarios [11.811571392419324]
Speech enhancement algorithms typically assume a stationary sound source, a common mismatch with reality that limits their performance in real-world scenarios.
This paper focuses on attention-driven spatial filtering techniques designed for dynamic settings.
arXiv Detail & Related papers (2023-12-17T16:12:35Z)
- Exploring Self-Supervised Contrastive Learning of Spatial Sound Event
Representation [21.896817015593122]
MC-SimCLR learns joint spectral and spatial representations from unlabeled spatial audio.
We propose a multi-level data augmentation pipeline that augments different levels of audio features.
We find that linear layers on top of the learned representation significantly outperform supervised models in terms of both event classification accuracy and localization error.
arXiv Detail & Related papers (2023-09-27T18:23:03Z)
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the groundtruth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
- Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio
Detection [54.20974251478516]
We propose a continual learning algorithm for fake audio detection to overcome catastrophic forgetting.
When fine-tuning a detection network, our approach adaptively computes the direction of weight modification according to the ratio of genuine utterances and fake utterances.
Our method can easily be generalized to related fields, like speech emotion recognition.
arXiv Detail & Related papers (2023-08-07T05:05:49Z)
- Spectral Enhanced Rectangle Transformer for Hyperspectral Image
Denoising [64.11157141177208]
We propose a spectral enhanced rectangle Transformer to model the spatial and spectral correlation in hyperspectral images.
For the former, we exploit the rectangle self-attention horizontally and vertically to capture the non-local similarity in the spatial domain.
For the latter, we design a spectral enhancement module capable of extracting the global underlying low-rank property of spatial-spectral cubes to suppress noise.
arXiv Detail & Related papers (2023-04-03T09:42:13Z)
- Blind Room Parameter Estimation Using Multiple-Multichannel Speech
Recordings [37.145413836886455]
Knowing the geometrical and acoustical parameters of a room may benefit applications such as audio augmented reality, speech dereverberation or audio forensics.
We study the problem of jointly estimating the total surface area, the volume, as well as the frequency-dependent reverberation time and mean surface absorption of a room.
A novel convolutional neural network architecture leveraging both single- and inter-channel cues is proposed and trained on a large, realistic simulated dataset.
arXiv Detail & Related papers (2021-07-29T08:51:49Z)
- PILOT: Introducing Transformers for Probabilistic Sound Event
Localization [107.78964411642401]
This paper introduces a novel transformer-based sound event localization framework, where temporal dependencies in the received multi-channel audio signals are captured via self-attention mechanisms.
The framework is evaluated on three publicly available multi-source sound event localization datasets and compared against state-of-the-art methods in terms of localization error and event detection accuracy.
arXiv Detail & Related papers (2021-06-07T18:29:19Z)
- Improved MVDR Beamforming Using LSTM Speech Models to Clean Spatial
Clustering Masks [14.942060304734497]
Spatial clustering techniques can achieve significant multi-channel noise reduction across relatively arbitrary microphone configurations.
LSTM neural networks have successfully been trained to recognize speech from noise on single-channel inputs, but have difficulty taking full advantage of the information in multi-channel recordings.
This paper integrates these two approaches, training LSTM speech models to clean the masks generated by the Model-based EM Source Separation and Localization (MESSL) spatial clustering method.
arXiv Detail & Related papers (2020-12-02T22:35:00Z)
- DecAug: Augmenting HOI Detection via Decomposition [54.65572599920679]
Current algorithms suffer from insufficient training samples and category imbalance within datasets.
We propose an efficient and effective data augmentation method called DecAug for HOI detection.
Experiments show that our method brings up to 3.3 mAP and 1.6 mAP improvements on the V-COCO and HICO-DET datasets.
arXiv Detail & Related papers (2020-10-02T13:59:05Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End
Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.