DCASE 2021 Task 3: Spectrotemporally-aligned Features for Polyphonic
Sound Event Localization and Detection
- URL: http://arxiv.org/abs/2106.15190v1
- Date: Tue, 29 Jun 2021 09:18:30 GMT
- Title: DCASE 2021 Task 3: Spectrotemporally-aligned Features for Polyphonic
Sound Event Localization and Detection
- Authors: Thi Ngoc Tho Nguyen and Karn Watcharasupat and Ngoc Khanh Nguyen and
Douglas L. Jones and Woon Seng Gan
- Abstract summary: We propose a novel feature called spatial cue-augmented log-spectrogram (SALSA) with exact time-frequency mapping between the signal power and the source direction-of-arrival.
We show that the deep learning-based models trained on this new feature outperformed the DCASE challenge baseline by a large margin.
- Score: 16.18806719313959
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sound event localization and detection consists of two subtasks which are
sound event detection and direction-of-arrival estimation. While sound event
detection mainly relies on time-frequency patterns to distinguish different
sound classes, direction-of-arrival estimation uses magnitude or phase
differences between microphones to estimate source directions. Therefore, it is
often difficult to train these two subtasks jointly. We propose
a novel feature called spatial cue-augmented log-spectrogram (SALSA) with exact
time-frequency mapping between the signal power and the source
direction-of-arrival. The feature includes multichannel log-spectrograms
stacked along with the estimated direct-to-reverberant ratio and a normalized
version of the principal eigenvector of the spatial covariance matrix at each
time-frequency bin on the spectrograms. Experimental results on the DCASE 2021
dataset for sound event localization and detection with directional
interference showed that the deep learning-based models trained on this new
feature outperformed the DCASE challenge baseline by a large margin. We
combined several models with slightly different architectures that were trained
on the new feature to further improve the system performance for the DCASE
sound event localization and detection challenge.
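As a rough illustration of how such a spectrotemporally-aligned feature can be assembled, the Python/NumPy sketch below stacks per-channel log-power spectrograms with the phase of the principal eigenvector of a locally averaged spatial covariance matrix at each time-frequency bin. This is a minimal sketch under our own naming, not the authors' implementation; the direct-to-reverberant ratio channel and the exact eigenvector normalization used in SALSA are omitted.

    import numpy as np

    def salsa_like_features(stft, win=3, eps=1e-8):
        # stft: complex STFT of shape (M, T, F) -- M microphones,
        # T frames, F frequency bins.
        M, T, F = stft.shape
        # Sound event detection cue: per-channel log-power spectrograms.
        log_spec = np.log(np.abs(stft) ** 2 + eps)
        # Direction-of-arrival cue: principal eigenvector of the spatial
        # covariance matrix, estimated per TF bin from a short window of
        # snapshots so that it aligns exactly with the spectrograms.
        eig_phase = np.zeros((M, T, F))
        for t in range(T):
            lo, hi = max(0, t - win), min(T, t + win + 1)
            for f in range(F):
                X = stft[:, lo:hi, f]                # (M, frames) snapshots
                R = X @ X.conj().T / X.shape[1]      # covariance estimate
                _, V = np.linalg.eigh(R)             # eigenvalues ascending
                v = V[:, -1]                         # principal eigenvector
                v = v / (v[0] + eps)                 # fix phase ambiguity
                eig_phase[:, t, f] = np.angle(v)     # inter-channel phases
        return np.concatenate([log_spec, eig_phase], axis=0)  # (2M, T, F)

Averaging snapshots over a few neighboring frames keeps the covariance estimate full-rank, and normalizing the eigenvector to the first channel removes its arbitrary global phase; for a four-channel array, the result is an 8 x T x F input tensor for the detection network.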
Related papers
- Path-adaptive Spatio-Temporal State Space Model for Event-based Recognition with Arbitrary Duration [9.547947845734992]
Event cameras are bio-inspired sensors that capture the intensity changes asynchronously and output event streams.
We present a novel framework, dubbed PAST-Act, exhibiting superior capacity in recognizing events with arbitrary duration.
We also build a minute-level event-based recognition dataset, named ArDVS100, with arbitrary duration for the benefit of the community.
arXiv Detail & Related papers (2024-09-25T14:08:37Z)
- Multimodal Attention-Enhanced Feature Fusion-based Weakly Supervised Anomaly Violence Detection [1.9223495770071632]
This system uses three feature streams: RGB video, optical flow, and audio signals, where each stream extracts complementary spatial and temporal features.
The system significantly improves anomaly detection accuracy and robustness across three datasets.
arXiv Detail & Related papers (2024-09-17T14:17:52Z)
- STMixer: A One-Stage Sparse Action Detector [48.0614066856134]
We propose a new one-stage action detector, termed STMixer.
We present a query-based adaptive feature sampling module, which endows our STMixer with the flexibility of mining a set of discriminative video features.
We obtain state-of-the-art results on the AVA, UCF101-24, and JHMDB datasets.
arXiv Detail & Related papers (2023-03-28T10:47:06Z)
- Complete Cross-triplet Loss in Label Space for Audio-visual Cross-modal Retrieval [7.459223771397159]
Cross-modal data (e.g. audiovisual) have different distributions and representations that cannot be directly compared.
To bridge the gap between audiovisual modalities, we learn a common subspace for them by utilizing the intrinsic correlation in the natural synchronization of audio-visual data with the aid of annotated labels.
We propose a new AV-CMR model to optimize semantic features by directly predicting labels and then measuring the intrinsic correlation between audio-visual data using a complete cross-triplet loss.
arXiv Detail & Related papers (2022-11-07T10:37:14Z)
- Deep Spectro-temporal Artifacts for Detecting Synthesized Speech [57.42110898920759]
This paper provides an overall assessment of track 1 (Low-quality Fake Audio Detection) and track 2 (Partially Fake Audio Detection).
In this paper, spectro-temporal artifacts were detected using raw temporal signals, spectral features, and deep embedding features.
We ranked 4th and 5th in track 1 and track 2, respectively.
arXiv Detail & Related papers (2022-10-11T08:31:30Z)
- A benchmark of state-of-the-art sound event detection systems evaluated on synthetic soundscapes [10.512055210540668]
We study the solutions proposed by participants to analyze their robustness to varying target-to-non-target signal-to-noise ratios and to the temporal localization of target sound events.
Results show that systems tend to spuriously predict short events when non-target events are present.
arXiv Detail & Related papers (2022-02-03T09:41:31Z)
- SoundDet: Polyphonic Sound Event Detection and Localization from Raw Waveform [48.68714598985078]
SoundDet is an end-to-end trainable and light-weight framework for polyphonic moving sound event detection and localization.
SoundDet directly consumes the raw, multichannel waveform and treats the temporal sound event as a complete "sound-object" to be detected.
A dense sound event proposal map is then constructed to handle the challenge of predicting events with widely varying temporal durations.
arXiv Detail & Related papers (2021-06-13T11:43:41Z)
- PILOT: Introducing Transformers for Probabilistic Sound Event Localization [107.78964411642401]
This paper introduces a novel transformer-based sound event localization framework, where temporal dependencies in the received multi-channel audio signals are captured via self-attention mechanisms.
The framework is evaluated on three publicly available multi-source sound event localization datasets and compared against state-of-the-art methods in terms of localization error and event detection accuracy.
arXiv Detail & Related papers (2021-06-07T18:29:19Z)
- Exploiting Attention-based Sequence-to-Sequence Architectures for Sound Event Localization [113.19483349876668]
This paper proposes a novel approach to sound event localization by utilizing an attention-based sequence-to-sequence model.
It yields superior localization performance compared to state-of-the-art methods in both anechoic and reverberant conditions.
arXiv Detail & Related papers (2021-02-28T07:52:20Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from the multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)