Sound Event Detection Transformer: An Event-based End-to-End Model for
Sound Event Detection
- URL: http://arxiv.org/abs/2110.02011v1
- Date: Tue, 5 Oct 2021 12:56:23 GMT
- Title: Sound Event Detection Transformer: An Event-based End-to-End Model for
Sound Event Detection
- Authors: Zhirong Ye, Xiangdong Wang, Hong Liu, Yueliang Qian, Rui Tao, Long
Yan, Kazushige Ouchi
- Abstract summary: Sound event detection (SED) has gained increasing attention with its wide application in surveillance, video indexing, etc.
Existing SED models mainly generate frame-level predictions, converting the task into a sequence multi-label classification problem.
This paper first presents the 1D Detection Transformer (1D-DETR), inspired by the Detection Transformer.
Given the characteristics of SED, an audio query and a one-to-many matching strategy are added to 1D-DETR to form the Sound Event Detection Transformer (SEDT).
- Score: 12.915110466077866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sound event detection (SED) has gained increasing attention with its wide
application in surveillance, video indexing, etc. Existing SED models mainly
generate frame-level predictions, converting the task into a sequence
multi-label classification problem, which inevitably brings a trade-off between
event boundary detection and audio tagging when using weakly labeled data to
train the model. Moreover, this formulation requires post-processing and cannot
be trained in an end-to-end way. This paper first presents the 1D Detection
Transformer (1D-DETR), inspired by the Detection Transformer. Furthermore,
given the characteristics of SED, an audio query and a one-to-many matching
strategy for fine-tuning the model are added to 1D-DETR to form the Sound Event
Detection Transformer (SEDT), which generates event-level predictions and
performs detection end-to-end. Experiments are conducted on the URBAN-SED
dataset and the DCASE2019 Task4 dataset, and both achieve competitive results
compared with SOTA models. The application of SEDT to SED shows that it can
serve as a framework for one-dimensional signal detection and may be extended
to other similar tasks.
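To make the event-level formulation more concrete, below is a minimal sketch of DETR-style set matching adapted to one-dimensional sound events. It is an illustration under stated assumptions, not the authors' released code: each decoder query is assumed to output a class distribution and a normalized (center, width) interval on the time axis, and predictions are assigned to ground-truth events by Hungarian matching over a combined classification and localization cost. All function names, shapes, and cost weights are assumptions; the paper's one-to-many matching for fine-tuning could be emulated by repeating each ground-truth event before matching.
```python
# Illustrative sketch of bipartite (Hungarian) matching for 1D event-level
# predictions, in the spirit of DETR/SEDT. Not the authors' implementation;
# shapes, names, and cost weights are assumptions for illustration.
import numpy as np
from scipy.optimize import linear_sum_assignment


def interval_l1(pred_cw, gt_cw):
    """Pairwise L1 distance between (center, width) intervals: (N, 2) x (M, 2) -> (N, M)."""
    return np.abs(pred_cw[:, None, :] - gt_cw[None, :, :]).sum(axis=-1)


def match_events(class_probs, pred_cw, gt_labels, gt_cw, w_cls=1.0, w_l1=5.0):
    """Assign N query predictions to M ground-truth events.

    class_probs: (N, C) per-query class probabilities
    pred_cw:     (N, 2) predicted (center, width), normalized to [0, 1]
    gt_labels:   (M,)   ground-truth class indices
    gt_cw:       (M, 2) ground-truth (center, width)
    """
    cost_cls = -class_probs[:, gt_labels]   # (N, M): negative score of the true class
    cost_loc = interval_l1(pred_cw, gt_cw)  # (N, M): localization cost
    cost = w_cls * cost_cls + w_l1 * cost_loc
    pred_idx, gt_idx = linear_sum_assignment(cost)  # one-to-one assignment
    return pred_idx, gt_idx


# Toy usage: 10 queries, 5 classes, 2 ground-truth events.
rng = np.random.default_rng(0)
logits = rng.normal(size=(10, 5))
class_probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
pred_cw = rng.uniform(size=(10, 2))
gt_labels = np.array([1, 3])
gt_cw = np.array([[0.20, 0.10], [0.70, 0.30]])
print(match_events(class_probs, pred_cw, gt_labels, gt_cw))
# A crude one-to-many variant (an assumption, for fine-tuning-style matching):
# repeat each ground-truth event k times before calling match_events, e.g.
# match_events(class_probs, pred_cw, np.repeat(gt_labels, 2), np.repeat(gt_cw, 2, axis=0))
```
In DETR-style training, matched pairs would contribute classification and localization losses while unmatched queries are pushed toward a "no event" class, which is what removes the need for frame-level post-processing.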
Related papers
- Prototype based Masked Audio Model for Self-Supervised Learning of Sound Event Detection [22.892382672888488]
Semi-supervised algorithms rely on labeled data to learn from unlabeled data.
We introduce the Prototype based Masked Audio Model(PMAM) algorithm for self-supervised representation learning in SED.
arXiv Detail & Related papers (2024-09-26T09:07:20Z)
- Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
arXiv Detail & Related papers (2023-11-09T05:24:20Z)
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the groundtruth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
- Remote Sensing Change Detection With Transformers Trained from Scratch [62.96911491252686]
Transformer-based change detection (CD) approaches either employ a model pre-trained on the large-scale ImageNet classification dataset or rely on first pre-training on another CD dataset and then fine-tuning on the target benchmark.
We develop an end-to-end CD approach with transformers that is trained from scratch and yet achieves state-of-the-art performance on four public benchmarks.
arXiv Detail & Related papers (2023-04-13T17:57:54Z)
- DEGAN: Time Series Anomaly Detection using Generative Adversarial Network Discriminators and Density Estimation [0.0]
We propose an unsupervised Generative Adversarial Network (GAN)-based anomaly detection framework, DEGAN.
It relies solely on normal time series data as input to train a well-configured discriminator (D) into a standalone anomaly predictor.
arXiv Detail & Related papers (2022-10-05T04:32:12Z)
- Event Data Association via Robust Model Fitting for Event-based Object Tracking [66.05728523166755]
We propose a novel Event Data Association (called EDA) approach to explicitly address the event association and fusion problem.
The proposed EDA seeks event trajectories that best fit the event data in order to perform unified data association and information fusion.
The experimental results show the effectiveness of EDA under challenging scenarios, such as high speed, motion blur, and high dynamic range conditions.
arXiv Detail & Related papers (2021-10-25T13:56:00Z)
- SoundDet: Polyphonic Sound Event Detection and Localization from Raw Waveform [48.68714598985078]
SoundDet is an end-to-end trainable and light-weight framework for polyphonic moving sound event detection and localization.
SoundDet directly consumes the raw, multichannel waveform and treats the temporal sound event as a complete "sound-object" to be detected.
A dense sound proposal event map is then constructed to handle the challenges of predicting events with large varying temporal duration.
arXiv Detail & Related papers (2021-06-13T11:43:41Z)
- Few-Shot Event Detection with Prototypical Amortized Conditional Random Field [8.782210889586837]
Event Detection tends to struggle when it needs to recognize novel event types with a few samples.
We present a novel unified joint model which converts the task to a few-shot tagging problem with a double-part tagging scheme.
We conduct experiments on the benchmark dataset FewEvent, and the results show that the tagging-based methods outperform existing pipeline and joint-learning methods.
arXiv Detail & Related papers (2020-12-04T01:11:13Z)
- Unsupervised Domain Adaptation for Acoustic Scene Classification Using Band-Wise Statistics Matching [69.24460241328521]
Machine learning algorithms can be negatively affected by mismatches between training (source) and test (target) data distributions.
We propose an unsupervised domain adaptation method that consists of aligning the first- and second-order sample statistics of each frequency band of target-domain acoustic scenes to the ones of the source-domain training dataset.
We show that the proposed method outperforms the state-of-the-art unsupervised methods found in the literature in terms of both source- and target-domain classification accuracy.
arXiv Detail & Related papers (2020-04-30T23:56:05Z)
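The band-wise statistics matching in the last entry above is concrete enough to sketch. The following is a minimal illustration, not the paper's official implementation: per-frequency-band means and standard deviations are computed on source-domain training spectrograms, and each band of the target-domain spectrograms is standardized and rescaled to those source statistics. Array shapes and function names are assumptions.
```python
# Illustrative sketch (not the paper's code) of aligning first- and second-order
# per-band statistics of target-domain spectrograms to the source domain.
import numpy as np


def bandwise_stats(specs):
    """Per-band mean and std over clips and frames.

    specs: (num_clips, num_bands, num_frames) log-mel spectrograms (assumed layout).
    Returns two arrays of shape (num_bands,).
    """
    return specs.mean(axis=(0, 2)), specs.std(axis=(0, 2))


def align_to_source(target_specs, source_specs, eps=1e-8):
    """Standardize each target band, then rescale it to the source-domain statistics."""
    src_mean, src_std = bandwise_stats(source_specs)
    tgt_mean, tgt_std = bandwise_stats(target_specs)
    z = (target_specs - tgt_mean[None, :, None]) / (tgt_std[None, :, None] + eps)
    return z * src_std[None, :, None] + src_mean[None, :, None]
```
Because only band-level means and variances are matched, the adaptation requires no target labels and leaves the trained classifier unchanged, consistent with the unsupervised setting described above.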
This list is automatically generated from the titles and abstracts of the papers on this site.