MAT-SED: A Masked Audio Transformer with Masked-Reconstruction Based Pre-training for Sound Event Detection
- URL: http://arxiv.org/abs/2408.08673v2
- Date: Mon, 19 Aug 2024 07:11:39 GMT
- Title: MAT-SED: A Masked Audio Transformer with Masked-Reconstruction Based Pre-training for Sound Event Detection
- Authors: Pengfei Cai, Yan Song, Kang Li, Haoyu Song, Ian McLoughlin
- Abstract summary: We propose a pure Transformer-based SED model with masked-reconstruction based pre-training, termed MAT-SED.
Both the encoder and the context network are jointly fine-tuned in a semi-supervised manner.
- Score: 18.0885324380572
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sound event detection (SED) methods that leverage a large pre-trained Transformer encoder network have shown promising performance in recent DCASE challenges. However, they still rely on an RNN-based context network to model temporal dependencies, largely due to the scarcity of labeled data. In this work, we propose a pure Transformer-based SED model with masked-reconstruction based pre-training, termed MAT-SED. Specifically, a Transformer with relative positional encoding is first designed as the context network, pre-trained by the masked-reconstruction task on all available target data in a self-supervised way. Both the encoder and the context network are jointly fine-tuned in a semi-supervised manner. Furthermore, a global-local feature fusion strategy is proposed to enhance the localization capability. On DCASE2023 Task 4, MAT-SED surpasses state-of-the-art performance, achieving PSDS1/PSDS2 scores of 0.587/0.896, respectively.
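The masked-reconstruction pre-training task described in the abstract can be illustrated with a minimal sketch: random frame positions are zeroed out, the network is asked to predict the original frames, and the loss is computed only at the masked positions. The function names, the mask ratio, and the per-frame scalar features below are illustrative assumptions, not the paper's implementation (MAT-SED masks latent frame embeddings and reconstructs them with a Transformer context network).

```python
import random

def mask_frames(features, mask_ratio=0.75, seed=0):
    # Illustrative: pick a random subset of frame indices and zero them out.
    # (The actual masking strategy in the paper is not reproduced here.)
    rng = random.Random(seed)
    n = len(features)
    n_mask = int(n * mask_ratio)
    masked_idx = set(rng.sample(range(n), n_mask))
    masked = [0.0 if i in masked_idx else f for i, f in enumerate(features)]
    return masked, masked_idx

def reconstruction_loss(pred, target, masked_idx):
    # Mean squared error evaluated only at the masked positions,
    # so the model is scored on what it had to reconstruct.
    if not masked_idx:
        return 0.0
    return sum((pred[i] - target[i]) ** 2 for i in masked_idx) / len(masked_idx)
```

During self-supervised pre-training, the context network would minimize this loss over unlabeled target-domain audio before the joint semi-supervised fine-tuning stage.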
Related papers
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and then predicts discrete acoustic units.
We enhance the model performance by subword prediction in the first-pass decoder.
We show that the proposed methods boost the performance even when predicting spectrogram in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z) - Efficient Sparsely Activated Transformers [0.34410212782758054]
Transformer-based neural networks have achieved state-of-the-art task performance in a number of machine learning domains.
Recent work has explored the integration of dynamic behavior into these networks in the form of mixture-of-expert layers.
We introduce a novel system named PLANER that takes an existing Transformer-based network and a user-defined latency target.
arXiv Detail & Related papers (2022-08-31T00:44:27Z) - Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection [78.2325219839805]
imTED improves the state-of-the-art of few-shot object detection by up to 7.6% AP.
Experiments on MS COCO dataset demonstrate that imTED consistently outperforms its counterparts by 2.8%.
arXiv Detail & Related papers (2022-05-19T15:11:20Z) - Self-Supervised Pre-Training for Transformer-Based Person Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset to boost the performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z) - Context-Aware Transformer Transducer for Speech Recognition [21.916660252023707]
We present a novel context-aware transformer transducer (CATT) network that improves the state-of-the-art transformer-based ASR system by taking advantage of such contextual signals.
We show that CATT, using a BERT based context encoder, improves the word error rate of the baseline transformer transducer and outperforms an existing deep contextual model by 24.2% and 19.4% respectively.
arXiv Detail & Related papers (2021-11-05T04:14:35Z) - Sound Event Detection Transformer: An Event-based End-to-End Model for Sound Event Detection [12.915110466077866]
Sound event detection (SED) has gained increasing attention with its wide application in surveillance, video indexing, etc.
Existing models in SED mainly generate frame-level predictions, converting it into a sequence multi-label classification problem.
This paper firstly presents the 1D Detection Transformer (1D-DETR), inspired by Detection Transformer.
Given the characteristics of SED, the audio query and a one-to-many matching strategy are added to 1D-DETR to form the Sound Event Detection Transformer (SEDT).
arXiv Detail & Related papers (2021-10-05T12:56:23Z) - Wake Word Detection with Streaming Transformers [72.66551640048405]
Our experiments on the Mobvoi wake word dataset demonstrate that our proposed Transformer model outperforms the baseline convolution network by 25% on average in false rejection rate at the same false alarm rate.
arXiv Detail & Related papers (2021-02-08T19:14:32Z) - Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes.
arXiv Detail & Related papers (2020-12-31T18:55:57Z) - Conditioning Trick for Training Stable GANs [70.15099665710336]
We propose a conditioning trick, called difference departure from normality, applied on the generator network in response to instability issues during GAN training.
We force the generator to get closer to the departure from normality function of real samples computed in the spectral domain of Schur decomposition.
arXiv Detail & Related papers (2020-10-12T16:50:22Z) - UHH-LT at SemEval-2020 Task 12: Fine-Tuning of Pre-Trained Transformer Networks for Offensive Language Detection [28.701023986344993]
Fine-tuning of pre-trained transformer networks such as BERT yield state-of-the-art results for text classification tasks.
Our RoBERTa-based classifier officially ranks 1st in the SemEval 2020 Task12 for the English language.
arXiv Detail & Related papers (2020-04-23T23:59:58Z) - A Multi-Scale Tensor Network Architecture for Classification and Regression [0.0]
We present an algorithm for supervised learning using tensor networks.
We employ a step of preprocessing the data by coarse-graining through a sequence of wavelet transformations.
We show how fine-graining through the network may be used to initialize models with access to finer-scale features.
arXiv Detail & Related papers (2020-01-22T21:26:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.