BEAT2AASIST model with layer fusion for ESDD 2026 Challenge
- URL: http://arxiv.org/abs/2512.15180v1
- Date: Wed, 17 Dec 2025 08:24:12 GMT
- Title: BEAT2AASIST model with layer fusion for ESDD 2026 Challenge
- Authors: Sanghyeok Chung, Eujin Kim, Donggun Kim, Gaeun Heo, Jeongbin You, Nahyun Lee, Sunmook Choi, Soyul Han, Seungsang Oh, Il-Youp Kwak
- Abstract summary: The ESDD 2026 Challenge is the first large-scale benchmark for Environmental Sound Deepfake Detection. Recent advances in audio generation have increased the risk of realistic environmental sound manipulation. We propose BEAT2AASIST, which extends BEATs-AASIST by splitting BEATs-derived representations along the frequency or channel dimension and processing them with dual AASIST branches.
- Score: 4.983827120166267
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in audio generation have increased the risk of realistic environmental sound manipulation, motivating the ESDD 2026 Challenge as the first large-scale benchmark for Environmental Sound Deepfake Detection (ESDD). We propose BEAT2AASIST, which extends BEATs-AASIST by splitting BEATs-derived representations along the frequency or channel dimension and processing them with dual AASIST branches. To enrich feature representations, we incorporate top-k transformer layer fusion using concatenation, CNN-gated, and SE-gated strategies. In addition, vocoder-based data augmentation is applied to improve robustness against unseen spoofing methods. Experimental results on the official test sets demonstrate that the proposed approach achieves competitive performance across the challenge tracks.
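The abstract names three layer-fusion strategies but the paper's code is not reproduced here. The following is a minimal NumPy sketch of the SE-gated variant: the top-k transformer layer outputs are squeezed to per-layer descriptors, an excitation MLP produces per-layer gates, and the gated layers are summed. All shapes, weights, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def se_gated_layer_fusion(layer_outputs, w1, b1, w2, b2):
    """SE-style gated fusion of top-k transformer layer outputs.

    layer_outputs: (k, T, D) hidden states from the k selected layers
    (illustrative shapes). Squeeze pools each layer to a scalar
    descriptor; the excitation MLP (w1/b1 -> ReLU -> w2/b2 -> sigmoid)
    yields one gate per layer; gated layers are summed to (T, D).
    """
    squeeze = layer_outputs.mean(axis=(1, 2))           # (k,) per-layer descriptor
    hidden = np.maximum(squeeze @ w1 + b1, 0.0)         # ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))   # sigmoid gates in (0, 1)
    fused = (gates[:, None, None] * layer_outputs).sum(axis=0)  # (T, D)
    return fused, gates

rng = np.random.default_rng(0)
k, T, D, r = 4, 10, 8, 2                                # toy sizes
layers = rng.standard_normal((k, T, D))
w1, b1 = rng.standard_normal((k, r)), np.zeros(r)
w2, b2 = rng.standard_normal((r, k)), np.zeros(k)
fused, gates = se_gated_layer_fusion(layers, w1, b1, w2, b2)
print(fused.shape, gates.shape)  # (10, 8) (4,)
```

In the paper's pipeline the fused (T, D) representation would feed the dual AASIST branches; the concatenation variant would instead stack the k layers along the feature axis.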
Related papers
- Test-time Adaptive Hierarchical Co-enhanced Denoising Network for Reliable Multimodal Classification [55.56234913868664]
We propose Test-time Adaptive Hierarchical Co-enhanced Denoising Network (TAHCD) for reliable learning on multimodal data. The proposed method achieves superior classification performance, robustness, and generalization compared with state-of-the-art reliable multimodal learning approaches.
arXiv Detail & Related papers (2026-01-12T03:14:12Z)
- Ivan-ISTD: Rethinking Cross-domain Heteroscedastic Noise Perturbations in Infrared Small Target Detection [53.689841037081834]
Ivan-ISTD is designed to address the dual challenges of cross-domain shift and heteroscedastic noise perturbations in ISTD. Ivan-ISTD demonstrates excellent robustness in cross-domain scenarios.
arXiv Detail & Related papers (2025-10-14T07:48:31Z)
- SynSonic: Augmenting Sound Event Detection through Text-to-Audio Diffusion ControlNet and Effective Sample Filtering [13.592413960039044]
We propose SynSonic, a data augmentation method tailored for Sound Event Detection. We show that SynSonic improves Polyphonic Sound Detection Scores (PSDS1 and PSDS2), enhancing both temporal localization and sound class discrimination.
arXiv Detail & Related papers (2025-09-23T03:48:26Z)
- ECHO: Frequency-aware Hierarchical Encoding for Variable-length Signals [8.411477071838592]
We propose ECHO, a novel foundation model that integrates an advanced band-split architecture with frequency positional embeddings. We evaluate our method on various machine signal datasets.
arXiv Detail & Related papers (2025-08-20T13:10:44Z)
- HOLA: Enhancing Audio-visual Deepfake Detection via Hierarchical Contextual Aggregations and Efficient Pre-training [17.005718886553865]
We present HOLA, our solution to the Video-Level Deepfake Detection track of the 2025 1M-Deepfakes Detection Challenge. Inspired by the success of large-scale pre-training in the general domain, we first scale audio-visual self-supervised pre-training for multimodal video-level deepfake detection. Specifically, HOLA features an iterative-aware cross-modal learning module for selective audio-visual interactions, hierarchical contextual modeling with gated aggregations under a local-global perspective, and a pyramid-like refiner for scale-aware cross-grained semantic enhancement.
arXiv Detail & Related papers (2025-07-30T15:47:12Z)
- Deep Active Speech Cancellation with Mamba-Masking Network [62.73250985838971]
We present a novel deep learning network for Active Speech Cancellation (ASC). The proposed Mamba-Masking architecture introduces a masking mechanism that directly interacts with the encoded reference signal. Experimental results demonstrate substantial performance gains, achieving up to 7.2 dB improvement in ANC scenarios and 6.2 dB in ASC.
arXiv Detail & Related papers (2025-02-03T09:22:26Z)
- RF Challenge: The Data-Driven Radio Frequency Signal Separation Challenge [66.33067693672696]
We address the critical problem of interference rejection in radio-frequency (RF) signals using a data-driven approach that leverages deep-learning methods. A primary contribution of this paper is the introduction of the RF Challenge, a publicly available, diverse RF signal dataset.
arXiv Detail & Related papers (2024-09-13T13:53:41Z)
- SELD-Mamba: Selective State-Space Model for Sound Event Localization and Detection with Source Distance Estimation [21.82296230219289]
We propose a network architecture for SELD called SELD-Mamba, which utilizes Mamba, a selective state-space model.
We adopt the Event-Independent Network V2 (EINV2) as the foundational framework and replace its Conformer blocks with bidirectional Mamba blocks.
We implement a two-stage training method, with the first stage focusing on Sound Event Detection (SED) and Direction of Arrival (DoA) estimation losses, and the second stage reintroducing the Source Distance Estimation (SDE) loss.
arXiv Detail & Related papers (2024-08-09T13:26:08Z)
- Retrieval-Augmented Audio Deepfake Detection [27.13059118273849]
We propose a retrieval-augmented detection framework that augments test samples with similar retrieved samples for enhanced detection.
Experiments show the superior performance of the proposed retrieval-augmented detection (RAD) framework over baseline methods.
arXiv Detail & Related papers (2024-04-22T05:46:40Z)
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the ground-truth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
- Conditional Denoising Diffusion for Sequential Recommendation [62.127862728308045]
Two prominent generative models, Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs), have known limitations: GANs suffer from unstable optimization, while VAEs are prone to posterior collapse and over-smoothed generations.
We present a conditional denoising diffusion model, which includes a sequence encoder, a cross-attentive denoising decoder, and a step-wise diffuser.
arXiv Detail & Related papers (2023-04-22T15:32:59Z)
- TOLD: A Novel Two-Stage Overlap-Aware Framework for Speaker Diarization [54.41494515178297]
We reformulate speaker diarization as a single-label classification problem.
We propose the overlap-aware EEND (EEND-OLA) model, in which speaker overlaps and dependency can be modeled explicitly.
Compared with the original EEND, the proposed EEND-OLA achieves a 14.39% relative improvement in diarization error rate.
arXiv Detail & Related papers (2023-03-08T05:05:26Z)
- Deep Dense and Convolutional Autoencoders for Unsupervised Anomaly Detection in Machine Condition Sounds [55.18259748448095]
This report describes two methods that were developed for Task 2 of the DCASE 2020 challenge.
The challenge involves unsupervised learning to detect anomalous sounds; only normal machine working condition samples are available during training.
The two methods use deep autoencoders, based on dense and convolutional architectures, that operate on mel-spectrogram sound features.
arXiv Detail & Related papers (2020-06-18T10:49:49Z)
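The autoencoder entry above relies on a standard idea: train a reconstruction model on normal sounds only, then flag inputs that reconstruct poorly. Below is a minimal NumPy sketch of that scoring scheme, where a linear (PCA-style) autoencoder stands in for the dense/convolutional networks and a random feature matrix stands in for mel-spectrogram frames; all names and shapes are illustrative assumptions, not the DCASE submission's code.

```python
import numpy as np

def fit_linear_ae(X, n_components):
    """Fit a PCA-style linear autoencoder on normal frames only."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    W = vt[:n_components].T            # (D, n_components) shared encode/decode weights
    return mean, W

def anomaly_score(X, mean, W):
    """Per-frame reconstruction error; higher error => more anomalous."""
    Z = (X - mean) @ W                 # encode to the low-dimensional latent
    X_hat = Z @ W.T + mean             # decode back to feature space
    return ((X - X_hat) ** 2).mean(axis=1)

rng = np.random.default_rng(1)
# Toy "normal" frames drawn from a low-variance linear process.
normal = rng.standard_normal((200, 16)) @ rng.standard_normal((16, 16)) * 0.1
mean, W = fit_linear_ae(normal, n_components=4)
scores_normal = anomaly_score(normal, mean, W)
scores_outlier = anomaly_score(normal + 5.0, mean, W)  # shifted, unseen frames
print(scores_outlier.mean() > scores_normal.mean())
```

A deployment would replace the linear model with the dense or convolutional autoencoders described in the paper and threshold the score to decide normal vs. anomalous.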
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of the listed information and is not responsible for any consequences.