ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning
- URL: http://arxiv.org/abs/2507.02666v1
- Date: Thu, 03 Jul 2025 14:29:43 GMT
- Title: ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning
- Authors: Junyu Wang, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang
- Abstract summary: Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks. These results highlight ASDA's effectiveness in audio tasks, paving the way for broader applications.
- Score: 57.67273340380651
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent advancements in audio self-supervised representation learning, the standard Transformer architecture has emerged as the predominant approach, yet its attention mechanism often allocates a portion of attention weights to irrelevant information, potentially impairing the model's discriminative ability. To address this, we introduce a differential attention mechanism, which effectively mitigates ineffective attention allocation through the integration of dual-softmax operations and appropriately tuned differential coefficients. Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks, including audio classification (49.0% mAP on AS-2M, 41.5% mAP on AS20K), keyword spotting (98.3% accuracy on SPC-2), and environmental sound classification (96.1% accuracy on ESC-50). These results highlight ASDA's effectiveness in audio tasks, paving the way for broader applications.
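The dual-softmax idea in the abstract can be sketched compactly. Below is a minimal, illustrative PyTorch sketch of differential attention in the style of the DIFF Transformer, on which this line of work builds: two softmax attention maps are computed from two query/key projections and subtracted with a tunable differential coefficient. The class name, the single-head layout, and the learnable coefficient `lmbda` are assumptions for illustration and do not reproduce ASDA's actual head splitting, normalization, or coefficient tuning.

```python
# Sketch of a dual-softmax differential attention layer (assumed formulation,
# not ASDA's released implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DifferentialAttention(nn.Module):
    def __init__(self, dim: int, lambda_init: float = 0.5):
        super().__init__()
        # Two query/key projections feed the two softmax maps; one value projection.
        self.q_proj = nn.Linear(dim, 2 * dim, bias=False)
        self.k_proj = nn.Linear(dim, 2 * dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.out_proj = nn.Linear(dim, dim, bias=False)
        # Differential coefficient; the paper stresses appropriate tuning, so it
        # is kept learnable here (an assumption).
        self.lmbda = nn.Parameter(torch.tensor(lambda_init))
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), e.g. patch embeddings of a log-mel spectrogram.
        q1, q2 = self.q_proj(x).chunk(2, dim=-1)
        k1, k2 = self.k_proj(x).chunk(2, dim=-1)
        v = self.v_proj(x)
        # Two independent softmax attention maps.
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * self.scale, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * self.scale, dim=-1)
        # Subtracting the second map cancels weight that both maps place on
        # irrelevant tokens (common-mode "attention noise").
        attn = a1 - self.lmbda * a2
        return self.out_proj(attn @ v)


if __name__ == "__main__":
    layer = DifferentialAttention(dim=192)
    tokens = torch.randn(2, 512, 192)   # 2 spectrograms, 512 patches, dim 192
    print(layer(tokens).shape)          # torch.Size([2, 512, 192])
```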
Related papers
- Performance improvement of spatial semantic segmentation with enriched audio features and agent-based error correction for DCASE 2025 Challenge Task 4 [2.68085089595424]
This report presents submission systems for Task 4 of the DCASE 2025 Challenge. First, additional audio features are incorporated into the embedding extracted from the mel-spectral feature. Second, an agent-based label correction system is applied to the outputs processed by the S5 system.
arXiv Detail & Related papers (2025-06-26T12:27:52Z) - Efficient Leaf Disease Classification and Segmentation using Midpoint Normalization Technique and Attention Mechanism [0.0]
We introduce a transformative two-stage methodology, Mid Point Normalization (MPN), for intelligent image preprocessing. Our classification pipeline achieves 93% accuracy while maintaining exceptional class-wise balance. For segmentation tasks, we seamlessly integrate identical attention blocks within a U-Net architecture using MPN-enhanced inputs.
arXiv Detail & Related papers (2025-05-27T15:14:04Z) - Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models [75.58140912100318]
We introduce an efficient large language model specialized for the system domain, empowered by a novel architecture including DiffQKV attention. We conduct experiments that demonstrate the model's varying sensitivity to the compression of K and V components, leading to the development of differentially compressed KV. We introduce the first comprehensive benchmark AIMicius, where Sigma demonstrates remarkable performance across all tasks, significantly outperforming GPT-4 with an absolute improvement of up to 52.5%.
arXiv Detail & Related papers (2025-01-23T12:58:14Z) - Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
First, speaker-regularized spectral basis embedding (SBE) features that exploit a special regularization term to enforce homogeneity of speaker features during adaptation.
Second, feature-based learning hidden unit contributions (f-LHUC) conditioned on VR-LH features, which are shown to be insensitive to speaker-level data quantity in test-time adaptation.
arXiv Detail & Related papers (2024-07-08T18:20:24Z) - Microphone Conversion: Mitigating Device Variability in Sound Event Classification [0.0]
We introduce a new augmentation technique to enhance the resilience of sound event classification (SEC) systems against device variability through the use of CycleGAN.
Our method addresses limited device diversity in training data by enabling unpaired training to transform input spectrograms as if they were recorded on a different device.
arXiv Detail & Related papers (2024-01-12T21:59:01Z) - MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes the representation of each modality by fusing them at different levels of the audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z) - ERNIE-SPARSE: Learning Hierarchical Efficient Transformer Through Regularized Self-Attention [48.697458429460184]
Two factors, information bottleneck sensitivity and inconsistency between different attention topologies, could affect the performance of the Sparse Transformer.
This paper proposes a well-designed model named ERNIE-Sparse.
It consists of two distinctive parts: (i) Hierarchical Sparse Transformer (HST) to sequentially unify local and global information, and (ii) Self-Attention Regularization (SAR) to minimize the distance for transformers with different attention topologies.
arXiv Detail & Related papers (2022-03-23T08:47:01Z) - Capturing scattered discriminative information using a deep architecture in acoustic scene classification [49.86640645460706]
In this study, we investigate various methods to capture discriminative information and simultaneously mitigate the overfitting problem.
We adopt a max feature map method to replace conventional non-linear activations in a deep neural network.
Two data augmentation methods and two deep architecture modules are further explored to reduce overfitting and sustain the system's discriminative power.
arXiv Detail & Related papers (2020-07-09T08:32:06Z)