Full-Frequency Temporal Patching and Structured Masking for Enhanced Audio Classification
- URL: http://arxiv.org/abs/2508.21243v1
- Date: Thu, 28 Aug 2025 22:13:20 GMT
- Title: Full-Frequency Temporal Patching and Structured Masking for Enhanced Audio Classification
- Authors: Aditya Makineni, Baocheng Geng, Qing Tian
- Abstract summary: We propose a patching strategy that better matches the time-frequency asymmetry of spectrograms by spanning full frequency bands with localized temporal context. We also introduce SpecMask, a patch-aligned spectrogram augmentation that combines full-frequency and localized time-frequency masks under a fixed masking budget.
- Score: 3.588372242361407
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers and State-Space Models (SSMs) have advanced audio classification by modeling spectrograms as sequences of patches. However, existing models such as the Audio Spectrogram Transformer (AST) and Audio Mamba (AuM) adopt square patching from computer vision, which disrupts continuous frequency patterns and produces an excessive number of patches, slowing training and increasing computation. We propose Full-Frequency Temporal Patching (FFTP), a patching strategy that better matches the time-frequency asymmetry of spectrograms by spanning full frequency bands with localized temporal context, preserving harmonic structure while significantly reducing patch count and computation. We also introduce SpecMask, a patch-aligned spectrogram augmentation that combines full-frequency and localized time-frequency masks under a fixed masking budget, enhancing temporal robustness while preserving spectral continuity. When applied to both AST and AuM, our patching method with SpecMask improves mAP by up to +6.76 on AudioSet-18k and accuracy by up to +8.46 on SpeechCommandsV2, while reducing computation by up to 83.26%, demonstrating both performance and efficiency gains.
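For intuition on where the patch-count reduction comes from, here is a minimal sketch (not the authors' code) comparing the number of tokens produced by non-overlapping square patches versus full-frequency temporal patches. The spectrogram size (128 mel bands x 1024 frames) and the 16-wide patches are illustrative assumptions; AST's actual overlapping patching produces even more tokens than the square count shown here.

```python
import math

def square_patch_count(n_mels, n_frames, patch=16):
    # ViT-style non-overlapping square patches: tokens grow with both axes.
    return math.ceil(n_mels / patch) * math.ceil(n_frames / patch)

def full_freq_patch_count(n_mels, n_frames, time_width=16):
    # Each patch spans all frequency bins and a short temporal window,
    # so the token count depends only on the time axis.
    return math.ceil(n_frames / time_width)

n_mels, n_frames = 128, 1024  # illustrative spectrogram size
sq = square_patch_count(n_mels, n_frames)     # 8 * 64 = 512 tokens
ff = full_freq_patch_count(n_mels, n_frames)  # 64 tokens
print(sq, ff, 1 - ff / sq)  # 512 64 0.875
```

With these assumed dimensions, full-frequency patching cuts the sequence length by 87.5%, which is the mechanism behind the computation savings the abstract reports.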
Related papers
- PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs [57.790910044227935]
Video LLMs suffer from temporal inconsistency: small shifts in frame timing can flip attention and suppress relevant frames. We present Phase Aggregated Smoothing (PAS), a training-free mechanism that applies small opposed phase offsets across heads and then aggregates their outputs. Our analysis shows that the RoPE-rotated logit can be approximated as a content dot product scaled by a time kernel; smoothing this kernel yields Lipschitz stability of attention to small temporal shifts, and multi-phase averaging attenuates high-frequency ripples while preserving per-head spectra under Nyquist-valid sampling.
arXiv Detail & Related papers (2025-11-14T05:56:47Z)
- Edit-Your-Interest: Efficient Video Editing via Feature Most-Similar Propagation [53.05471174430247]
Edit-Your-Interest is a text-driven, zero-shot video editing method. It reduces computational overhead compared to full-sequence temporal modeling approaches. It outperforms state-of-the-art methods in both efficiency and visual fidelity.
arXiv Detail & Related papers (2025-10-15T01:55:32Z)
- Dual-Domain Masked Image Modeling: A Self-Supervised Pretraining Strategy Using Spatial and Frequency Domain Masking for Hyperspectral Data [35.34526230299484]
We propose a self-supervised pretraining strategy for hyperspectral data that utilizes the large portion of unlabeled data. Our method introduces a novel dual-domain masking mechanism that operates in both the spatial and frequency domains. We evaluate our approach on three publicly available HSI classification benchmarks and demonstrate that it achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-05-06T06:24:21Z)
- Multi-View Spectrogram Transformer for Respiratory Sound Classification [32.346046623638394]
A Multi-View Spectrogram Transformer (MVST) is proposed to embed different views of time-frequency characteristics into the vision transformer.
Experimental results on the ICBHI dataset demonstrate that the proposed MVST significantly outperforms state-of-the-art methods for classifying respiratory sounds.
arXiv Detail & Related papers (2023-11-16T08:17:02Z)
- RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation [18.93255531121519]
We present a novel time-frequency domain audio-visual speech separation method.
RTFS-Net applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform.
This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
arXiv Detail & Related papers (2023-09-29T12:38:00Z)
- Multiscale Audio Spectrogram Transformer for Efficient Audio Classification [1.797470734877199]
We develop a multiscale audio spectrogram Transformer (MAST) that employs hierarchical representation learning for efficient audio classification.
Specifically, MAST employs one-dimensional and two-dimensional pooling operators along the time and frequency domains in different stages, progressively reducing the number of tokens while increasing the feature dimensions.
arXiv Detail & Related papers (2023-03-19T20:21:29Z)
- MAST: Multiscale Audio Spectrogram Transformers [53.06337011259031]
We present Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST).
In practice, MAST significantly outperforms AST by an average accuracy margin of 3.4% across 8 speech and non-speech tasks from the LAPE Benchmark.
arXiv Detail & Related papers (2022-11-02T23:34:12Z)
- Masked Autoencoders that Listen [79.99280830830854]
This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms.
Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.
The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram.
arXiv Detail & Related papers (2022-07-13T17:59:55Z)
- Masked Frequency Modeling for Self-Supervised Visual Pre-Training [102.89756957704138]
We present Masked Frequency Modeling (MFM), a unified frequency-domain-based approach for self-supervised pre-training of visual models.
MFM first masks out a portion of frequency components of the input image and then predicts the missing frequencies on the frequency spectrum.
For the first time, MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations even using none of the following: (i) extra data, (ii) extra model, (iii) mask token.
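As a rough sketch of the frequency-masking idea (not the paper's code), one can zero out a band of Fourier components and treat the removed frequencies as the prediction target. The circular low-pass mask and its radius here are illustrative assumptions:

```python
import numpy as np

def mask_low_frequencies(img, radius):
    # Transform to the frequency domain, zero out components within `radius`
    # of the (shifted) DC term, and transform back. In MFM-style pretraining,
    # the removed frequencies would be the reconstruction target.
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.ogrid[:h, :w]
    low = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    f_masked = np.where(low, 0, f)
    return np.fft.ifft2(np.fft.ifftshift(f_masked)).real

img = np.random.default_rng(0).normal(size=(16, 16))
hp = mask_low_frequencies(img, radius=2)  # high-pass view of the image
```

Swapping the mask (masking high instead of low frequencies) gives the complementary low-pass variant; the choice of which band to mask is one of the knobs such a scheme exposes.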
arXiv Detail & Related papers (2022-06-15T17:58:30Z)
- SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
The adaptation is processed in the time-frequency domain to keep the computational cost almost the same as that of conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.