BAT: Better Audio Transformer Guided by Convex Gated Probing
- URL: http://arxiv.org/abs/2602.16305v1
- Date: Wed, 18 Feb 2026 09:37:20 GMT
- Title: BAT: Better Audio Transformer Guided by Convex Gated Probing
- Authors: Houtan Ghaffari, Lukas Rauch, Christoph Scholz, Paul Devos
- Abstract summary: Probing is widely adopted in computer vision to faithfully evaluate self-supervised learning (SSL) embeddings. Audio SSL models still rely on fine-tuning because simple probing fails to unlock their full potential. We introduce Convex Gated Probing (CGP), a prototype-based method that drastically closes the gap between fine-tuning and probing in audio.
- Score: 2.705930097524593
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Probing is widely adopted in computer vision to faithfully evaluate self-supervised learning (SSL) embeddings, as fine-tuning may misrepresent their inherent quality. In contrast, audio SSL models still rely on fine-tuning because simple probing fails to unlock their full potential and alters their rankings when competing for SOTA on AudioSet. Hence, a robust and efficient probing mechanism is required to guide the trajectory of audio SSL towards reliable and reproducible methods. We introduce Convex Gated Probing (CGP), a prototype-based method that drastically closes the gap between fine-tuning and probing in audio. CGP efficiently utilizes all frozen layers via a gating mechanism and exposes the location of latent task-relevant information. Guided by CGP, we rework the entire SSL pipeline of current SOTA audio models that use legacy implementations of prior SSL methods. By refining data preprocessing, model architecture, and pre-training recipe, we introduce Better Audio Transformer (BAT), and establish new SOTA on audio benchmarks.
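The abstract describes CGP as a prototype-based probe that combines all frozen layers through a gating mechanism. A minimal sketch of that idea, assuming the gate is a simplex-constrained (softmax) weighting over layers and the head scores inputs by cosine similarity to learned class prototypes; all names and details here are illustrative, not the paper's implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def convex_gated_probe(layer_embs, gate_logits, prototypes):
    """Sketch of a convex gated probe (hypothetical details).

    layer_embs : (L, B, D) frozen embeddings from all L transformer layers
    gate_logits: (L,) learnable gate parameters, mapped to the simplex
    prototypes : (C, D) learnable class prototypes
    Returns (B, C) cosine-similarity logits.
    """
    w = softmax(gate_logits)                         # convex weights, sum to 1
    pooled = np.einsum("l,lbd->bd", w, layer_embs)   # convex combination of layers
    pooled = pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)
    protos = prototypes / np.linalg.norm(prototypes, axis=-1, keepdims=True)
    return pooled @ protos.T
```

Because the gate weights sum to one, inspecting them after training exposes which layers carry the task-relevant information, matching the abstract's claim that CGP "exposes the location of latent task-relevant information."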
Related papers
- Representation-Regularized Convolutional Audio Transformer for Audio Understanding [53.092757178419355]
Bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. We propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges.
arXiv Detail & Related papers (2026-01-29T12:16:19Z) - USAD: Universal Speech and Audio Representation via Distillation [56.91647396619358]
Universal Speech and Audio Distillation (USAD) is a unified approach to audio representation learning. USAD integrates diverse audio types - speech, sound, and music - into a single model.
arXiv Detail & Related papers (2025-06-23T17:02:00Z) - SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes [9.639849424773614]
Self-Supervised Learning from Audio Mixtures (SSLAM) is designed to improve the model's ability to learn from polyphonic data. SSLAM achieves up to a 3.9% improvement on AudioSet-2M (AS-2M), reaching a mean average precision (mAP) of 50.2. For polyphonic datasets, SSLAM sets new SOTA in both linear evaluation and fine-tuning regimes, with performance improvements of up to 9.1% (mAP).
arXiv Detail & Related papers (2025-06-13T20:48:46Z) - From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z) - Compact Speech Translation Models via Discrete Speech Units Pretraining [75.27125825975858]
Our method is based on Discrete Speech Units (DSU) extracted from the SSL model.
In addition to being compact, our method requires no transcripts, making it applicable to low-resource settings.
arXiv Detail & Related papers (2024-02-29T16:36:51Z) - EAT: Self-Supervised Pre-Training with Efficient Audio Transformer [2.443213094810588]
Efficient Audio Transformer (EAT) is inspired by the success of data2vec 2.0 in image modality and Audio-MAE in audio modality.
A novel Utterance-Frame Objective (UFO) is designed to enhance the modeling capability of acoustic events.
Experiment results demonstrate that EAT achieves state-of-the-art (SOTA) performance on a range of audio-related tasks.
arXiv Detail & Related papers (2024-01-07T14:31:27Z) - Leveraging Pre-trained AudioLDM for Sound Generation: A Benchmark Study [33.10311742703679]
We make the first attempt to investigate the benefits of pre-training on sound generation with AudioLDM.
Our study demonstrates the advantages of the pre-trained AudioLDM, especially in data-scarcity scenarios.
We benchmark the sound generation task on various frequently-used datasets.
arXiv Detail & Related papers (2023-03-07T12:49:45Z) - Latent Class-Conditional Noise Model [54.56899309997246]
We introduce a Latent Class-Conditional Noise model (LCCN) to parameterize the noise transition under a Bayesian framework.
We then deduce a dynamic label regression method for LCCN, whose Gibbs sampler allows us to efficiently infer the latent true labels.
Our approach safeguards the stable update of the noise transition, which avoids previous arbitrarily tuning from a mini-batch of samples.
arXiv Detail & Related papers (2023-02-19T15:24:37Z) - BEATs: Audio Pre-Training with Acoustic Tokenizers [77.8510930885778]
Self-supervised learning (SSL) has seen rapid growth in the language, vision, speech, and audio domains over the past few years.
We propose BEATs, an iterative audio pre-training framework to learn Bidirectional representation from Audio Transformers.
In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask and label prediction manner.
Then, we train an acoustic tokenizer for the next iteration by distilling the semantic knowledge from the pre-trained or fine-tuned audio SSL model.
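The BEATs summary above describes an iterative loop: a random-projection tokenizer first provides discrete labels for masked prediction, then the tokenizer is re-fit to the trained model's representations and the cycle repeats. A minimal sketch of that loop, with the SSL model training stubbed out and centroid re-fitting standing in for semantic distillation (both are simplifying assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_code(feats, codebook):
    # Quantize each frame to its nearest codebook entry (acoustic tokenizer).
    dists = ((feats[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(-1)

def beats_style_iterations(feats, num_codes=8, iters=2):
    """Hypothetical sketch of BEATs-style iterative pre-training.

    Iteration 1 uses a randomly initialized codebook (random projection);
    each later iteration re-fits the tokenizer from the model's features.
    Actual SSL training (mask-and-label prediction) is elided.
    """
    dim = feats.shape[-1]
    codebook = rng.normal(size=(num_codes, dim))   # random-projection tokenizer
    labels = nearest_code(feats, codebook)
    for _ in range(iters):
        # ... train the audio SSL model on `labels` via masked prediction ...
        # Re-fit the tokenizer (here: centroids of the current assignments,
        # standing in for distillation from the pre-trained model).
        for c in range(num_codes):
            mask = labels == c
            if mask.any():
                codebook[c] = feats[mask].mean(0)
        labels = nearest_code(feats, codebook)
    return labels
```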
arXiv Detail & Related papers (2022-12-18T10:41:55Z) - Deploying self-supervised learning in the wild for hybrid automatic speech recognition [20.03807843795386]
Self-supervised learning (SSL) methods have proven to be very successful in automatic speech recognition (ASR).
We show how to utilize untranscribed audio data in SSL, from data pre-processing to deploying a streaming hybrid ASR model.
arXiv Detail & Related papers (2022-05-17T19:37:40Z) - Audio Self-supervised Learning: A Survey [60.41768569891083]
Self-Supervised Learning (SSL) aims to discover general representations from large-scale data without requiring human annotations.
Its success in the fields of computer vision and natural language processing has prompted its recent adoption into the field of audio and speech processing.
arXiv Detail & Related papers (2022-03-02T15:58:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.