Related papers: CALM: Class-Conditional Sparse Attention Vectors for Large Audio-Language Models

URL: http://arxiv.org/abs/2602.07077v1
Date: Fri, 06 Feb 2026 01:58:29 GMT
Title: CALM: Class-Conditional Sparse Attention Vectors for Large Audio-Language Models
Authors: Videet Mehta, Liming Wang, Hilde Kuehne, Rogerio Feris, James R. Glass, M. Jehanzeb Mirza
Abstract summary: We propose a few-shot classification method that learns class-dependent importance weights over attention heads. Our method consistently outperforms state-of-the-art uniform voting-based approaches.
Score: 42.7207338433098
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large audio-language models (LALMs) exhibit strong zero-shot capabilities in multiple downstream tasks, such as audio question answering (AQA) and abstract reasoning; however, these models still lag behind specialized models for certain discriminative tasks (e.g., audio classification). Recent studies show that sparse subsets of attention heads within an LALM can serve as strong discriminative feature extractors for downstream tasks such as classification via simple voting schemes. However, these methods assign uniform weights to all selected heads, implicitly assuming that each head contributes equally across all semantic categories. In this work, we propose Class-Conditional Sparse Attention Vectors for Large Audio-Language Models, a few-shot classification method that learns class-dependent importance weights over attention heads. This formulation allows individual heads to specialize in distinct semantic categories and to contribute to ensemble predictions proportionally to their estimated reliability. Experiments on multiple few-shot audio and audio-visual classification benchmarks and tasks demonstrate that our method consistently outperforms state-of-the-art uniform voting-based approaches, with absolute gains of up to 14.52%, 1.53%, and 8.35% for audio classification, audio-visual classification, and spoofing detection, respectively.
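The contrast the abstract draws between uniform voting and class-dependent head weights can be illustrated with a small sketch. The abstract does not say how the weights are learned, so the code below assumes a simple stand-in: each head's weight for a class is its smoothed precision on a few-shot support set when it votes for that class. The function names, the `eps` smoothing parameter, and the toy data are all hypothetical and not taken from the paper.

```python
def uniform_vote(votes, num_classes):
    """Baseline: every head's predicted class counts as one equal vote."""
    scores = [0.0] * num_classes
    for c in votes:
        scores[c] += 1.0
    return max(range(num_classes), key=lambda c: scores[c])


def fit_class_weights(support_votes, support_labels, num_heads, num_classes, eps=1.0):
    """Assumed weighting rule (not the paper's): w[h][c] is the smoothed
    fraction of support examples on which head h voted c and c was correct."""
    correct = [[0.0] * num_classes for _ in range(num_heads)]
    total = [[0.0] * num_classes for _ in range(num_heads)]
    for votes, label in zip(support_votes, support_labels):
        for h, c in enumerate(votes):
            total[h][c] += 1.0
            if c == label:
                correct[h][c] += 1.0
    return [
        [(correct[h][c] + eps) / (total[h][c] + eps * num_classes)
         for c in range(num_classes)]
        for h in range(num_heads)
    ]


def weighted_vote(votes, weights, num_classes):
    """Each head adds its class-specific weight to the class it voted for."""
    scores = [0.0] * num_classes
    for h, c in enumerate(votes):
        scores[c] += weights[h][c]
    return max(range(num_classes), key=lambda c: scores[c])


# Toy support set: 3 heads, 2 classes. Head 0 is always right;
# heads 1 and 2 vote for class 1 regardless of the true label.
support_votes = [[0, 1, 1], [0, 1, 1], [0, 1, 1], [1, 1, 1]]
support_labels = [0, 0, 0, 1]
weights = fit_class_weights(support_votes, support_labels, num_heads=3, num_classes=2)

query_votes = [0, 1, 1]
uniform_pred = uniform_vote(query_votes, num_classes=2)
weighted_pred = weighted_vote(query_votes, weights, num_classes=2)
```

On this toy query, uniform voting follows the two unreliable heads and predicts class 1, while the weighted vote predicts class 0 because head 0's weight for class 0 (0.8) exceeds the combined weight of the other two heads for class 1 (about 0.67).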

Related papers

Self-Ensemble Post Learning for Noisy Domain Generalization [18.4218677759831]
This paper explores how to make existing methods work again when they encounter noise. We find that the latent features inside the model have certain discriminative capabilities. We propose the Self-Ensemble Post Learning approach to diversify the features that can be leveraged.
arXiv Detail & Related papers (2025-12-11T17:09:35Z)
CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors. We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models. In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
Multi-label Zero-Shot Audio Classification with Temporal Attention [8.518434546898524]
The present study introduces a method to perform multi-label zero-shot audio classification. We adapt temporal attention to assign importance weights to different audio segments based on their acoustic and semantic compatibility. Our results show that temporal attention enhances the zero-shot audio classification performance in multi-label scenario.
arXiv Detail & Related papers (2024-08-31T09:49:41Z)
Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs [3.8300818830608345]
Multi-modal contrastive learning strategies for audio and text have rapidly gained interest. The ability of these models to understand natural language and temporal relations is still a largely unexplored and open field for research. We propose to equip the multi-modal ALMs with temporal understanding without losing their inherent prior capabilities of audio-language tasks with a temporal instillation method TeminAL.
arXiv Detail & Related papers (2024-08-17T18:53:17Z)
Audio Contrastive-based Fine-tuning: Decoupling Representation Learning and Classification [26.82307246813389]
We propose a disentangled two-stage framework that separates representation refinement from downstream evaluation. First, we employ a "contrastive-tuning" stage to explicitly improve the geometric structure of the model's embedding space. We then introduce a dual-probe evaluation protocol to assess the quality of these refined representations from a geometric perspective.
arXiv Detail & Related papers (2023-09-21T08:59:13Z)
Anomaly Detection using Ensemble Classification and Evidence Theory [62.997667081978825]
We present a novel approach for novelty detection using ensemble classification and evidence theory. A pool selection strategy is presented to build a solid ensemble classifier. We use uncertainty for the anomaly detection approach.
arXiv Detail & Related papers (2022-12-23T00:50:41Z)
SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data. Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring [60.55025339250815]
We propose a novel deep learning technique for non-native ASS, called speaker-conditioned hierarchical modeling. In our technique, we take advantage of the fact that oral proficiency tests rate multiple responses for a candidate. We extract context from these responses and feed them as additional speaker-specific context to our network to score a particular response.
arXiv Detail & Related papers (2021-08-30T07:00:28Z)
Rethinking Generative Zero-Shot Learning: An Ensemble Learning Perspective for Recognising Visual Patches [52.67723703088284]
We propose a novel framework called multi-patch generative adversarial nets (MPGAN). MPGAN synthesises local patch features and labels unseen classes with a novel weighted voting strategy. MPGAN has significantly greater accuracy than state-of-the-art methods.
arXiv Detail & Related papers (2020-07-27T05:49:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.