Improving Audio Classification by Transitioning from Zero- to Few-Shot
- URL: http://arxiv.org/abs/2507.20036v1
- Date: Sat, 26 Jul 2025 18:40:09 GMT
- Title: Improving Audio Classification by Transitioning from Zero- to Few-Shot
- Authors: James Taylor, Wolfgang Mack
- Abstract summary: State-of-the-art audio classification often employs a zero-shot approach. This paper examines few-shot methods designed to improve classification accuracy beyond the zero-shot approach.
- Score: 4.31241676251521
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-art audio classification often employs a zero-shot approach, which involves comparing audio embeddings with embeddings from text describing the respective audio class. These embeddings are usually generated by neural networks trained through contrastive learning to align audio and text representations. Identifying the optimal text description for an audio class is challenging, particularly when the class comprises a wide variety of sounds. This paper examines few-shot methods designed to improve classification accuracy beyond the zero-shot approach. Specifically, audio embeddings are grouped by class and processed to replace the inherently noisy text embeddings. Our results demonstrate that few-shot classification typically outperforms the zero-shot baseline.
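To make the contrast concrete, here is a minimal sketch assuming audio and text embeddings produced by a contrastively trained audio-text encoder (e.g., a CLAP-style model); the random vectors and class handling below are synthetic placeholders, not the paper's code or data.

```python
import numpy as np

def normalize(x):
    # L2-normalize embeddings along the last axis.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(audio_emb, text_embs):
    # Zero-shot: pick the class whose text embedding is most
    # similar (cosine) to the audio embedding.
    scores = normalize(audio_emb) @ normalize(text_embs).T
    return int(np.argmax(scores))

def few_shot_prototypes(support_embs_per_class):
    # Few-shot: replace each class's text embedding with the mean of a
    # handful of labelled audio embeddings from that class.
    return normalize(np.stack([e.mean(axis=0) for e in support_embs_per_class]))

def few_shot_classify(audio_emb, prototypes):
    scores = normalize(audio_emb) @ prototypes.T
    return int(np.argmax(scores))

# Toy example with random placeholder embeddings.
rng = np.random.default_rng(0)
dim, n_classes, shots = 512, 3, 5
text_embs = rng.normal(size=(n_classes, dim))             # stand-ins for "a sound of ..." embeddings
support = [rng.normal(size=(shots, dim)) for _ in range(n_classes)]
query = support[1].mean(axis=0) + 0.1 * rng.normal(size=dim)

print("zero-shot:", zero_shot_classify(query, text_embs))
print("few-shot :", few_shot_classify(query, few_shot_prototypes(support)))
```

The only difference between the two settings is how the class anchors are obtained (text embeddings versus averaged support audio embeddings); the cosine-similarity classification step is identical.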
Related papers
- TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining [3.5570874721859016]
We propose a frame-wise contrastive training strategy that learns to align text descriptions with temporal regions in an audio recording. Our model has better temporal text-audio alignment abilities compared to models trained only on global captions when evaluated on the AudioSet Strong benchmark.
arXiv Detail & Related papers (2025-05-12T14:30:39Z)
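To make the frame-wise idea in the TACOS summary above more concrete, here is a minimal sketch of an InfoNCE-style objective between per-frame audio embeddings and temporally aligned caption embeddings. The tensor shapes, temperature value, and plain cross-entropy formulation are assumptions for illustration, not the TACOS implementation.

```python
import torch
import torch.nn.functional as F

def frame_wise_contrastive_loss(frame_embs, caption_embs, active_caption_idx, temperature=0.07):
    """frame_embs:        (T, D) per-frame audio embeddings
    caption_embs:         (C, D) embeddings of the captions in the batch
    active_caption_idx:   (T,)   index of the caption active in each frame
    Pulls each frame towards its temporally aligned caption and pushes it
    away from the other captions (an InfoNCE-style objective)."""
    frame_embs = F.normalize(frame_embs, dim=-1)
    caption_embs = F.normalize(caption_embs, dim=-1)
    logits = frame_embs @ caption_embs.T / temperature   # (T, C) frame-caption similarities
    return F.cross_entropy(logits, active_caption_idx)

# Toy shapes: 10 frames, 4 captions, 256-dim embeddings.
loss = frame_wise_contrastive_loss(
    torch.randn(10, 256), torch.randn(4, 256), torch.randint(0, 4, (10,)))
print(loss.item())
```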
- Classifier-Guided Captioning Across Modalities [69.75111271002137]
We introduce a method to adapt captioning networks to the semantics of alternative settings, such as capturing audibility in audio captioning. Our framework consists of two main components: (i) a frozen captioning system incorporating a language model (LM), and (ii) a text classifier that guides the captioning system. Notably, when combined with an existing zero-shot audio captioning system, our framework improves its quality and sets state-of-the-art performance in zero-shot audio captioning.
arXiv Detail & Related papers (2025-01-03T18:09:26Z)
- A sound description: Exploring prompt templates and class descriptions to enhance zero-shot audio classification [7.622135228307756]
We explore alternative prompt templates for zero-shot audio classification, demonstrating the existence of higher-performing options.
We show that prompting with class descriptions leads to state-of-the-art results in zero-shot audio classification across major ambient sound datasets.
arXiv Detail & Related papers (2024-09-19T11:27:50Z)
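To illustrate the prompt-template idea from the entry above, a small sketch: the class name is inserted into several templates, the resulting text embeddings are averaged into a class anchor, and classification proceeds by cosine similarity. The templates and the placeholder `embed_text` encoder are assumptions for illustration, not the paper's exact recipe.

```python
import numpy as np

def embed_text(text, dim=512):
    # Placeholder for a real text encoder (e.g., the text tower of a
    # contrastive audio-text model); maps the string to a pseudo-random
    # unit vector so the example runs end to end.
    local = np.random.default_rng(abs(hash(text)) % (2**32))
    v = local.normal(size=dim)
    return v / np.linalg.norm(v)

TEMPLATES = [
    "this is a sound of {}.",
    "a recording of {}.",
    "{} can be heard.",
]

def class_text_embedding(class_name):
    # Average the embeddings of several prompt variants (or a longer
    # class description) to obtain a more robust class anchor.
    embs = np.stack([embed_text(t.format(class_name)) for t in TEMPLATES])
    return embs.mean(axis=0)

classes = ["dog barking", "rain", "siren"]
class_embs = np.stack([class_text_embedding(c) for c in classes])

# Stand-in for an audio embedding that happens to lie near the first class.
rng = np.random.default_rng(0)
audio_emb = class_embs[0] + 0.05 * rng.normal(size=512)
print(classes[int(np.argmax(class_embs @ audio_emb))])
```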
- Multi-label Zero-Shot Audio Classification with Temporal Attention [8.518434546898524]
The present study introduces a method to perform multi-label zero-shot audio classification.
We adapt temporal attention to assign importance weights to different audio segments based on their acoustic and semantic compatibility.
Our results show that temporal attention enhances zero-shot audio classification performance in the multi-label scenario.
arXiv Detail & Related papers (2024-08-31T09:49:41Z)
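A rough sketch of segment-level temporal attention as described in the entry above: each segment embedding is scored against a class (text) embedding, the scores become attention weights over time, and a weighted clip-level score is produced per class. The pooling formula and the zero threshold are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_attention_scores(segment_embs, class_embs):
    """segment_embs: (T, D) embeddings of consecutive audio segments
    class_embs:      (C, D) class (text) embeddings
    Returns one clip-level score per class; segments that are more
    compatible with a class contribute more to its score."""
    seg = segment_embs / np.linalg.norm(segment_embs, axis=-1, keepdims=True)
    cls = class_embs / np.linalg.norm(class_embs, axis=-1, keepdims=True)
    sim = seg @ cls.T                        # (T, C) segment-class compatibility
    scores = []
    for c in range(cls.shape[0]):
        w = softmax(sim[:, c])               # attention weights over time
        scores.append(float(w @ sim[:, c]))  # attention-weighted pooling
    return np.array(scores)

rng = np.random.default_rng(1)
clip_scores = temporal_attention_scores(rng.normal(size=(8, 256)),
                                        rng.normal(size=(4, 256)))
predicted_labels = clip_scores > 0.0         # multi-label: threshold per class
print(clip_scores, predicted_labels)
```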
- Listenable Maps for Zero-Shot Audio Classifiers [12.446324804274628]
We introduce LMAC-Z (Listenable Maps for Audio) for the first time in the Zero-Shot context. We show that our method produces meaningful explanations that correlate well with different text prompts.
arXiv Detail & Related papers (2024-05-27T19:25:42Z)
- Zero-shot audio captioning with audio-language model guidance and audio context keywords [59.58331215337357]
We propose ZerAuCap, a novel framework for summarising general audio signals in a text caption without requiring task-specific training.
Our framework exploits a pre-trained large language model (LLM) to generate the text, guided by a pre-trained audio-language model, to produce captions.
Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets.
arXiv Detail & Related papers (2023-11-14T18:55:48Z)
- Zero-Shot Audio Captioning via Audibility Guidance [57.70351255180495]
We propose three desiderata for captioning audio -- (i) fluency of the generated text, (ii) faithfulness of the generated text to the input audio, and (iii) audibility.
Our method is a zero-shot method, i.e., we do not learn to perform captioning.
We present our results on the AudioCap dataset, demonstrating that audibility guidance significantly enhances performance compared to the baseline.
arXiv Detail & Related papers (2023-09-07T17:45:58Z)
- Unsupervised Improvement of Audio-Text Cross-Modal Representations [19.960695758478153]
We study unsupervised approaches to improve the learning framework of such representations with unpaired text and audio.
We show that when domain-specific curation is used in conjunction with a soft-labeled contrastive loss, we are able to obtain significant improvement in terms of zero-shot classification performance.
arXiv Detail & Related papers (2023-05-03T02:30:46Z)
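A hedged sketch of a soft-labeled contrastive loss in the spirit of the entry above: rather than one-hot InfoNCE targets, each audio item is matched against all captions in the batch with soft targets derived from caption-caption similarity. The particular way the soft targets are built here is an assumption for the sketch, not necessarily the paper's formulation.

```python
import torch
import torch.nn.functional as F

def soft_labeled_contrastive_loss(audio_embs, text_embs, temperature=0.07, label_temp=0.1):
    """audio_embs, text_embs: (B, D) paired batch of embeddings.
    Standard InfoNCE uses an identity (one-hot) target matrix; here the
    targets are softened using similarities between the captions themselves."""
    a = F.normalize(audio_embs, dim=-1)
    t = F.normalize(text_embs, dim=-1)
    logits = a @ t.T / temperature                               # (B, B) audio-to-text logits
    with torch.no_grad():
        soft_targets = F.softmax(t @ t.T / label_temp, dim=-1)   # (B, B) soft labels
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()

loss = soft_labeled_contrastive_loss(torch.randn(16, 256), torch.randn(16, 256))
print(loss.item())
```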
- Towards Generating Diverse Audio Captions via Adversarial Training [33.76154801580643]
We propose a conditional generative adversarial network (C-GAN) to improve diversity of audio captioning systems.
A caption generator and two hybrid discriminators compete and are learned jointly, where the caption generator can be any standard encoder-decoder captioning model used to generate captions.
The results show that our proposed model can generate captions with better diversity as compared to state-of-the-art methods.
arXiv Detail & Related papers (2022-12-05T05:06:19Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information [55.75018546938499]
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts the power set encoded labels.
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
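Power-set label encoding, mentioned in the SEND summary above, is easy to illustrate in isolation: every subset of active speakers maps to a single class index, so overlapping speech becomes an ordinary multi-class target per frame. The bitmask encoding below is a generic illustration, not the SEND model.

```python
from itertools import combinations

def powerset_label(active_speakers, num_speakers=3):
    # Encode the set of active speakers as a bitmask index:
    # {} -> 0, {0} -> 1, {1} -> 2, {0, 1} -> 3, ...
    return sum(1 << s for s in set(active_speakers) if s < num_speakers)

def all_powerset_classes(num_speakers=3):
    # Enumerate every subset to show the full label space (2**N classes).
    speakers = range(num_speakers)
    return {frozenset(c): powerset_label(c, num_speakers)
            for k in range(num_speakers + 1)
            for c in combinations(speakers, k)}

print(powerset_label({0, 2}))    # speakers 0 and 2 overlap -> class 5
print(all_powerset_classes(3))   # 8 classes for 3 speakers
```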
- Using multiple reference audios and style embedding constraints for speech synthesis [68.62945852651383]
The proposed model can improve the speech naturalness and content quality with multiple reference audios.
The model can also outperform the baseline model in ABX preference tests of style similarity.
arXiv Detail & Related papers (2021-10-09T04:24:29Z)
- Leveraging Acoustic and Linguistic Embeddings from Pretrained speech and language Models for Intent Classification [81.80311855996584]
We propose a novel intent classification framework that employs acoustic features extracted from a pretrained speech recognition system and linguistic features learned from a pretrained language model.
We achieve 90.86% and 99.07% accuracy on ATIS and Fluent speech corpus, respectively.
arXiv Detail & Related papers (2021-02-15T07:20:06Z)
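As a rough sketch of the fusion idea in the entry above: an utterance is represented by concatenating an acoustic embedding (from a pretrained speech model) with a linguistic embedding of its transcript (from a pretrained language model), and a small classifier predicts the intent. The embedding dimensions, synthetic data, and softmax-regression head are placeholder assumptions, not the paper's architecture or reported numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse(acoustic_emb, linguistic_emb):
    # Concatenate the two modality embeddings into one utterance vector.
    return np.concatenate([acoustic_emb, linguistic_emb])

# Toy data: 200 utterances, 3 intents, placeholder 128/256-dim embeddings.
n, d_ac, d_lx, n_intents = 200, 128, 256, 3
labels = rng.integers(0, n_intents, size=n)
X = np.stack([fuse(rng.normal(loc=labels[i], size=d_ac),
                   rng.normal(loc=labels[i], size=d_lx)) for i in range(n)])

# Minimal softmax-regression head trained with gradient descent.
W = np.zeros((X.shape[1], n_intents))
for _ in range(200):
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    onehot = np.eye(n_intents)[labels]
    W -= 0.01 * X.T @ (p - onehot) / n

print("train accuracy:", ((X @ W).argmax(axis=1) == labels).mean())
```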
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.