A sound description: Exploring prompt templates and class descriptions to enhance zero-shot audio classification
- URL: http://arxiv.org/abs/2409.13676v1
- Date: Thu, 19 Sep 2024 11:27:50 GMT
- Title: A sound description: Exploring prompt templates and class descriptions to enhance zero-shot audio classification
- Authors: Michel Olvera, Paraskevas Stamatiadis, Slim Essid
- Abstract summary: We explore alternative prompt templates for zero-shot audio classification, demonstrating the existence of higher-performing options.
We show that prompting with class descriptions leads to state-of-the-art results in zero-shot audio classification across major ambient sound datasets.
- Score: 7.622135228307756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-text models trained via contrastive learning offer a practical approach to perform audio classification through natural language prompts, such as "this is a sound of" followed by category names. In this work, we explore alternative prompt templates for zero-shot audio classification, demonstrating the existence of higher-performing options. First, we find that the formatting of the prompts significantly affects performance so that simply prompting the models with properly formatted class labels performs competitively with optimized prompt templates and even prompt ensembling. Moreover, we look into complementing class labels by audio-centric descriptions. By leveraging large language models, we generate textual descriptions that prioritize acoustic features of sound events to disambiguate between classes, without extensive prompt engineering. We show that prompting with class descriptions leads to state-of-the-art results in zero-shot audio classification across major ambient sound datasets. Remarkably, this method requires no additional training and remains fully zero-shot.
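The scoring scheme the abstract describes can be sketched in a few lines: embed the audio clip and the prompted class descriptions with a contrastively trained audio-text model, then pick the class whose text embedding is most cosine-similar to the audio embedding. The two encoder functions below are random placeholders standing in for a real CLAP-style model; only the classification logic is the point.

```python
import numpy as np

# Placeholder encoders standing in for a contrastively trained
# audio-text model (e.g. a CLAP-style encoder). Any model exposing
# these two calls with L2-normalized outputs would slot in here.
def embed_text(prompts):
    """Return one unit-norm embedding per prompt (random stand-in)."""
    rng = np.random.default_rng(0)
    e = rng.normal(size=(len(prompts), 512))
    return e / np.linalg.norm(e, axis=1, keepdims=True)

def embed_audio(waveform):
    """Return a unit-norm audio embedding (random stand-in)."""
    rng = np.random.default_rng(1)
    e = rng.normal(size=512)
    return e / np.linalg.norm(e)

def zero_shot_classify(waveform, class_descriptions):
    """Pick the class whose prompted description best matches the audio."""
    prompts = [f"this is a sound of {d}" for d in class_descriptions]
    text_emb = embed_text(prompts)        # (C, D)
    audio_emb = embed_audio(waveform)     # (D,)
    scores = text_emb @ audio_emb         # cosine similarity (unit norms)
    return class_descriptions[int(np.argmax(scores))], scores
```

Swapping the plain class names in `class_descriptions` for LLM-generated acoustic descriptions, as the paper proposes, changes only the strings fed to the text encoder; no retraining is involved.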
Related papers
- Improving Audio Classification by Transitioning from Zero- to Few-Shot [4.31241676251521]
State-of-the-art audio classification often employs a zero-shot approach.
This paper examines few-shot methods designed to improve classification accuracy beyond the zero-shot approach.
arXiv Detail & Related papers (2025-07-26T18:40:09Z)
- TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining [3.5570874721859016]
We propose a frame-wise contrastive training strategy that learns to align text descriptions with temporal regions in an audio recording.
Our model has better temporal text-audio alignment abilities compared to models trained only on global captions when evaluated on the AudioSet Strong benchmark.
arXiv Detail & Related papers (2025-05-12T14:30:39Z)
- Classifier-Guided Captioning Across Modalities [69.75111271002137]
We introduce a method to adapt captioning networks to the semantics of alternative settings, such as capturing audibility in audio captioning.
Our framework consists of two main components: (i) a frozen captioning system incorporating a language model (LM), and (ii) a text classifier that guides the captioning system.
Notably, when combined with an existing zero-shot audio captioning system, our framework improves its quality and sets state-of-the-art performance in zero-shot audio captioning.
arXiv Detail & Related papers (2025-01-03T18:09:26Z)
- TSPE: Task-Specific Prompt Ensemble for Improved Zero-Shot Audio Classification [44.101538324619604]
TSPE (Task-Specific Prompt Ensemble) is a training-free hard prompting method that boosts ALEs' zero-shot performance.
We leverage label information to identify suitable sound attributes, such as "loud" and "feeble", and appropriate sound sources, such as "tunnel" and "street".
To enhance audio-text alignment, we perform prompt ensemble across TSPE-generated task-specific prompts.
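Prompt ensembling of the kind TSPE describes is typically done by averaging the text embeddings of several prompt variants for a class and renormalizing, so a single cosine score approximates an ensemble over prompts. Below is a minimal sketch; `_toy_embed_text` is a random placeholder for a real text encoder, and the templates are illustrative, not the paper's.

```python
import numpy as np

def ensemble_text_embeddings(class_name, templates, embed_text):
    """Average several prompt embeddings for one class, then renormalize.

    Cosine scoring against the averaged embedding approximates an
    ensemble over the individual prompts.
    """
    prompts = [t.format(class_name) for t in templates]
    emb = embed_text(prompts)          # (T, D), unit-norm rows
    mean = emb.mean(axis=0)
    return mean / np.linalg.norm(mean)

# Placeholder standing in for an audio-language model's text tower.
def _toy_embed_text(prompts):
    rng = np.random.default_rng(42)
    e = rng.normal(size=(len(prompts), 64))
    return e / np.linalg.norm(e, axis=1, keepdims=True)

templates = [
    "this is a sound of {}",
    "a loud {} can be heard in a street",
    "the feeble sound of a {} in a tunnel",
]
emb = ensemble_text_embeddings("car horn", templates, _toy_embed_text)
```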
arXiv Detail & Related papers (2024-12-31T11:27:17Z)
- Vision-Language Models are Strong Noisy Label Detectors [76.07846780815794]
This paper presents a Denoising Fine-Tuning framework, called DeFT, for adapting vision-language models.
DeFT utilizes the robust alignment of textual and visual features pre-trained on millions of auxiliary image-text pairs to sieve out noisy labels.
Experimental results on seven synthetic and real-world noisy datasets validate the effectiveness of DeFT in both noisy label detection and image classification.
arXiv Detail & Related papers (2024-09-29T12:55:17Z)
- ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds [45.534228559551316]
We propose a simple but effective method to improve zero-shot audio classification with CLAP.
We first propose ReCLAP, a CLAP model trained with rewritten audio captions for improved understanding of sounds in the wild.
Our proposed method improves ReCLAP's performance on ZSAC by 1%-18% and outperforms all baselines by 1%-55%.
arXiv Detail & Related papers (2024-09-13T21:58:20Z)
- Zero-shot audio captioning with audio-language model guidance and audio context keywords [59.58331215337357]
We propose ZerAuCap, a novel framework for summarising general audio signals in a text caption without requiring task-specific training.
Our framework exploits a pre-trained large language model (LLM) for generating the text which is guided by a pre-trained audio-language model to produce captions.
Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets.
arXiv Detail & Related papers (2023-11-14T18:55:48Z)
- Unsupervised Improvement of Audio-Text Cross-Modal Representations [19.960695758478153]
We study unsupervised approaches to improve the learning framework of such representations with unpaired text and audio.
We show that when domain-specific curation is used in conjunction with a soft-labeled contrastive loss, we are able to obtain significant improvement in terms of zero-shot classification performance.
arXiv Detail & Related papers (2023-05-03T02:30:46Z)
- SemanticAC: Semantics-Assisted Framework for Audio Classification [13.622344835167997]
We propose SemanticAC, a semantics-assisted framework for Audio Classification.
We employ a language model to extract abundant semantics from labels and optimize the semantic consistency between audio signals and their labels.
Our proposed method consistently outperforms the compared audio classification methods.
arXiv Detail & Related papers (2023-02-12T15:30:28Z)
- CCPrefix: Counterfactual Contrastive Prefix-Tuning for Many-Class Classification [57.62886091828512]
We propose a brand-new prefix-tuning method, Counterfactual Contrastive Prefix-tuning (CCPrefix) for many-class classification.
Basically, an instance-dependent soft prefix, derived from fact-counterfactual pairs in the label space, is leveraged to complement the language verbalizers in many-class classification.
arXiv Detail & Related papers (2022-11-11T03:45:59Z)
- Text2Model: Text-based Model Induction for Zero-shot Image Classification [38.704831945753284]
We address the challenge of building task-agnostic classifiers using only text descriptions.
We generate zero-shot classifiers using a hypernetwork that receives class descriptions and outputs a multi-class model.
We evaluate this approach in a series of zero-shot classification tasks, for image, point-cloud, and action recognition, using a range of text descriptions.
arXiv Detail & Related papers (2022-10-27T05:19:55Z)
- Zero-Shot Audio Classification using Image Embeddings [16.115449653258356]
We introduce image embeddings as side information on zero-shot audio classification by using a nonlinear acoustic-semantic projection.
We demonstrate that the image embeddings can be used as semantic information to perform zero-shot audio classification.
arXiv Detail & Related papers (2022-06-10T10:36:56Z)
- Class-aware Sounding Objects Localization via Audiovisual Correspondence [51.39872698365446]
We propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios.
We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas.
Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones.
arXiv Detail & Related papers (2021-12-22T09:34:33Z)
- Knowledgeable Prompt-tuning: Incorporating Knowledge into Prompt Verbalizer for Text Classification [68.3291372168167]
We focus on incorporating external knowledge into the verbalizer, forming knowledgeable prompt-tuning (KPT).
We expand the label word space of the verbalizer using external knowledge bases (KBs) and refine the expanded label word space with the PLM itself before predicting with the expanded label word space.
Experiments on zero and few-shot text classification tasks demonstrate the effectiveness of knowledgeable prompt-tuning.
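The expanded-verbalizer idea can be illustrated as follows: each class is mapped to many label words drawn from an external knowledge base, and a class's score aggregates the masked language model's probabilities over its whole word set. The verbalizer entries and the aggregation by mean below are illustrative assumptions, not the paper's exact refinement procedure.

```python
import numpy as np

# Hypothetical expanded verbalizer: each class maps to several label
# words drawn from an external knowledge base (words are illustrative).
VERBALIZER = {
    "science": ["physics", "chemistry", "biology", "science"],
    "sports":  ["football", "tennis", "athletics", "sports"],
}

def class_scores(word_probs, verbalizer):
    """Aggregate the PLM's mask-position word probabilities per class.

    word_probs: dict mapping a label word to the model's probability of
    predicting it at the prompt's mask position; missing words score 0.
    """
    return {
        cls: float(np.mean([word_probs.get(w, 0.0) for w in words]))
        for cls, words in verbalizer.items()
    }
```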
arXiv Detail & Related papers (2021-08-04T13:00:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.