ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds
- URL: http://arxiv.org/abs/2409.09213v1
- Date: Fri, 13 Sep 2024 21:58:20 GMT
- Title: ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds
- Authors: Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha
- Abstract summary: We propose a simple but effective method to improve zero-shot audio classification with CLAP.
We first propose ReCLAP, a CLAP model trained with rewritten audio captions for improved understanding of sounds in the wild.
Our proposed method improves ReCLAP's performance on ZSAC by 1%-18% and outperforms all baselines by 1%-55%.
- Score: 45.534228559551316
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-vocabulary audio-language models, like CLAP, offer a promising approach for zero-shot audio classification (ZSAC) by enabling classification with any arbitrary set of categories specified with natural language prompts. In this paper, we propose a simple but effective method to improve ZSAC with CLAP. Specifically, we shift from the conventional method of using prompts with abstract category labels (e.g., Sound of an organ) to prompts that describe sounds using their inherent descriptive features in a diverse context (e.g., The organ's deep and resonant tones filled the cathedral.). To achieve this, we first propose ReCLAP, a CLAP model trained with rewritten audio captions for improved understanding of sounds in the wild. These rewritten captions describe each sound event in the original caption using its unique discriminative characteristics. ReCLAP outperforms all baselines on both multi-modal audio-text retrieval and ZSAC. Next, to improve zero-shot audio classification with ReCLAP, we propose prompt augmentation. In contrast to the traditional method of employing hand-written template prompts, we generate custom prompts for each unique label in the dataset. These custom prompts first describe the sound event in the label and then situate it in diverse scenes. Our proposed method improves ReCLAP's performance on ZSAC by 1%-18% and outperforms all baselines by 1%-55%.
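The prompt-augmentation idea is easy to sketch. Below is a minimal, hypothetical illustration (not the authors' implementation): `encode_text` and `encode_audio` stand in for the text and audio encoders of a CLAP-style model and simply return random unit vectors so the snippet stays self-contained; the descriptive prompts are made-up examples in the spirit of the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholders for a CLAP-style model's encoders: they ignore their input and
# return random unit vectors, purely so the sketch runs without a checkpoint.
def encode_text(text: str) -> np.ndarray:
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

def encode_audio(waveform: np.ndarray) -> np.ndarray:
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

# Instead of one abstract template ("Sound of an organ"), each label gets
# several descriptive prompts that place the sound in diverse scenes.
label_prompts = {
    "organ": [
        "The organ's deep and resonant tones filled the cathedral.",
        "A pipe organ plays sustained, reedy chords during a service.",
    ],
    "dog bark": [
        "A dog barks sharply and repeatedly in a backyard.",
        "The short, gruff bark of a dog echoes down the street.",
    ],
}

def classify(waveform: np.ndarray) -> str:
    audio_emb = encode_audio(waveform)
    scores = {}
    for label, prompts in label_prompts.items():
        # Prompt ensemble: average the text embeddings of a label's prompts.
        text_embs = np.stack([encode_text(p) for p in prompts])
        label_emb = text_embs.mean(axis=0)
        label_emb /= np.linalg.norm(label_emb)
        scores[label] = float(audio_emb @ label_emb)  # cosine similarity
    return max(scores, key=scores.get)

print(classify(np.zeros(48000)))  # dummy 1-second waveform at 48 kHz
```

Swapping the placeholder encoders for a pretrained CLAP (or ReCLAP) checkpoint would turn this sketch into an actual zero-shot classifier; the scoring logic itself does not change.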
Related papers
- Improving Audio Classification by Transitioning from Zero- to Few-Shot [4.31241676251521]
State-of-the-art audio classification often employs a zero-shot approach.
This paper examines few-shot methods designed to improve classification accuracy beyond the zero-shot approach.
arXiv Detail & Related papers (2025-07-26T18:40:09Z)
- TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining [3.5570874721859016]
We propose a frame-wise contrastive training strategy that learns to align text descriptions with temporal regions in an audio recording.
Our model has better temporal text-audio alignment abilities compared to models trained only on global captions when evaluated on the AudioSet Strong benchmark.
arXiv Detail & Related papers (2025-05-12T14:30:39Z)
- Classifier-Guided Captioning Across Modalities [69.75111271002137]
We introduce a method to adapt captioning networks to the semantics of alternative settings, such as capturing audibility in audio captioning.
Our framework consists of two main components: (i) a frozen captioning system incorporating a language model (LM), and (ii) a text classifier that guides the captioning system.
Notably, when combined with an existing zero-shot audio captioning system, our framework improves its quality and sets state-of-the-art performance in zero-shot audio captioning.
arXiv Detail & Related papers (2025-01-03T18:09:26Z)
- TSPE: Task-Specific Prompt Ensemble for Improved Zero-Shot Audio Classification [44.101538324619604]
TSPE (Task-Specific Prompt Ensemble) is a training-free hard prompting method that boosts ALEs' zero-shot performance.
We leverage label information to identify suitable sound attributes, such as "loud" and "feeble", and appropriate sound sources, such as "tunnel" and "street".
To enhance audio-text alignment, we perform prompt ensemble across TSPE-generated task-specific prompts.
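A rough sketch of how such task-specific prompts might be assembled is shown below; the attribute and source lists and the templates are illustrative assumptions based on this summary, not the authors' code.

```python
# Hypothetical per-label sound attributes and sound sources.
label_info = {
    "siren": {"attributes": ["loud", "wailing"], "sources": ["street", "tunnel"]},
    "whisper": {"attributes": ["feeble", "soft"], "sources": ["library", "bedroom"]},
}

def task_specific_prompts(label: str) -> list[str]:
    info = label_info[label]
    prompts = [f"This is a sound of {label}."]
    prompts += [f"This is a {attr} sound of {label}." for attr in info["attributes"]]
    prompts += [f"A sound of {label} coming from a {src}." for src in info["sources"]]
    return prompts

for p in task_specific_prompts("siren"):
    print(p)
```

The text embeddings of these prompts would then be averaged into a single label embedding (the prompt ensemble) and compared against the audio embedding, as in the earlier sketch.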
arXiv Detail & Related papers (2024-12-31T11:27:17Z)
- Do Audio-Language Models Understand Linguistic Variations? [42.17718387132912]
Open-vocabulary audio language models (ALMs) represent a promising new paradigm for audio-text retrieval using natural language queries.
We propose RobustCLAP, a novel and compute-efficient technique to learn audio-language representations that are robust to linguistic variations.
arXiv Detail & Related papers (2024-10-21T20:55:33Z)
- CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
- A sound description: Exploring prompt templates and class descriptions to enhance zero-shot audio classification [7.622135228307756]
We explore alternative prompt templates for zero-shot audio classification, demonstrating the existence of higher-performing options.
We show that prompting with class descriptions leads to state-of-the-art results in zero-shot audio classification across major ambient sound datasets.
arXiv Detail & Related papers (2024-09-19T11:27:50Z)
- Listenable Maps for Zero-Shot Audio Classifiers [12.446324804274628]
We introduce LMAC-Z (Listenable Maps for Audio) for the first time in the Zero-Shot context.
We show that our method produces meaningful explanations that correlate well with different text prompts.
arXiv Detail & Related papers (2024-05-27T19:25:42Z)
- Zero-shot audio captioning with audio-language model guidance and audio context keywords [59.58331215337357]
We propose ZerAuCap, a novel framework for summarising general audio signals in a text caption without requiring task-specific training.
Our framework exploits a pre-trained large language model (LLM) to generate the caption text, with guidance from a pre-trained audio-language model.
Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets.
arXiv Detail & Related papers (2023-11-14T18:55:48Z)
- Gen-Z: Generative Zero-Shot Text Classification with Contextualized Label Descriptions [50.92702206798324]
We propose a generative prompting framework for zero-shot text classification.
GEN-Z measures the LM likelihood of input text conditioned on natural language descriptions of labels.
We show that zero-shot classification with simple contextualization of the data source consistently outperforms both zero-shot and few-shot baselines.
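GEN-Z-style scoring can be sketched with any causal language model: verbalize each label as a natural-language description and pick the label under which the input text is most likely. The snippet below uses GPT-2 from Hugging Face transformers as a stand-in scorer; the label descriptions are invented for illustration and are not taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Hypothetical contextualized label descriptions.
label_descriptions = {
    "positive": "The following movie review expresses a positive opinion:",
    "negative": "The following movie review expresses a negative opinion:",
}

@torch.no_grad()
def label_score(text: str, description: str) -> float:
    """Mean log-likelihood of `text` conditioned on the label description."""
    prompt_ids = tokenizer(description + " ", return_tensors="pt").input_ids
    text_ids = tokenizer(text, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, text_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # score only the input-text tokens
    loss = model(input_ids=input_ids, labels=labels).loss  # mean NLL per scored token
    return -loss.item()

review = "An absolute delight from start to finish."
scores = {lbl: label_score(review, desc) for lbl, desc in label_descriptions.items()}
print(max(scores, key=scores.get))
```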
arXiv Detail & Related papers (2023-11-13T07:12:57Z)
- Generalized zero-shot audio-to-intent classification [7.76114116227644]
We propose a generalized zero-shot audio-to-intent classification framework with only a few sample text sentences per intent.
We leverage a neural audio synthesizer to create audio embeddings for sample text utterances.
Our multimodal training approach improves the accuracy of zero-shot intent classification on unseen intents of SLURP by 2.75% and 18.2%.
arXiv Detail & Related papers (2023-11-04T18:55:08Z)
- CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models [41.98394436858637]
We propose CompA, a collection of two expert-annotated benchmarks with a majority of real-world audio samples.
We first show that current ALMs perform only marginally better than random chance, thereby struggling with compositional reasoning.
Next, we propose CompA-CLAP, where we fine-tune CLAP using a novel learning method to improve its compositional reasoning abilities.
arXiv Detail & Related papers (2023-10-12T22:43:38Z)
- Efficient Audio Captioning Transformer with Patchout and Text Guidance [74.59739661383726]
We propose a full Transformer architecture that utilizes Patchout as proposed in [1], significantly reducing the computational complexity and avoiding overfitting.
The caption generation is partly conditioned on textual AudioSet tags extracted by a pre-trained classification model.
Our proposed method received the Judges' Award at Task 6A of the DCASE Challenge 2022.
arXiv Detail & Related papers (2023-04-06T07:58:27Z)
- What does a platypus look like? Generating customized prompts for zero-shot image classification [52.92839995002636]
This work introduces a simple method to generate higher accuracy prompts without relying on any explicit knowledge of the task domain.
We leverage the knowledge contained in large language models (LLMs) to generate many descriptive sentences that contain important discriminating characteristics of the image categories.
This approach improves accuracy on a range of zero-shot image classification benchmarks, including over one percentage point gain on ImageNet.
arXiv Detail & Related papers (2022-09-07T17:27:08Z)