Multi-label Zero-Shot Audio Classification with Temporal Attention
- URL: http://arxiv.org/abs/2409.00408v1
- Date: Sat, 31 Aug 2024 09:49:41 GMT
- Title: Multi-label Zero-Shot Audio Classification with Temporal Attention
- Authors: Duygu Dogan, Huang Xie, Toni Heittola, Tuomas Virtanen
- Abstract summary: The present study introduces a method to perform multi-label zero-shot audio classification.
We adapt temporal attention to assign importance weights to different audio segments based on their acoustic and semantic compatibility.
Our results show that temporal attention enhances zero-shot audio classification performance in the multi-label scenario.
- Score: 8.518434546898524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-shot learning models are capable of classifying new classes by transferring knowledge from the seen classes using auxiliary information. While most existing zero-shot learning methods have focused on single-label classification tasks, the present study introduces a method to perform multi-label zero-shot audio classification. To address the challenge of classifying multi-label sounds while generalizing to unseen classes, we adapt temporal attention. The temporal attention mechanism assigns importance weights to different audio segments based on their acoustic and semantic compatibility, thus enabling the model to capture the varying dominance of different sound classes within an audio sample by focusing on the segments most relevant for each class. This leads to more accurate multi-label zero-shot classification than methods employing temporally aggregated acoustic features without weighting, which treat all audio segments equally. We evaluate our approach on a subset of AudioSet against a zero-shot model using uniformly aggregated acoustic features, a zero-rule baseline, and the proposed method in the supervised scenario. Our results show that temporal attention enhances zero-shot audio classification performance in the multi-label scenario.
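The mechanism described in the abstract can be sketched as follows. This is a minimal illustrative implementation, not the paper's code: the embedding dimensions, the dot-product compatibility score, and the function name are all assumptions made for the example.

```python
# Hypothetical sketch of temporal attention for multi-label zero-shot
# audio classification: per-class importance weights over audio segments,
# followed by class-specific weighted aggregation of acoustic features.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention_scores(segment_emb, class_emb):
    """segment_emb: (T, d) acoustic embeddings for T audio segments.
    class_emb:   (C, d) semantic embeddings for C (possibly unseen) classes.
    Returns one compatibility score per class, shape (C,)."""
    # Acoustic-semantic compatibility of each segment with each class: (C, T)
    compat = class_emb @ segment_emb.T
    # Importance weights over segments, computed separately for each class,
    # so each class attends to the segments most relevant to it
    attn = softmax(compat, axis=1)                  # (C, T)
    # Class-specific weighted aggregation of the segment features
    pooled = attn @ segment_emb                     # (C, d)
    # Final multi-label scores: pooled audio vs. class embedding
    return np.sum(pooled * class_emb, axis=1)       # (C,)

rng = np.random.default_rng(0)
scores = temporal_attention_scores(rng.normal(size=(10, 16)),
                                   rng.normal(size=(5, 16)))
```

Because the attention weights differ per class, each class score is driven by the segments where that class dominates, unlike uniform temporal pooling, which averages all segments equally.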
Related papers
- Heterogeneous sound classification with the Broad Sound Taxonomy and Dataset [6.91815289914328]
This paper explores methodologies for automatically classifying heterogeneous sounds characterized by high intra-class variability.
We construct a dataset through manual annotation to ensure accuracy, diverse representation within each class and relevance in real-world scenarios.
Experimental results illustrate that audio embeddings encoding acoustic and semantic information achieve higher accuracy in the classification task.
arXiv Detail & Related papers (2024-10-01T18:09:02Z)
- Generalized zero-shot audio-to-intent classification [7.76114116227644]
We propose a generalized zero-shot audio-to-intent classification framework with only a few sample text sentences per intent.
We leverage a neural audio synthesizer to create audio embeddings for sample text utterances.
Our multimodal training approach improves the accuracy of zero-shot intent classification on unseen intents of SLURP by 2.75% and 18.2%.
arXiv Detail & Related papers (2023-11-04T18:55:08Z)
- Class-Incremental Grouping Network for Continual Audio-Visual Learning [42.284785756540806]
We propose a class-incremental grouping network (CIGN) that can learn category-wise semantic features to achieve continual audio-visual learning.
We conduct extensive experiments on VGGSound-Instruments, VGGSound-100, and VGG-Sound Sources benchmarks.
Our experimental results demonstrate that the CIGN achieves state-of-the-art audio-visual class-incremental learning performance.
arXiv Detail & Related papers (2023-09-11T07:36:16Z)
- Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
- Anomaly Detection using Ensemble Classification and Evidence Theory [62.997667081978825]
We present a novel approach for anomaly detection using ensemble classification and evidence theory.
A pool selection strategy is presented to build a solid ensemble classifier.
We use uncertainty measures for the anomaly detection approach.
arXiv Detail & Related papers (2022-12-23T00:50:41Z)
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- Play It Back: Iterative Attention for Audio Recognition [104.628661890361]
A key function of auditory cognition is the association of characteristic sounds with their corresponding semantics over time.
We propose an end-to-end attention-based architecture that through selective repetition attends over the most discriminative sounds.
We show that our method can consistently achieve state-of-the-art performance across three audio-classification benchmarks.
arXiv Detail & Related papers (2022-10-20T15:03:22Z)
- Towards Unbiased Multi-label Zero-Shot Learning with Pyramid and Semantic Attention [14.855116554722489]
Multi-label zero-shot learning aims at recognizing multiple unseen labels of classes for each input sample.
We propose a novel framework of unbiased multi-label zero-shot learning, by considering various class-specific regions.
arXiv Detail & Related papers (2022-03-07T15:52:46Z)
- CLASTER: Clustering with Reinforcement Learning for Zero-Shot Action Recognition [52.66360172784038]
We propose a clustering-based model, which considers all training samples at once, instead of optimizing for each instance individually.
We call the proposed method CLASTER and observe that it consistently improves over the state-of-the-art in all standard datasets.
arXiv Detail & Related papers (2021-01-18T12:46:24Z)
- Contrastive Learning of General-Purpose Audio Representations [33.15189569532155]
We introduce COLA, a self-supervised pre-training approach for learning a general-purpose representation of audio.
We build on recent advances in contrastive learning for computer vision and reinforcement learning to design a lightweight, easy-to-implement model of audio.
arXiv Detail & Related papers (2020-10-21T11:56:22Z)
- Rethinking Generative Zero-Shot Learning: An Ensemble Learning Perspective for Recognising Visual Patches [52.67723703088284]
We propose a novel framework called multi-patch generative adversarial nets (MPGAN).
MPGAN synthesises local patch features and labels unseen classes with a novel weighted voting strategy.
MPGAN has significantly greater accuracy than state-of-the-art methods.
arXiv Detail & Related papers (2020-07-27T05:49:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.