Zero-Shot Audio Classification using Image Embeddings
- URL: http://arxiv.org/abs/2206.04984v1
- Date: Fri, 10 Jun 2022 10:36:56 GMT
- Title: Zero-Shot Audio Classification using Image Embeddings
- Authors: Duygu Dogan, Huang Xie, Toni Heittola, Tuomas Virtanen
- Abstract summary: We introduce image embeddings as side information for zero-shot audio classification, using a nonlinear acoustic-semantic projection.
We demonstrate that image embeddings can be used as semantic information to perform zero-shot audio classification.
- Score: 16.115449653258356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Supervised learning methods can solve a given problem when a large set of labeled data is available. However, acquiring a dataset that covers all the target classes typically requires manual labeling, which is expensive and time-consuming. Zero-shot learning models can classify unseen concepts by exploiting their semantic information. The present study introduces image embeddings as side information for zero-shot audio classification via a nonlinear acoustic-semantic projection. We extract the semantic image representations from the Open Images dataset and evaluate the performance of the models on an audio subset of AudioSet using semantic information from different domains: image, audio, and textual. We demonstrate that image embeddings can be used as semantic information to perform zero-shot audio classification. The experimental results show that the image and textual embeddings display similar performance, both individually and together. We additionally calculate semantic acoustic embeddings from the test samples to provide an upper bound on performance. The results show that classification performance is highly sensitive to the semantic relation between the test and training classes, and that textual and image embeddings can approach the performance of the semantic acoustic embeddings when the seen and unseen classes are semantically similar.
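The mechanism the abstract describes (projecting audio embeddings into a shared semantic space and assigning each clip to the nearest unseen-class embedding) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding dimensions, the two-layer MLP, and the cosine-similarity scoring are assumptions chosen for clarity. In practice, the acoustic embeddings would come from a pretrained audio model, and the class embeddings from image (Open Images), textual, or acoustic sources, as in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative dimensions (assumptions, not the paper's exact configuration):
AUDIO_DIM = 128      # e.g., a pretrained audio clip embedding
SEMANTIC_DIM = 300   # e.g., a textual or pooled image class embedding

class AcousticSemanticProjection(nn.Module):
    """Nonlinear projection from the acoustic space into the semantic space."""
    def __init__(self, audio_dim=AUDIO_DIM, semantic_dim=SEMANTIC_DIM, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, semantic_dim),
        )

    def forward(self, audio_embs: torch.Tensor) -> torch.Tensor:
        return self.net(audio_embs)

def zero_shot_classify(model: nn.Module,
                       audio_embs: torch.Tensor,
                       class_embs: torch.Tensor) -> torch.Tensor:
    """Assign each clip to the unseen class whose semantic embedding
    (image, textual, or acoustic) is closest to the projected audio."""
    with torch.no_grad():
        projected = F.normalize(model(audio_embs), dim=-1)
        classes = F.normalize(class_embs, dim=-1)
        scores = projected @ classes.T   # cosine similarities
    return scores.argmax(dim=-1)

# Toy usage with random stand-ins for real embeddings.
model = AcousticSemanticProjection()
clips = torch.randn(4, AUDIO_DIM)        # 4 audio clips
unseen = torch.randn(5, SEMANTIC_DIM)    # 5 unseen-class embeddings
print(zero_shot_classify(model, clips, unseen))
```

Training such a projection on the seen classes would typically minimize a compatibility or regression loss between the projected audio and the matching class embedding. The "semantic acoustic embeddings" upper bound mentioned in the abstract can then be read as swapping in class embeddings computed from the test clips' own acoustic features (e.g., per-class averages), which is why they bound the performance achievable with image or textual side information.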
Related papers
- Evaluating authenticity and quality of image captions via sentiment and semantic analyses [0.0]
Deep learning relies heavily on huge amounts of labelled data for tasks such as natural language processing and computer vision.
In image-to-text or image-to-image pipelines, opinion (sentiment) may be inadvertently learned by a model from human-generated image captions.
This study proposes an evaluation method focused on sentiment and semantic richness.
arXiv Detail & Related papers (2024-09-14T23:50:23Z)
- CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes [93.71909293023663]
Cross-modality Aligned Prototypes (CAPro) is a unified contrastive learning framework to learn visual representations with correct semantics.
CAPro achieves new state-of-the-art performance and exhibits robustness to open-set recognition.
arXiv Detail & Related papers (2023-10-15T07:20:22Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning [113.50220968583353]
We propose to discover semantic embeddings containing discriminative visual properties for zero-shot learning.
Our model visually divides a set of images from seen classes into clusters of local image regions according to their visual similarity.
We demonstrate that our visually-grounded semantic embeddings further improve performance over word embeddings across various ZSL models by a large margin.
arXiv Detail & Related papers (2022-03-20T03:49:02Z)
- Evaluating language-biased image classification based on semantic representations [13.508894957080777]
Humans show language-biased image recognition for a word-embedded image, known as picture-word interference.
Similar to humans, recent artificial models jointly trained on texts and images, e.g., OpenAI CLIP, show language-biased image classification.
arXiv Detail & Related papers (2022-01-26T15:46:36Z)
- Detection and Captioning with Unseen Object Classes [12.894104422808242]
Test images may contain visual objects with no corresponding visual or textual training examples.
We propose a detection-driven approach based on a generalized zero-shot detection model and a template-based sentence generation model.
Our experiments show that the proposed zero-shot detection model obtains state-of-the-art performance on the MS-COCO dataset.
arXiv Detail & Related papers (2021-08-13T10:43:20Z)
- Semantic Disentangling Generalized Zero-Shot Learning [50.259058462272435]
Generalized Zero-Shot Learning (GZSL) aims to recognize images from both seen and unseen categories.
In this paper, we propose a novel feature disentangling approach based on an encoder-decoder architecture.
The proposed model aims to distill quality semantic-consistent representations that capture intrinsic features of seen images.
arXiv Detail & Related papers (2021-01-20T05:46:21Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that the proposed method maintains robust performance and gives more flexible scores to candidate captions when encountering semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
- Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation [128.03739769844736]
Two neural co-attentions are incorporated into the classifier to capture cross-image semantic similarities and differences.
In addition to boosting object pattern learning, the co-attention can leverage context from other related images to improve localization map inference.
Our algorithm sets a new state of the art in all these settings, demonstrating its efficacy and generalizability.
arXiv Detail & Related papers (2020-07-03T21:53:46Z)
- Learning unbiased zero-shot semantic segmentation networks via transductive transfer [14.55508599873219]
We propose an easy-to-implement transductive approach to alleviate the prediction bias in zero-shot semantic segmentation.
Our method assumes both the source images with full pixel-level labels and unlabeled target images are available during training.
arXiv Detail & Related papers (2020-07-01T14:25:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.