Semantically Grounded Visual Embeddings for Zero-Shot Learning
- URL: http://arxiv.org/abs/2201.00577v1
- Date: Mon, 3 Jan 2022 10:43:15 GMT
- Title: Semantically Grounded Visual Embeddings for Zero-Shot Learning
- Authors: Shah Nawaz, Jacopo Cavazza, Alessio Del Bue
- Abstract summary: We propose to learn semantically grounded and enriched visual information by computing a joint image and text model with a two-stream network on a proxy task.
Our method, dubbed joint embeddings for zero-shot learning, is evaluated on several benchmark datasets.
- Score: 17.86691047421871
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Zero-shot learning methods rely on fixed visual and semantic embeddings,
extracted from independent vision and language models, both pre-trained for
other large-scale tasks. This is a weakness of current zero-shot learning
frameworks as such disjoint embeddings fail to adequately associate visual and
textual information to their shared semantic content. Therefore, we propose to
learn semantically grounded and enriched visual information by computing a
joint image and text model with a two-stream network on a proxy task. To
improve this alignment between image and textual representations, provided by
attributes, we leverage ancillary captions to provide grounded semantic
information. Our method, dubbed joint embeddings for zero-shot learning, is
evaluated on several benchmark datasets, improving the performance of existing
state-of-the-art methods in both standard ($+1.6\%$ on aPY, $+2.6\%$ on FLO)
and generalized ($+2.1\%$ on AWA$2$, $+2.2\%$ on CUB) zero-shot recognition.
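To make the two-stream proxy-task idea above concrete, here is a minimal sketch, assuming pre-extracted image features (e.g., from a CNN) and caption/attribute embeddings (e.g., from a pre-trained language model). The module and function names, feature dimensions, and the InfoNCE-style alignment loss are illustrative assumptions, not the authors' released architecture.

```python
# Minimal sketch (not the authors' exact architecture) of a two-stream
# image/text network trained on a proxy alignment task, so that the visual
# stream produces semantically grounded embeddings usable for zero-shot
# classification. Dimensions and names below are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoStreamJointEmbedding(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, joint_dim=512):
        super().__init__()
        # Visual stream: projects pre-extracted image features into the joint space.
        self.visual = nn.Sequential(
            nn.Linear(img_dim, joint_dim), nn.ReLU(), nn.Linear(joint_dim, joint_dim)
        )
        # Textual stream: projects caption/attribute embeddings into the same space.
        self.textual = nn.Sequential(
            nn.Linear(txt_dim, joint_dim), nn.ReLU(), nn.Linear(joint_dim, joint_dim)
        )

    def forward(self, img_feats, txt_feats):
        v = F.normalize(self.visual(img_feats), dim=-1)
        t = F.normalize(self.textual(txt_feats), dim=-1)
        return v, t


def proxy_alignment_loss(v, t, temperature=0.07):
    """Symmetric InfoNCE-style proxy task: matching image/caption pairs are
    pulled together, mismatched pairs within the batch are pushed apart."""
    logits = v @ t.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    model = TwoStreamJointEmbedding()
    imgs = torch.randn(8, 2048)   # stand-in for image features
    caps = torch.randn(8, 768)    # stand-in for caption embeddings
    v, t = model(imgs, caps)
    print(proxy_alignment_loss(v, t).item())
```

For context, generalized zero-shot results such as the AWA$2$ and CUB gains quoted above are commonly reported as the harmonic mean $H = \frac{2\,A_s A_u}{A_s + A_u}$ of the per-class accuracies on seen ($A_s$) and unseen ($A_u$) classes.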
Related papers
- Visual-Semantic Decomposition and Partial Alignment for Document-based Zero-Shot Learning [14.77066147494556]
We propose a novel network that extracts multi-view semantic concepts from documents and images and aligns the matching concepts rather than entire concepts.
We consistently outperform state-of-the-art methods with two document sources on three standard benchmarks for document-based zero-shot learning.
arXiv Detail & Related papers (2024-07-22T13:15:04Z)
- A Simple Framework for Open-Vocabulary Zero-Shot Segmentation [36.01531912271202]
SimZSS is a framework for open-vocabulary zero-shot segmentation.
It exploits the discrete nature of text and linguistic knowledge to pinpoint local concepts within captions.
SimZSS achieves state-of-the-art results on 7 out of 8 benchmark datasets in less than 15 minutes.
arXiv Detail & Related papers (2024-06-23T11:57:08Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification [123.90912800376039]
Online textual documents, e.g., Wikipedia, contain rich visual descriptions about object classes.
We propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents.
Our method leads to highly interpretable results where document words can be grounded in the image regions.
arXiv Detail & Related papers (2022-09-21T12:18:31Z)
- Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR).
Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner.
Experimental results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets.
arXiv Detail & Related papers (2022-05-03T17:39:27Z)
- Unified Contrastive Learning in Image-Text-Label Space [130.31947133453406]
Unified Contrastive Learning (UniCL) is an effective way of learning semantically rich yet discriminative representations.
UniCL stand-alone is a good learner on pure image-label data, rivaling supervised learning methods across three image classification datasets; a sketch of this image-text-label contrastive objective appears after this list.
arXiv Detail & Related papers (2022-04-07T17:34:51Z)
- VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning [113.50220968583353]
We propose to discover semantic embeddings containing discriminative visual properties for zero-shot learning.
Our model visually divides a set of images from seen classes into clusters of local image regions according to their visual similarity.
We demonstrate that our visually-grounded semantic embeddings further improve performance over word embeddings across various ZSL models by a large margin.
arXiv Detail & Related papers (2022-03-20T03:49:02Z)
- Seed the Views: Hierarchical Semantic Alignment for Contrastive Representation Learning [116.91819311885166]
We propose a hierarchical semantic alignment strategy that expands the views generated by a single image to cross-samples and multi-level representations.
Our method, termed CsMl, integrates multi-level visual representations across samples in a robust way.
arXiv Detail & Related papers (2020-12-04T17:26:24Z)
- Two-Level Adversarial Visual-Semantic Coupling for Generalized Zero-shot Learning [21.89909688056478]
We propose a new two-level joint training scheme that augments the generative network with an inference network during training.
This provides strong cross-modal interaction for effective transfer of knowledge between visual and semantic domains.
We evaluate our approach on four benchmark datasets against several state-of-the-art methods and report its performance.
arXiv Detail & Related papers (2020-07-15T15:34:09Z)
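As referenced in the UniCL entry above, below is a minimal sketch of a unified contrastive objective in image-text-label space, based only on the abstract summary: the label-sharing positive mask and the function names are assumptions for illustration, not the paper's reference implementation. The idea sketched is that two samples count as a positive pair whenever they share a label, so image-text pairing and label supervision are handled by one bidirectional contrastive loss.

```python
# Hedged sketch of a UniCL-style objective in image-text-label space
# (illustrative reading of the abstract, not the paper's official code).
import torch
import torch.nn.functional as F


def unified_contrastive_loss(img_emb, txt_emb, labels, temperature=0.07):
    """img_emb, txt_emb: (B, D) L2-normalized embeddings; labels: (B,) ints."""
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarities
    # Positive mask: entry (i, j) is 1 when samples i and j share a label.
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    targets = pos / pos.sum(dim=1, keepdim=True)          # soft multi-positive targets
    # Bidirectional cross-entropy against the soft targets (mask is symmetric).
    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(1).mean()
    loss_t2i = -(targets * F.log_softmax(logits.t(), dim=1)).sum(1).mean()
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    B, D = 8, 256
    img = F.normalize(torch.randn(B, D), dim=-1)
    txt = F.normalize(torch.randn(B, D), dim=-1)
    labels = torch.randint(0, 3, (B,))
    print(unified_contrastive_loss(img, txt, labels).item())
```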
This list is automatically generated from the titles and abstracts of the papers on this site.