What Remains of Visual Semantic Embeddings
- URL: http://arxiv.org/abs/2107.11991v1
- Date: Mon, 26 Jul 2021 06:55:11 GMT
- Title: What Remains of Visual Semantic Embeddings
- Authors: Yue Jiao, Jonathon Hare, Adam Prügel-Bennett
- Abstract summary: We introduce the split of tiered-ImageNet to the ZSL task to avoid the structural flaws in the standard ImageNet benchmark.
We build a unified framework for ZSL with contrastive learning as pre-training, which guarantees no semantic information leakage.
Our work enables fair evaluation of visual semantic embedding models in a ZSL setting in which semantic inference is decisive.
- Score: 0.618778092044887
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Zero-shot learning (ZSL) has seen a surge in interest over the past
decade because of its tight links with the mechanism by which young children recognize novel objects.
Although different paradigms of visual semantic embedding models are designed
to align visual features and distributed word representations, it is unclear to
what extent current ZSL models encode semantic information from distributed
word representations. In this work, we introduce the split of tiered-ImageNet
to the ZSL task, in order to avoid the structural flaws in the standard
ImageNet benchmark. We build a unified framework for ZSL with contrastive
learning as pre-training, which guarantees no semantic information leakage and
encourages linearly separable visual features. Our work enables fair
evaluation of visual semantic embedding models in a ZSL setting in which
semantic inference is decisive. With this framework, we show that current ZSL
models struggle to encode semantic relationships from word analogy and word
hierarchy. Our analyses provide motivation for exploring the role of contextual
language representations in ZSL tasks.
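To make the evaluation setting concrete, below is a minimal sketch (not the authors' code) of the classic visual semantic embedding baseline such a framework evaluates: visual features from a frozen, contrastively pre-trained encoder are mapped into word-embedding space with a linear map fit on seen classes only, and unseen-class images are classified by cosine similarity to their class word vectors. The function names, the ridge-regression solver, and the toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fit_linear_map(visual_feats, seen_class_vecs, labels, reg=1e-3):
    """Ridge-regression map W from visual space to word-embedding space,
    fit on seen-class examples only: min ||XW - E[y]||^2 + reg*||W||^2."""
    targets = seen_class_vecs[labels]                      # (n, d_word)
    d_vis = visual_feats.shape[1]
    a = visual_feats.T @ visual_feats + reg * np.eye(d_vis)
    b = visual_feats.T @ targets
    return np.linalg.solve(a, b)                           # (d_vis, d_word)

def zero_shot_predict(visual_feats, w, unseen_class_vecs):
    """Classify unseen-class images by cosine similarity between their
    projected features and the unseen-class word embeddings."""
    proj = visual_feats @ w
    proj = proj / (np.linalg.norm(proj, axis=1, keepdims=True) + 1e-12)
    cls = unseen_class_vecs / (np.linalg.norm(unseen_class_vecs, axis=1,
                                              keepdims=True) + 1e-12)
    return np.argmax(proj @ cls.T, axis=1)

# Toy usage with random stand-ins for encoder features and word vectors.
rng = np.random.default_rng(0)
X_seen, y_seen = rng.normal(size=(100, 64)), rng.integers(0, 5, size=100)
seen_vecs, unseen_vecs = rng.normal(size=(5, 50)), rng.normal(size=(3, 50))
W = fit_linear_map(X_seen, seen_vecs, y_seen)
preds = zero_shot_predict(rng.normal(size=(10, 64)), W, unseen_vecs)
```

In this setup, all semantic inference rests on the class word vectors, so whether the unseen classes are ranked correctly depends on how much analogy and hierarchy information those distributed representations actually carry, which is what the paper probes.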
Related papers
- ZeroMamba: Exploring Visual State Space Model for Zero-Shot Learning [28.52949450389388]
Zero-shot learning (ZSL) aims to recognize unseen classes by transferring semantic knowledge from seen classes to unseen ones.
We propose a parameter-efficient ZSL framework called ZeroMamba to advance ZSL.
ZeroMamba comprises three key components: Semantic-aware Local Projection (SLP), Global Representation Learning (GRL), and Semantic Fusion (SeF).
arXiv Detail & Related papers (2024-08-27T08:39:47Z)
- Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning [56.65891462413187]
We propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT)
ZSLViT first introduces semantic-embedded token learning to improve the visual-semantic correspondences via semantic enhancement.
Then, we fuse low semantic-visual correspondence visual tokens to discard the semantic-unrelated visual information for visual enhancement.
arXiv Detail & Related papers (2024-04-11T12:59:38Z)
- Prompting Language-Informed Distribution for Compositional Zero-Shot Learning [73.49852821602057]
The compositional zero-shot learning (CZSL) task aims to recognize unseen compositional visual concepts.
We propose a model that prompts the language-informed distribution, dubbed PLID, for the task.
Experimental results on MIT-States, UT-Zappos, and C-GQA datasets show the superior performance of the PLID to the prior arts.
arXiv Detail & Related papers (2023-05-23T18:00:22Z)
- I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification [123.90912800376039]
Online textual documents, e.g., Wikipedia, contain rich visual descriptions about object classes.
We propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents.
Our method leads to highly interpretable results where document words can be grounded in the image regions.
arXiv Detail & Related papers (2022-09-21T12:18:31Z)
- VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning [113.50220968583353]
We propose to discover semantic embeddings containing discriminative visual properties for zero-shot learning.
Our model visually divides a set of images from seen classes into clusters of local image regions according to their visual similarity.
We demonstrate that our visually-grounded semantic embeddings further improve performance over word embeddings across various ZSL models by a large margin.
arXiv Detail & Related papers (2022-03-20T03:49:02Z)
- Language Models as Zero-shot Visual Semantic Learners [0.618778092044887]
We propose a Visual Semantic Embedding Probe (VSEP) to probe the semantic information of contextualized word embeddings.
The VSEP with contextual representations can distinguish word-level object representations in complicated scenes as a compositional zero-shot learner.
We find that contextual representations in language models outperform static word embeddings when the compositional chain of objects is short.
arXiv Detail & Related papers (2021-07-26T08:22:55Z)
- Learning Robust Visual-semantic Mapping for Zero-shot Learning [8.299945169799795]
We focus on fully empowering the semantic feature space, which is one of the key building blocks of zero-shot learning (ZSL).
In ZSL, the common practice is to train a mapping function between the visual and semantic feature spaces with labeled seen class examples.
Under such a paradigm, the ZSL models may easily suffer from the domain shift problem when constructing and reusing the mapping function.
arXiv Detail & Related papers (2021-04-12T17:39:38Z)
- Information Bottleneck Constrained Latent Bidirectional Embedding for Zero-Shot Learning [59.58381904522967]
We propose a novel embedding based generative model with a tight visual-semantic coupling constraint.
We learn a unified latent space that calibrates the embedded parametric distributions of both visual and semantic spaces.
Our method can be easily extended to transductive ZSL setting by generating labels for unseen images.
arXiv Detail & Related papers (2020-09-16T03:54:12Z)
- Transferring Cross-domain Knowledge for Video Sign Language Recognition [103.9216648495958]
Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation.
We propose a novel method that learns domain-invariant visual concepts and fertilizes WSLR models by transferring knowledge of subtitled news sign to them.
arXiv Detail & Related papers (2020-03-08T03:05:21Z)