VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning
- URL: http://arxiv.org/abs/2203.10444v2
- Date: Fri, 26 May 2023 09:10:13 GMT
- Title: VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning
- Authors: Wenjia Xu, Yongqin Xian, Jiuniu Wang, Bernt Schiele, Zeynep Akata
- Abstract summary: We propose to discover semantic embeddings containing discriminative visual properties for zero-shot learning.
Our model visually divides a set of images from seen classes into clusters of local image regions according to their visual similarity.
We demonstrate that our visually-grounded semantic embeddings further improve performance over word embeddings across various ZSL models by a large margin.
- Score: 113.50220968583353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human-annotated attributes serve as powerful semantic embeddings in zero-shot
learning. However, their annotation process is labor-intensive and needs expert
supervision. Current unsupervised semantic embeddings, i.e., word embeddings,
enable knowledge transfer between classes. However, word embeddings do not
always reflect visual similarities and result in inferior zero-shot
performance. We propose to discover semantic embeddings containing
discriminative visual properties for zero-shot learning, without requiring any
human annotation. Our model visually divides a set of images from seen classes
into clusters of local image regions according to their visual similarity, and
further enforces class discrimination and semantic relatedness on these clusters. To
associate these clusters with previously unseen classes, we use external
knowledge, e.g., word embeddings, and propose a novel class relation discovery
module. Through quantitative and qualitative evaluation, we demonstrate that
our model discovers semantic embeddings that model the visual properties of
both seen and unseen classes. Furthermore, we demonstrate on three benchmarks
that our visually-grounded semantic embeddings further improve performance over
word embeddings across various ZSL models by a large margin.
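To make the abstract's pipeline concrete, below is a minimal sketch of its two stages: clustering local image-region features from seen classes, and transferring the resulting cluster-based class embeddings to unseen classes via word-embedding similarity. This is an illustrative simplification rather than the authors' implementation; the region feature extractor, the use of k-means, and the softmax-weighted `relate_to_unseen` transfer are assumptions standing in for the learned class relation discovery module.
```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative sketch of the two stages described in the abstract; not the
# authors' implementation. The extractor, k-means, and the softmax transfer
# below are assumptions.

def cluster_regions(region_features, n_clusters=50, seed=0):
    """Cluster local image-region descriptors (N_regions, D) from seen-class images."""
    kmeans = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    labels = kmeans.fit_predict(region_features)
    return kmeans.cluster_centers_, labels

def seen_class_embeddings(labels, region_class_ids, n_clusters, n_seen):
    """Represent each seen class by how often its regions fall into each cluster."""
    emb = np.zeros((n_seen, n_clusters))
    for cluster_id, class_id in zip(labels, region_class_ids):
        emb[class_id, cluster_id] += 1.0
    return emb / (emb.sum(axis=1, keepdims=True) + 1e-8)

def relate_to_unseen(seen_emb, seen_wordvecs, unseen_wordvecs, temperature=0.1):
    """Transfer cluster-based embeddings to unseen classes by softmax-weighted
    word-vector similarity to the seen classes (a simple stand-in for the
    learned class relation discovery module)."""
    sim = unseen_wordvecs @ seen_wordvecs.T                 # (n_unseen, n_seen)
    weights = np.exp(sim / temperature)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ seen_emb                               # (n_unseen, n_clusters)
```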
Related papers
- Dual Relation Mining Network for Zero-Shot Learning [48.89161627050706]
We propose a Dual Relation Mining Network (DRMN) to enable effective visual-semantic interactions and learn semantic relationships among attributes for knowledge transfer.
Specifically, we introduce a Dual Attention Block (DAB) for visual-semantic relationship mining, which enriches visual information by multi-level feature fusion.
For semantic relationship modeling, we utilize a Semantic Interaction Transformer (SIT) to enhance the generalization of attribute representations among images.
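As a rough illustration of the kind of visual-semantic interaction described above, the sketch below computes attention from attribute embeddings over local visual features. It is not the paper's Dual Attention Block or Semantic Interaction Transformer; the module name, dimensions, and single-head design are assumptions.
```python
import torch
import torch.nn as nn

# Generic visual-semantic attention; NOT the paper's DAB or SIT.
class VisualSemanticAttention(nn.Module):
    def __init__(self, vis_dim=2048, attr_dim=300, hidden=512):
        super().__init__()
        self.q = nn.Linear(attr_dim, hidden)   # queries from attribute embeddings
        self.k = nn.Linear(vis_dim, hidden)    # keys from local visual features
        self.v = nn.Linear(vis_dim, hidden)    # values from local visual features

    def forward(self, regions, attributes):
        # regions: (B, R, vis_dim) local features; attributes: (A, attr_dim)
        q = self.q(attributes)                               # (A, hidden)
        k, v = self.k(regions), self.v(regions)              # (B, R, hidden)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v                                      # (B, A, hidden) attribute-grounded features
```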
arXiv Detail & Related papers (2024-05-06T16:31:19Z)
- Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation [13.001629605405954]
We study universal zero-shot segmentation in this work to achieve panoptic, instance, and semantic segmentation for novel categories without any training samples.
We introduce a generative model to synthesize features for unseen categories, which links semantic and visual spaces.
The proposed approach achieves state-of-the-art performance on zero-shot panoptic segmentation, instance segmentation, and semantic segmentation.
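The entry above follows the common generative zero-shot pattern of synthesizing visual features for unseen categories from their semantic embeddings. The sketch below shows only that generic pattern; it is not the paper's primitive generation or semantic-related alignment, and the network sizes and the `FeatureGenerator` name are assumptions.
```python
import torch
import torch.nn as nn

# Generic generative-ZSL sketch: synthesize visual features for unseen classes
# from semantic embeddings plus noise, then train an ordinary classifier on them.
class FeatureGenerator(nn.Module):
    def __init__(self, sem_dim=300, noise_dim=128, feat_dim=2048):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(sem_dim + noise_dim, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, feat_dim), nn.ReLU(),
        )

    def forward(self, sem, n_per_class=1):
        # sem: (C, sem_dim) semantic embeddings of (unseen) classes
        sem = sem.repeat_interleave(n_per_class, dim=0)
        noise = torch.randn(sem.size(0), self.noise_dim, device=sem.device)
        return self.net(torch.cat([sem, noise], dim=1))      # (C * n_per_class, feat_dim)
```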
arXiv Detail & Related papers (2023-06-19T17:59:16Z)
- Exploiting Semantic Attributes for Transductive Zero-Shot Learning [97.61371730534258]
Zero-shot learning aims to recognize unseen classes by generalizing the relation between visual features and semantic attributes learned from the seen classes.
We present a novel transductive ZSL method that produces semantic attributes of the unseen data and imposes them on the generative process.
Experiments on five standard benchmarks show that our method yields state-of-the-art results for zero-shot learning.
arXiv Detail & Related papers (2023-03-17T09:09:48Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We examine how well vision-language models understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning [37.48292304239107]
We present a transformer-based end-to-end ZSL method named DUET.
We develop a cross-modal semantic grounding network to investigate the model's capability of disentangling semantic attributes from the images.
We find that DUET often achieves state-of-the-art performance, that its components are effective, and that its predictions are interpretable.
arXiv Detail & Related papers (2022-07-04T11:12:12Z)
- Disentangling Visual Embeddings for Attributes and Objects [38.27308243429424]
We study the problem of compositional zero-shot learning for object-attribute recognition.
Prior works use visual features extracted with a backbone network pre-trained for object classification.
We propose a novel architecture that can disentangle attribute and object features in the visual space.
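A minimal sketch of the general disentangling idea follows, assuming two projection heads that map a shared backbone feature into separate attribute and object embedding spaces; the paper's actual architecture and training objectives are more involved, and all names and dimensions here are illustrative.
```python
import torch
import torch.nn as nn

# Two-head sketch of attribute/object disentangling; names and sizes are assumptions.
class AttrObjDisentangler(nn.Module):
    def __init__(self, feat_dim=2048, emb_dim=300):
        super().__init__()
        self.attr_head = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(), nn.Linear(512, emb_dim))
        self.obj_head = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(), nn.Linear(512, emb_dim))

    def forward(self, feats):
        # feats: (B, feat_dim) image features from a pre-trained backbone
        return self.attr_head(feats), self.obj_head(feats)
```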
arXiv Detail & Related papers (2022-05-17T17:59:36Z)
- Language Models as Zero-shot Visual Semantic Learners [0.618778092044887]
We propose a Visual Semantic Embedding Probe (VSEP) to probe the semantic information of contextualized word embeddings.
The VSEP with contextual representations can distinguish word-level object representations in complicated scenes as a compositional zero-shot learner.
We find that contextual representations in language models outperform static word embeddings when the compositional chain of objects is short.
arXiv Detail & Related papers (2021-07-26T08:22:55Z)
- Semantic Disentangling Generalized Zero-Shot Learning [50.259058462272435]
Generalized Zero-Shot Learning (GZSL) aims to recognize images from both seen and unseen categories.
In this paper, we propose a novel feature disentangling approach based on an encoder-decoder architecture.
The proposed model aims to distill high-quality semantic-consistent representations that capture intrinsic features of seen images.
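As a hedged sketch of such an encoder-decoder, the code below splits a visual feature into a semantic-consistent code and a semantic-unrelated code and reconstructs the input from both; the paper's specific disentangling losses are omitted and the latent sizes are assumptions.
```python
import torch
import torch.nn as nn

# Encoder-decoder sketch of semantic disentangling; losses omitted, sizes assumed.
class DisentanglingAutoencoder(nn.Module):
    def __init__(self, feat_dim=2048, sem_latent=312, res_latent=128):
        super().__init__()
        self.enc_sem = nn.Linear(feat_dim, sem_latent)   # semantic-consistent code
        self.enc_res = nn.Linear(feat_dim, res_latent)   # semantic-unrelated code
        self.dec = nn.Linear(sem_latent + res_latent, feat_dim)

    def forward(self, x):
        zs, zr = self.enc_sem(x), self.enc_res(x)
        recon = self.dec(torch.cat([zs, zr], dim=1))
        return zs, zr, recon
```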
arXiv Detail & Related papers (2021-01-20T05:46:21Z)
- CLASTER: Clustering with Reinforcement Learning for Zero-Shot Action Recognition [52.66360172784038]
We propose a clustering-based model that considers all training samples at once instead of optimizing for each instance individually.
We call the proposed method CLASTER and observe that it consistently improves over the state-of-the-art on all standard datasets.
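The sketch below illustrates only the clustering side of this idea, representing each training sample by its similarity to cluster centres computed over all samples jointly; the reinforcement-learning refinement that CLASTER performs is not shown, and the feature construction is an assumption.
```python
import numpy as np
from sklearn.cluster import KMeans

# Clustering side of the idea only; CLASTER's RL-based refinement is not shown.
def cluster_representation(features, n_clusters=16, seed=0):
    """features: (N, D) joint visual-semantic representations of training clips."""
    kmeans = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(features)
    dists = np.linalg.norm(features[:, None, :] - kmeans.cluster_centers_[None], axis=-1)
    sim = np.exp(-dists)                                    # soft similarity to each centre
    return sim / sim.sum(axis=1, keepdims=True)             # (N, n_clusters)
```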
arXiv Detail & Related papers (2021-01-18T12:46:24Z)