HIRL: A General Framework for Hierarchical Image Representation Learning
- URL: http://arxiv.org/abs/2205.13159v1
- Date: Thu, 26 May 2022 05:13:26 GMT
- Title: HIRL: A General Framework for Hierarchical Image Representation Learning
- Authors: Minghao Xu, Yuanfan Guo, Xuanyu Zhu, Jiawen Li, Zhenbang Sun, Jian
Tang, Yi Xu, Bingbing Ni
- Abstract summary: We propose a general framework for Hierarchical Image Representation Learning (HIRL).
This framework aims to learn multiple semantic representations for each image, and these representations are structured to encode image semantics from fine-grained to coarse-grained.
Based on a probabilistic factorization, HIRL learns the most fine-grained semantics by an off-the-shelf image SSL approach and learns multiple coarse-grained semantics by a novel semantic path discrimination scheme.
- Score: 54.12773508883117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning self-supervised image representations has been broadly studied to
boost various visual understanding tasks. Existing methods typically learn a
single level of image semantics like pairwise semantic similarity or image
clustering patterns. However, these methods can hardly capture multiple levels
of semantic information that naturally exists in an image dataset, e.g., the
semantic hierarchy of "Persian cat to cat to mammal" encoded in an image
database for species. It is thus unknown whether an arbitrary image
self-supervised learning (SSL) approach can benefit from learning such
hierarchical semantics. To answer this question, we propose a general framework
for Hierarchical Image Representation Learning (HIRL). This framework aims to
learn multiple semantic representations for each image, and these
representations are structured to encode image semantics from fine-grained to
coarse-grained. Based on a probabilistic factorization, HIRL learns the most
fine-grained semantics by an off-the-shelf image SSL approach and learns
multiple coarse-grained semantics by a novel semantic path discrimination
scheme. We adopt six representative image SSL methods as baselines and study
how they perform under HIRL. Under rigorous and fair comparison, performance
gains are observed on all six methods across diverse downstream tasks, which,
for the first time, verifies the general effectiveness of learning hierarchical
image semantics. All source code and model weights are available at
https://github.com/hirl-team/HIRL
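The fine-to-coarse idea above can be illustrated with a minimal sketch: a fine-grained embedding comes from any off-the-shelf SSL encoder, and each coarser level is represented by a smaller set of prototypes; a "semantic path" selects one prototype per level, and a path score multiplies the per-level assignment probabilities. All names, sizes, and the scoring function here are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

# Hypothetical sketch of HIRL-style hierarchical semantics (illustrative
# only, not the authors' API). z is a stand-in for a fine-grained embedding
# produced by any off-the-shelf SSL encoder.

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    # Project vectors onto the unit hypersphere, as is common in
    # contrastive / prototype-based SSL.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Toy setup: 3 semantic levels with shrinking prototype counts
# (fine -> coarse), mimicking e.g. species -> genus -> family.
dim = 8
num_prototypes = [16, 8, 4]
prototypes = [l2_normalize(rng.standard_normal((k, dim)))
              for k in num_prototypes]

# Fine-grained embedding from the (stand-in) SSL encoder.
z = l2_normalize(rng.standard_normal(dim))

# A "semantic path": the most similar prototype at each level.
path = [int(np.argmax(protos @ z)) for protos in prototypes]

def path_score(z, prototypes, path, temperature=0.1):
    # Product of per-level softmax probabilities of the chosen
    # prototypes; a discrimination loss could push this score up for
    # an image's own path and down for other paths.
    score = 1.0
    for protos, idx in zip(prototypes, path):
        logits = protos @ z / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        score *= probs[idx]
    return score

s = path_score(z, prototypes, path)
```

A training scheme would then contrast an image's own semantic path against alternative paths, which is one plausible reading of the paper's "semantic path discrimination"; the exact loss is in the linked repository.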
Related papers
- Vocabulary-free Image Classification [75.38039557783414]
We formalize a novel task, termed Vocabulary-free Image Classification (VIC).
VIC aims to assign to an input image a class that resides in an unconstrained language-induced semantic space, without the prerequisite of a known vocabulary.
CaSED is a method that exploits a pre-trained vision-language model and an external vision-language database to address VIC in a training-free manner.
arXiv Detail & Related papers (2023-06-01T17:19:43Z)
- Semantic Cross Attention for Few-shot Learning [9.529264466445236]
We propose a multi-task learning approach to view semantic features of label text as an auxiliary task.
Our proposed model uses word-embedding representations as semantic features to help train the embedding network and a semantic cross-attention module to bridge the semantic features into the typical visual modal.
arXiv Detail & Related papers (2022-10-12T15:24:59Z)
- Comprehending and Ordering Semantics for Image Captioning [124.48670699658649]
We propose a new Transformer-style architecture, named Comprehending and Ordering Semantics Networks (COS-Net).
COS-Net unifies an enriched semantic comprehending process and a learnable semantic ordering process into a single architecture.
arXiv Detail & Related papers (2022-06-14T15:51:14Z)
- HCSC: Hierarchical Contrastive Selective Coding [44.655310210531226]
Hierarchical Contrastive Selective Coding (HCSC) is a novel contrastive learning framework.
We introduce an elaborate pair selection scheme to make image representations better fit semantic structures.
We verify the superior performance of HCSC over state-of-the-art contrastive methods.
arXiv Detail & Related papers (2022-02-01T15:04:40Z)
- Evaluating language-biased image classification based on semantic representations [13.508894957080777]
Humans show language-biased image recognition for a word-embedded image, known as picture-word interference.
Similar to humans, recent artificial models jointly trained on texts and images, e.g., OpenAI CLIP, show language-biased image classification.
arXiv Detail & Related papers (2022-01-26T15:46:36Z)
- Semantic decoupled representation learning for remote sensing image change detection [17.548248093344576]
We propose a semantic decoupled representation learning approach for RS image CD.
We disentangle representations of different semantic regions by leveraging the semantic mask.
We additionally force the model to distinguish different semantic representations, which benefits the recognition of objects of interest in the downstream CD task.
arXiv Detail & Related papers (2022-01-15T07:35:26Z)
- Seed the Views: Hierarchical Semantic Alignment for Contrastive Representation Learning [116.91819311885166]
We propose a hierarchical semantic alignment strategy that expands the views generated by a single image to cross-sample and multi-level representations.
Our method, termed CsMl, has the ability to integrate multi-level visual representations across samples in a robust way.
arXiv Detail & Related papers (2020-12-04T17:26:24Z)
- Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation [128.03739769844736]
Two neural co-attentions are incorporated into the classifier to capture cross-image semantic similarities and differences.
In addition to boosting object pattern learning, the co-attention can leverage context from other related images to improve localization map inference.
Our algorithm sets a new state of the art on all these settings, demonstrating its efficacy and generalizability.
arXiv Detail & Related papers (2020-07-03T21:53:46Z)
- Hierarchical Image Classification using Entailment Cone Embeddings [68.82490011036263]
We first inject label-hierarchy knowledge into an arbitrary CNN-based classifier.
We empirically show that availability of such external semantic information in conjunction with the visual semantics from images boosts overall performance.
arXiv Detail & Related papers (2020-04-02T10:22:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.