Cross-Level Multi-Instance Distillation for Self-Supervised Fine-Grained Visual Categorization
- URL: http://arxiv.org/abs/2401.08860v2
- Date: Mon, 26 Feb 2024 23:39:53 GMT
- Title: Cross-Level Multi-Instance Distillation for Self-Supervised Fine-Grained Visual Categorization
- Authors: Qi Bi and Wei Ji and Jingjun Yi and Haolan Zhan and Gui-Song Xia
- Abstract summary: We propose a Cross-level Multi-instance Distillation (CMD) framework to tackle the challenge of fine-grained pre-text representation.
Our key idea is to use multiple instance learning to weigh the importance of each image patch in determining the fine-grained pre-text representation.
The proposed method outperforms the contemporary method by up to 10.14% and existing state-of-the-art self-supervised learning approaches by up to 19.78% in terms of both top-1 accuracy and Rank-1 retrieval.
- Score: 41.86678318006878
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-quality annotation of fine-grained visual categories demands
extensive expert knowledge, which makes it costly and time-consuming.
Alternatively, self-supervised learning of fine-grained visual representations
from large volumes of unlabeled images (e.g., of species or brands) offers a
feasible solution. However, recent research finds that existing self-supervised
learning methods struggle to represent fine-grained categories. The bottleneck
is that the pre-text representation is built from every patch-wise embedding,
whereas fine-grained categories are determined by only a few key patches of an
image. In this paper, we propose a Cross-level Multi-instance Distillation
(CMD) framework to tackle this challenge. Our key idea is to use multiple
instance learning to weigh the importance of each image patch in determining
the fine-grained pre-text representation. To comprehensively learn the
relation between informative patches and fine-grained semantics,
multi-instance knowledge distillation is applied both to the region/image crop
pairs from the teacher and student nets and to the region-image crops inside
the teacher/student net, which we term intra-level multi-instance distillation
and inter-level multi-instance distillation, respectively. Extensive
experiments on CUB-200-2011, Stanford Cars and FGVC Aircraft show that the
proposed method outperforms the contemporary method by up to 10.14% and
existing state-of-the-art self-supervised learning approaches by up to 19.78%
in terms of both top-1 accuracy and Rank-1 retrieval.
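
The abstract does not say how patch importance is computed or which distillation loss is used. As a minimal sketch, assuming attention-based MIL pooling (in the style of Ilse et al., 2018) and a DINO-style temperature-scaled cross-entropy (without DINO's centering), the intra-level and inter-level terms could be wired as follows. All function names, parameter shapes, temperatures, and the image-to-region direction of the inter-level term are hypothetical illustrations, not the CMD implementation.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(xs):
    # log(softmax(xs)) via the log-sum-exp trick (avoids log(0)).
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]

def vec_mat(v, M):
    # Row vector (length D) times matrix (D x K) -> list of length K.
    return [sum(v[d] * M[d][k] for d in range(len(v))) for k in range(len(M[0]))]

def mil_attention_pool(patches, V, w):
    """Attention-based MIL pooling (an assumed choice, not taken from the paper).

    patches: N patch embeddings (lists of D floats) from one crop = one "bag".
    V (D x H) and w (length H) are hypothetical attention parameters.
    Returns the pooled D-dim embedding and the N patch-importance weights.
    """
    scores = [sum(math.tanh(h) * wj for h, wj in zip(vec_mat(p, V), w)) for p in patches]
    alpha = softmax(scores)  # per-patch importance; weights sum to 1
    dim = len(patches[0])
    pooled = [sum(a * p[d] for a, p in zip(alpha, patches)) for d in range(dim)]
    return pooled, alpha

def distill_loss(target_logits, pred_logits, t_temp=0.04, s_temp=0.1):
    # DINO-style cross-entropy: the sharpened target distribution supervises
    # the prediction. In real training the target side receives no gradient.
    p_t = softmax([x / t_temp for x in target_logits])
    log_p_s = log_softmax([x / s_temp for x in pred_logits])
    return -sum(p * lq for p, lq in zip(p_t, log_p_s))

def head(bag, params):
    # MIL-pool a bag of patches, then project to K logits with W (D x K).
    pooled, _ = mil_attention_pool(bag, params["V"], params["w"])
    return vec_mat(pooled, params["W"])

def cmd_loss(t_region, t_image, s_region, s_image, teacher, student):
    tr, ti = head(t_region, teacher), head(t_image, teacher)
    sr, si = head(s_region, student), head(s_image, student)
    # Intra-level: same crop level, teacher -> student.
    intra = distill_loss(tr, sr) + distill_loss(ti, si)
    # Inter-level: image crop -> region crop inside each net (assumed direction).
    inter = distill_loss(ti, tr) + distill_loss(si, sr)
    return intra + inter

# Toy usage with random parameters and random patch embeddings.
rng = random.Random(0)
D, H, K, N = 8, 4, 16, 9

def rand_mat(r, c):
    return [[rng.gauss(0, 1) for _ in range(c)] for _ in range(r)]

def make_params():
    return {"V": rand_mat(D, H), "w": [rng.gauss(0, 1) for _ in range(H)], "W": rand_mat(D, K)}

teacher, student = make_params(), make_params()
bags = [rand_mat(N, D) for _ in range(4)]  # teacher region/image, student region/image
loss = cmd_loss(*bags, teacher, student)
print(f"toy CMD-style loss: {loss:.4f}")
```

The teacher and student use separate parameter sets here; in a DINO-like setup the teacher would typically be an exponential moving average of the student, which this sketch does not model.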
Related papers
- Patch-Wise Self-Supervised Visual Representation Learning: A Fine-Grained Approach [4.9204263448542465]
This study introduces an innovative, fine-grained dimension by integrating patch-level discrimination into self-supervised visual representation learning.
We employ a distinctive photometric patch-level augmentation, where each patch is individually augmented, independent from other patches within the same view.
We present a simple yet effective patch-matching algorithm to find the corresponding patches across the augmented views.
(2023-10-28T09:35:30Z)
- DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition with Limited Annotations [79.433122872973]
Multi-label image recognition in the low-label regime is a task of great challenge and practical significance.
We leverage the powerful alignment between textual and visual features pretrained with millions of auxiliary image-text pairs.
We introduce an efficient and effective framework called Evidence-guided Dual Context Optimization (DualCoOp++).
(2023-08-03T17:33:20Z)
- Facing the Void: Overcoming Missing Data in Multi-View Imagery [0.783788180051711]
We propose a novel technique for multi-view image classification that is robust to missing data.
The proposed method, based on state-of-the-art deep learning-based approaches and metric learning, can be easily adapted and exploited in other applications and domains.
Results show that the proposed algorithm provides improvements in multi-view image classification accuracy when compared to state-of-the-art methods.
(2022-05-21T13:21:27Z)
- Fine-Grained Visual Classification using Self Assessment Classifier [12.596520707449027]
Extracting discriminative features plays a crucial role in the fine-grained visual classification task.
In this paper, we introduce a Self Assessment Classifier, which simultaneously leverages the representation of the image and the top-k predicted classes.
We show that our method achieves new state-of-the-art results on the CUB-200-2011, Stanford Dogs, and FGVC Aircraft datasets.
(2022-05-21T07:41:27Z)
- Multi-Label Image Classification with Contrastive Learning [57.47567461616912]
We show that a direct application of contrastive learning yields little improvement in multi-label cases.
We propose a novel framework for multi-label classification with contrastive learning in a fully supervised setting.
(2021-07-24T15:00:47Z)
- Class-Balanced Distillation for Long-Tailed Visual Recognition [100.10293372607222]
Real-world imagery is often characterized by a significant imbalance in the number of images per class, leading to long-tailed distributions.
In this work, we introduce a new framework based on the key observation that a feature representation learned with instance sampling is far from optimal in a long-tailed setting.
Our main contribution is a new training method that leverages knowledge distillation to enhance feature representations.
(2021-04-12T08:21:03Z)
- SCAN: Learning to Classify Images without Labels [73.69513783788622]
We advocate a two-step approach where feature learning and clustering are decoupled.
A self-supervised task from representation learning is employed to obtain semantically meaningful features.
We obtain promising results on ImageNet, and outperform several semi-supervised learning methods in the low-data regime.
(2020-05-25T18:12:33Z)
- Distilling Localization for Self-Supervised Representation Learning [82.79808902674282]
Contrastive learning has revolutionized unsupervised representation learning.
Current contrastive models are ineffective at localizing the foreground object.
We propose a data-driven approach for learning invariance to backgrounds.
(2020-04-14T16:29:42Z)