Vision Transformer for Contrastive Clustering
- URL: http://arxiv.org/abs/2206.12925v1
- Date: Sun, 26 Jun 2022 17:00:35 GMT
- Title: Vision Transformer for Contrastive Clustering
- Authors: Hua-Bao Ling, Bowen Zhu, Dong Huang, Ding-Hua Chen, Chang-Dong Wang,
Jian-Huang Lai
- Abstract summary: Vision Transformer (ViT) has shown its advantages over the convolutional neural network (CNN).
This paper presents an end-to-end deep image clustering approach termed Vision Transformer for Contrastive Clustering (VTCC).
- Score: 48.476602271481674
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformer (ViT) has shown its advantages over the convolutional
neural network (CNN) with its ability to capture global long-range dependencies
for visual representation learning. Besides ViT, contrastive learning has
recently become another popular research topic. While previous contrastive
learning works are mostly based on CNNs, some recent studies have attempted to
jointly model ViT and contrastive learning for enhanced self-supervised
learning. Despite the considerable progress, these combinations of ViT and
contrastive learning mostly focus on instance-level contrastiveness, often
overlooking the contrastiveness of the global clustering structure and lacking
the ability to directly learn the clustering result (e.g., for images). In
view of this, this paper presents an end-to-end deep image clustering approach
termed Vision Transformer for Contrastive Clustering (VTCC), which for the
first time, to the best of our knowledge, unifies the Transformer and the
contrastive learning for the image clustering task. Specifically, with two
random augmentations performed on each image in a mini-batch, we utilize a ViT
encoder with two weight-sharing views as the backbone to learn the
representations for the augmented samples. To remedy the potential instability
of the ViT, we incorporate a convolutional stem, which uses multiple stacked
small convolutions instead of a big convolution in the patch projection layer,
to split each augmented sample into a sequence of patches. With representations
learned via the backbone, an instance projector and a cluster projector are
further utilized for the instance-level contrastive learning and the global
clustering structure learning, respectively. Extensive experiments on eight
image datasets demonstrate the stability (when training from scratch) and
the superiority (in clustering performance) of VTCC over the state-of-the-art.
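To make the pipeline concrete, below is a minimal PyTorch sketch of the approach the abstract describes: a convolutional stem of stacked small convolutions replaces a single big patch-projection convolution, a shared ViT backbone encodes two augmented views, and an instance projector and a cluster projector drive the two contrastive objectives. All class names, layer sizes, and hyperparameters (ConvStem, dim=256, tau=0.5, etc.) are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of the VTCC pipeline; names and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvStem(nn.Module):
    """Patch projection via stacked small (3x3) convolutions instead of one
    big convolution, as described in the abstract."""
    def __init__(self, in_ch=3, dim=256):
        super().__init__()
        chans = [in_ch, 32, 64, 128, dim]
        layers = []
        for i in range(4):  # four stride-2 3x3 convs: 224x224 -> 14x14 grid
            layers += [nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1),
                       nn.BatchNorm2d(chans[i + 1]),
                       nn.ReLU(inplace=True)]
        self.stem = nn.Sequential(*layers)

    def forward(self, x):
        x = self.stem(x)                     # (B, dim, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, dim): sequence of patches

class VTCC(nn.Module):
    def __init__(self, dim=256, depth=6, heads=8, num_clusters=10):
        super().__init__()
        self.stem = ConvStem(dim=dim)
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, depth)  # shared across views
        self.instance_proj = nn.Sequential(                  # instance-level head
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 128))
        self.cluster_proj = nn.Sequential(                   # cluster-level head
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, num_clusters), nn.Softmax(dim=1))

    def forward(self, x):
        h = self.backbone(self.stem(x)).mean(dim=1)  # mean-pool patch tokens
        return self.instance_proj(h), self.cluster_proj(h)

def contrastive_loss(a, b, tau=0.5):
    """InfoNCE over two views: row i of `a` is pulled toward row i of `b`
    and pushed away from everything else."""
    z = F.normalize(torch.cat([a, b]), dim=1)
    sim = z @ z.t() / tau
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))  # exclude self-similarity
    n = a.size(0)
    target = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, target)

# Two random augmentations of the same mini-batch (random tensors stand in
# for augmented images here) pass through the shared backbone; the cluster
# loss contrasts columns of the soft assignment matrices across views.
model = VTCC()
v1, v2 = torch.randn(8, 3, 224, 224), torch.randn(8, 3, 224, 224)
(z1, p1), (z2, p2) = model(v1), model(v2)
loss = contrastive_loss(z1, z2) + contrastive_loss(p1.t(), p2.t())
loss.backward()
```

Contrastive clustering objectives of this kind often add an entropy regularizer to keep the soft assignments from collapsing into a single cluster; that term is omitted from this sketch for brevity.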
Related papers
- Transformer-based Clipped Contrastive Quantization Learning for
Unsupervised Image Retrieval [15.982022297570108]
Unsupervised image retrieval aims to learn, without any given labels, the important visual characteristics needed to retrieve images similar to a given query image.
In this paper, we propose a TransClippedCLR model that encodes the global context of an image using a Transformer and the local context through patch-based processing.
Results using the proposed clipped contrastive learning are greatly improved on all datasets compared to the same backbone network with vanilla contrastive learning.
arXiv Detail & Related papers (2024-01-27T09:39:11Z)
- Deep Image Clustering with Contrastive Learning and Multi-scale Graph Convolutional Networks [58.868899595936476]
This paper presents a new deep clustering approach termed Image Clustering with Contrastive Learning and Multi-scale Graph Convolutional Networks (IcicleGCN).
Experiments on multiple image datasets demonstrate the superior clustering performance of IcicleGCN over the state-of-the-art.
arXiv Detail & Related papers (2022-07-14T19:16:56Z)
- In-N-Out Generative Learning for Dense Unsupervised Video Segmentation [89.21483504654282]
In this paper, we focus on the unsupervised Video Object Segmentation (VOS) task, which learns visual correspondence from unlabeled videos.
We propose the In-aNd-Out (INO) generative learning from a purely generative perspective, which captures both high-level and fine-grained semantics.
Our INO outperforms previous state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2022-03-29T07:56:21Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Do Vision Transformers See Like Convolutional Neural Networks? [45.69780772718875]
Recent work has shown that Vision Transformer models (ViT) can achieve comparable or even superior performance on image classification tasks.
Are they acting like convolutional networks, or learning entirely different visual representations?
We find striking differences between the two architectures, such as ViT having more uniform representations across all layers.
arXiv Detail & Related papers (2021-08-19T17:27:03Z)
- ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [76.16156833138038]
We propose a novel Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
In each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network (see the sketch after this list).
arXiv Detail & Related papers (2021-06-07T05:31:06Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- Contrastive Learning based Hybrid Networks for Long-Tailed Image Classification [31.647639786095993]
We propose a novel hybrid network structure composed of a supervised contrastive loss to learn image representations and a cross-entropy loss to learn classifiers.
Experiments on three long-tailed classification datasets demonstrate the advantage of the proposed contrastive learning based hybrid networks in long-tailed classification.
arXiv Detail & Related papers (2021-03-26T05:22:36Z)
- Deep Transformation-Invariant Clustering [24.23117820167443]
We present an approach that does not rely on abstract features but instead learns to predict image transformations.
This learning process naturally fits into the gradient-based training of K-means and Gaussian mixture models.
We demonstrate that our novel approach yields competitive and highly promising results on standard image clustering benchmarks.
arXiv Detail & Related papers (2020-06-19T13:43:08Z)
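As referenced in the ViTAE entry above, here is a minimal PyTorch sketch of a transformer layer with a convolution block parallel to the multi-head self-attention module, fusing the two branches before the feed-forward network. It is only an illustration of that one-sentence description: the ParallelBlock name, fusion by addition, and pre-norm placement are all assumptions, not the ViTAE implementation.

```python
# Hedged sketch of a parallel attention/convolution transformer layer.
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    def __init__(self, dim=256, heads=8, grid=14):
        super().__init__()
        self.grid = grid  # patch grid side, for token <-> image reshape
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Sequential(                     # parallel local branch
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1))
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def forward(self, x):  # x: (B, N, dim) with N == grid * grid
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)               # global branch
        b, n, d = x.shape
        img = h.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        conv_out = self.conv(img).flatten(2).transpose(1, 2)
        x = x + attn_out + conv_out                    # fuse both branches
        return x + self.ffn(self.norm2(x))             # FFN on fused features

tokens = torch.randn(2, 14 * 14, 256)
out = ParallelBlock()(tokens)                          # -> (2, 196, 256)
```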
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.