Grid Jigsaw Representation with CLIP: A New Perspective on Image Clustering
- URL: http://arxiv.org/abs/2310.17869v2
- Date: Thu, 13 Feb 2025 10:02:14 GMT
- Title: Grid Jigsaw Representation with CLIP: A New Perspective on Image Clustering
- Authors: Zijie Song, Zhenzhen Hu, Richang Hong
- Abstract summary: We propose a new perspective on image clustering, the pretrain-based Grid Jigsaw Representation (pGJR).
Inspired by human jigsaw puzzle processing, we modify the traditional jigsaw learning to gain a more sequential and incremental understanding of image structure.
Our experiments demonstrate that using the pretrained model as a feature extractor can accelerate the convergence of clustering.
- Score: 33.05984601411495
- Abstract: Unsupervised representation learning for image clustering is essential in computer vision. Although the advancement of visual models has improved image clustering with efficient visual representations, challenges still remain. Firstly, existing features often lack the ability to represent the internal structure of images, hindering the accurate clustering of visually similar images. Secondly, finer-grained semantic labels are often missing, limiting the ability to capture nuanced differences and similarities between images. In this paper, we propose a new perspective on image clustering, the pretrain-based Grid Jigsaw Representation (pGJR). Inspired by human jigsaw puzzle processing, we modify the traditional jigsaw learning to gain a more sequential and incremental understanding of image structure. We also leverage the pretrained CLIP to extract the prior features which can benefit from the enhanced cross-modal representation for richer and more nuanced semantic information and label level differentiation. Our experiments demonstrate that using the pretrained model as a feature extractor can accelerate the convergence of clustering. We append the GJR module to pGJR and observe significant improvements on common-use benchmark datasets. The experimental results highlight the effectiveness of our approach in the clustering task, as evidenced by improvements in the ACC, NMI, and ARI metrics, as well as the super-fast convergence speed.
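The abstract reports results in terms of ACC, NMI, and ARI. As a rough illustration of the first of these, clustering accuracy is commonly computed by finding the one-to-one mapping from cluster ids to ground-truth classes that maximizes the number of correctly labeled samples. The sketch below is not the authors' evaluation code: the function name `clustering_accuracy` is hypothetical, and it uses a brute-force search over mappings (practical only for a small number of clusters) where evaluation scripts usually use the Hungarian algorithm.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids):
    """Best-match clustering accuracy (ACC), brute force over label mappings.

    Hypothetical helper for illustration; assumes there are no more
    clusters than ground-truth classes.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("inputs must have the same length")
    if len(true_labels) == 0:
        raise ValueError("inputs must not be empty")

    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(classes):
        raise ValueError("this sketch assumes no more clusters than classes")

    best = 0
    # Try every one-to-one assignment of cluster ids to class labels
    # and keep the assignment with the most matching samples.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        correct = sum(mapping[c] == t for t, c in zip(true_labels, cluster_ids))
        best = max(best, correct)
    return best / len(true_labels)


# Cluster ids are arbitrary, so a pure relabeling still counts as correct.
true = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 0, 2, 0]
print(clustering_accuracy(true, pred))
```

In the usage example, clusters 1, 0, and 2 map to classes 0, 1, and 2 respectively, and only the last sample is misassigned. For benchmarks with many classes the factorial search becomes infeasible, which is why an assignment solver is the usual choice.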
Related papers
- Dual Advancement of Representation Learning and Clustering for Sparse and Noisy Images [14.836487514037994]
Sparse and noisy images (SNIs) pose significant challenges for effective representation learning and clustering.
We propose Dual Advancement of Representation Learning and Clustering (DARLC) to enhance the representations derived from masked image modeling.
Our framework offers a comprehensive approach that improves the learning of representations by enhancing their local perceptibility, distinctiveness, and the understanding of relational semantics.
arXiv Detail & Related papers (2024-09-03T10:52:27Z)
- Neural Clustering based Visual Representation Learning [61.72646814537163]
Clustering is one of the most classic approaches in machine learning and data analysis.
We propose feature extraction with clustering (FEC), which views feature extraction as a process of selecting representatives from data.
FEC alternates between grouping pixels into individual clusters to abstract representatives and updating the deep features of pixels with current representatives.
arXiv Detail & Related papers (2024-03-26T06:04:50Z)
- ClusterFormer: Clustering As A Universal Visual Learner [80.79669078819562]
CLUSTERFORMER is a universal vision model based on the CLUSTERing paradigm with TransFORMER.
It is capable of tackling heterogeneous vision tasks with varying levels of clustering granularity.
For its efficacy, we hope our work can catalyze a paradigm shift in universal models in computer vision.
arXiv Detail & Related papers (2023-09-22T22:12:30Z)
- Image Clustering via the Principle of Rate Reduction in the Age of Pretrained Models [37.574691902971296]
We propose a novel image clustering pipeline that leverages the powerful feature representation of large pre-trained models.
We show that our pipeline works well on standard datasets such as CIFAR-10, CIFAR-100, and ImageNet-1k.
arXiv Detail & Related papers (2023-06-08T15:20:27Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- Vision Transformer for Contrastive Clustering [48.476602271481674]
Vision Transformer (ViT) has shown its advantages over the convolutional neural network (CNN).
This paper presents an end-to-end deep image clustering approach termed Vision Transformer for Contrastive Clustering (VTCC).
arXiv Detail & Related papers (2022-06-26T17:00:35Z)
- Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
It brings a great benefit by scaling dimensions of depth/width/resolution/patch size without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z)
- Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
arXiv Detail & Related papers (2020-12-09T12:40:13Z)
- G-SimCLR: Self-Supervised Contrastive Learning with Guided Projection via Pseudo Labelling [0.8164433158925593]
In computer vision, it is evident that deep neural networks perform better in a supervised setting with a large amount of labeled data.
In this work, we propose that, with the normalized temperature-scaled cross-entropy (NT-Xent) loss function, it is beneficial to not have images of the same category in the same batch.
We use the latent space representation of a denoising autoencoder trained on the unlabeled dataset and cluster them with k-means to obtain pseudo labels.
arXiv Detail & Related papers (2020-09-25T02:25:37Z)
- Deep Transformation-Invariant Clustering [24.23117820167443]
We present an approach that does not rely on abstract features but instead learns to predict image transformations.
This learning process naturally fits into the gradient-based training of K-means and Gaussian mixture models.
We demonstrate that our novel approach yields competitive and highly promising results on standard image clustering benchmarks.
arXiv Detail & Related papers (2020-06-19T13:43:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.