Related papers: ClusterFormer: Clustering As A Universal Visual Learner

URL: http://arxiv.org/abs/2309.13196v3
Date: Fri, 6 Oct 2023 00:38:16 GMT
Title: ClusterFormer: Clustering As A Universal Visual Learner
Authors: James C. Liang, Yiming Cui, Qifan Wang, Tong Geng, Wenguan Wang, Dongfang Liu
Abstract summary: CLUSTERFORMER is a universal vision model based on the CLUSTERing paradigm with TransFORMER. It is capable of tackling heterogeneous vision tasks with varying levels of clustering granularity. For its efficacy, we hope our work can catalyze a paradigm shift in universal models in computer vision.
Score: 80.79669078819562
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper presents CLUSTERFORMER, a universal vision model that is based on the CLUSTERing paradigm with TransFORMER. It comprises two novel designs: 1. recurrent cross-attention clustering, which reformulates the cross-attention mechanism in Transformer and enables recursive updates of cluster centers to facilitate strong representation learning; and 2. feature dispatching, which uses the updated cluster centers to redistribute image features through similarity-based metrics, resulting in a transparent pipeline. This elegant design streamlines an explainable and transferable workflow, capable of tackling heterogeneous vision tasks (i.e., image classification, object detection, and image segmentation) with varying levels of clustering granularity (i.e., image-, box-, and pixel-level). Empirical results demonstrate that CLUSTERFORMER outperforms various well-known specialized architectures, achieving 83.41% top-1 acc. over ImageNet-1K for image classification, 54.2% and 47.0% mAP over MS COCO for object detection and instance segmentation, 52.4% mIoU over ADE20K for semantic segmentation, and 55.8% PQ over COCO Panoptic for panoptic segmentation. For its efficacy, we hope our work can catalyze a paradigm shift in universal models in computer vision.
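Illustration: the two designs named in the abstract can be sketched roughly as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: it assumes scaled dot-product similarity, omits all learned projections and MLPs, and the names `recurrent_cluster` and `dispatch` are hypothetical. The first function alternates between softly assigning features to cluster centers and recomputing each center as a weighted mean; the second redistributes the updated centers back to the features by similarity.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def recurrent_cluster(features, centers, num_iters=3):
    """Recursively update cluster centers (assumed formulation).

    features: (N, D) array of image features.
    centers:  (K, D) array of initial cluster centers.
    """
    scale = np.sqrt(features.shape[1])
    for _ in range(num_iters):
        # Assignment step: softmax over the cluster axis, so each
        # feature spreads one unit of weight across the K centers.
        logits = centers @ features.T / scale            # (K, N)
        assign = softmax(logits, axis=0)
        # Update step: each center becomes the assignment-weighted
        # mean of the features.
        weights = assign / (assign.sum(axis=1, keepdims=True) + 1e-8)
        centers = weights @ features                     # (K, D)
    return centers

def dispatch(features, centers):
    """Redistribute center information to features by similarity."""
    scale = np.sqrt(features.shape[1])
    sim = softmax(features @ centers.T / scale, axis=1)  # (N, K)
    # Residual update: each feature receives a similarity-weighted
    # mixture of the centers.
    return features + sim @ centers

rng = np.random.default_rng(0)
features = rng.normal(size=(32, 8))          # 32 features of width 8
init_centers = rng.normal(size=(4, 8))       # 4 cluster centers
centers = recurrent_cluster(features, init_centers)
dispatched = dispatch(features, centers)
print(centers.shape, dispatched.shape)
```

In this sketch the number of centers plays the role of the clustering granularity mentioned in the abstract; how the paper ties it to image-, box-, and pixel-level tasks is not reproduced here.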

Related papers

FTCFormer: Fuzzy Token Clustering Transformer for Image Classification [22.410199372985584]
Transformer-based deep neural networks have achieved remarkable success across various computer vision tasks. Most transformer architectures embed images into uniform, grid-based vision tokens, neglecting the underlying semantic meanings of image regions. We propose Fuzzy Token Clustering Transformer (FTCFormer) to dynamically generate vision tokens based on the semantic meanings instead of spatial positions.
arXiv Detail & Related papers (2025-07-14T13:49:47Z)
Structural-Spectral Graph Convolution with Evidential Edge Learning for Hyperspectral Image Clustering [59.24638672786966]
Hyperspectral image (HSI) clustering assigns similar pixels to the same class without any annotations. Existing graph neural networks (GNNs) cannot fully exploit the spectral information of the input HSI. We propose a structural-spectral graph convolutional operator (SSGCO) tailored for graph-structured HSI superpixels.
arXiv Detail & Related papers (2025-06-11T16:41:34Z)
Neural Clustering based Visual Representation Learning [61.72646814537163]
Clustering is one of the most classic approaches in machine learning and data analysis. We propose feature extraction with clustering (FEC), which views feature extraction as a process of selecting representatives from data. FEC alternates between grouping pixels into individual clusters to abstract representatives and updating the deep features of pixels with current representatives.
arXiv Detail & Related papers (2024-03-26T06:04:50Z)
Superpixel Graph Contrastive Clustering with Semantic-Invariant Augmentations for Hyperspectral Images [64.72242126879503]
Hyperspectral image (HSI) clustering is an important but challenging task. We first use 3-D and 2-D hybrid convolutional neural networks to extract the high-order spatial and spectral features of HSI. We then design a superpixel graph contrastive clustering model to learn discriminative superpixel representations.
arXiv Detail & Related papers (2024-03-04T07:40:55Z)
Rethinking cluster-conditioned diffusion models for label-free image synthesis [1.4624458429745086]
Diffusion-based image generation models can enhance image quality when conditioned on ground truth labels. We investigate how individual clustering determinants, such as the number of clusters and the clustering method, impact image synthesis.
arXiv Detail & Related papers (2024-03-01T14:47:46Z)
Grid Jigsaw Representation with CLIP: A New Perspective on Image Clustering [37.15595383168132]
A jigsaw-based method for image clustering called Grid Jigsaw Representation (GJR), with systematic exposition from pixel to feature in discrepancy against human and computer. GJR modules are appended to a variety of deep convolutional networks and tested with significant improvements on a wide range of benchmark datasets. Experimental results show its effectiveness on the clustering task with respect to three metrics (ACC, NMI, and ARI), as well as very fast convergence.
arXiv Detail & Related papers (2023-10-27T03:07:05Z)
CVFC: Attention-Based Cross-View Feature Consistency for Weakly Supervised Semantic Segmentation of Pathology Images [3.2128744424771725]
Histopathology image segmentation is the gold standard for diagnosing cancer. Many studies now use image-level labels to achieve pixel-level segmentation to reduce the need for fine-grained annotation. We propose an attention-based cross-view feature consistency end-to-end pseudo-mask generation framework named CVFC.
arXiv Detail & Related papers (2023-08-21T03:50:09Z)
CLUSTSEG: Clustering for Universal Segmentation [56.58677563046506]
CLUSTSEG is a general, transformer-based framework for image segmentation. It tackles different image segmentation tasks (i.e., superpixel, semantic, instance, and panoptic) through a unified neural clustering scheme.
arXiv Detail & Related papers (2023-05-03T15:31:16Z)
Image as Set of Points [60.30495338399321]
Context clusters (CoCs) view an image as a set of unorganized points and extract features via a simplified clustering algorithm. Our CoCs are convolution- and attention-free, and rely only on the clustering algorithm for spatial interaction.
arXiv Detail & Related papers (2023-03-02T18:56:39Z)
A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation. In this paper, we target zero-shot semantic segmentation by building it on an off-the-shelf pre-trained vision-language model, i.e., CLIP. Our experimental results show that this simple framework surpasses the previous state of the art by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
Deep Transformation-Invariant Clustering [24.23117820167443]
We present an approach that does not rely on abstract features but instead learns to predict image transformations. This learning process naturally fits in the gradient-based training of K-means and Gaussian mixture models. We demonstrate that our novel approach yields competitive and highly promising results on standard image clustering benchmarks.
arXiv Detail & Related papers (2020-06-19T13:43:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.