FTCFormer: Fuzzy Token Clustering Transformer for Image Classification
- URL: http://arxiv.org/abs/2507.10283v1
- Date: Mon, 14 Jul 2025 13:49:47 GMT
- Title: FTCFormer: Fuzzy Token Clustering Transformer for Image Classification
- Authors: Muyi Bao, Changyu Zeng, Yifan Wang, Zhengni Yang, Zimu Wang, Guangliang Cheng, Jun Qi, Wei Wang
- Abstract summary: Transformer-based deep neural networks have achieved remarkable success across various computer vision tasks. Most transformer architectures embed images into uniform, grid-based vision tokens, neglecting the underlying semantic meanings of image regions. We propose Fuzzy Token Clustering Transformer (FTCFormer) to dynamically generate vision tokens based on the semantic meanings instead of spatial positions.
- Score: 22.410199372985584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based deep neural networks have achieved remarkable success across various computer vision tasks, largely attributed to their long-range self-attention mechanism and scalability. However, most transformer architectures embed images into uniform, grid-based vision tokens, neglecting the underlying semantic meanings of image regions, which results in suboptimal feature representations. To address this issue, we propose the Fuzzy Token Clustering Transformer (FTCFormer), which incorporates a novel clustering-based downsampling module to dynamically generate vision tokens based on semantic meanings instead of spatial positions. It allocates fewer tokens to less informative regions and more tokens to semantically important regions, regardless of their spatial adjacency or shape irregularity. To further enhance feature extraction and representation, we propose a Density Peak Clustering-Fuzzy K-Nearest Neighbor (DPC-FKNN) mechanism for clustering center determination, a Spatial Connectivity Score (SCS) for token assignment, and a channel-wise merging (Cmerge) strategy for token merging. Extensive experiments on 32 datasets across diverse domains validate the effectiveness of FTCFormer on image classification, showing consistent improvements over the TCFormer baseline: gains of 1.43% on five fine-grained datasets, 1.09% on six natural image datasets, 0.97% on three medical datasets, and 0.55% on four remote sensing datasets. The code is available at: https://github.com/BaoBao0926/FTCFormer/tree/main.
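The abstract names three mechanisms (DPC-FKNN for choosing cluster centers, SCS for token assignment, Cmerge for token merging) without implementation detail. The sketch below is a minimal, hypothetical illustration of density-peak-style token downsampling: the function name, the k and num_centers parameters, the plain nearest-center assignment, and the per-channel softmax weights are assumptions for illustration only; the actual DPC-FKNN, SCS, and Cmerge code lives in the linked repository.

```python
# Minimal, illustrative sketch of clustering-based token downsampling in the
# spirit of the abstract above. Everything here is an assumption: assignment
# uses plain nearest-center distance instead of the paper's Spatial
# Connectivity Score, and the per-channel softmax weights only approximate
# the idea of channel-wise merging. See the official repository for the
# actual DPC-FKNN, SCS, and Cmerge implementations.
import torch


def cluster_downsample(tokens: torch.Tensor, num_centers: int, k: int = 5):
    """tokens: (B, N, C) -> merged tokens of shape (B, num_centers, C)."""
    B, N, C = tokens.shape
    dist = torch.cdist(tokens, tokens)                     # (B, N, N) pairwise distances

    # Fuzzy-kNN-style local density: higher when the k nearest tokens are close.
    knn_dist, _ = dist.topk(k + 1, largest=False)          # includes self (distance 0)
    density = (-knn_dist[..., 1:].mean(dim=-1)).exp()      # (B, N)

    # Density-peak score: distance to the nearest token of higher density.
    higher = density.unsqueeze(1) > density.unsqueeze(2)   # [b, i, j]: is j denser than i?
    delta = dist.masked_fill(~higher, float("inf")).min(dim=-1).values
    delta = torch.where(torch.isinf(delta), dist.amax(dim=-1), delta)
    score = density * delta                                # classic DPC decision score

    # Pick cluster centers, then assign every token to its nearest center.
    center_idx = score.topk(num_centers, dim=-1).indices   # (B, M)
    centers = torch.gather(tokens, 1, center_idx.unsqueeze(-1).expand(-1, -1, C))
    assign = torch.cdist(tokens, centers).argmin(dim=-1)   # (B, N)

    # Channel-wise weighted merge of the tokens inside each cluster.
    weights = tokens.softmax(dim=1)                        # placeholder per-channel importance
    idx = assign.unsqueeze(-1).expand(-1, -1, C)
    merged = tokens.new_zeros(B, num_centers, C).scatter_add_(1, idx, tokens * weights)
    norm = tokens.new_zeros(B, num_centers, C).scatter_add_(1, idx, weights)
    return merged / norm.clamp_min(1e-6)
```

In a hierarchy such as TCFormer/FTCFormer, a module of this kind would presumably replace grid-based pooling between stages, reducing N tokens to num_centers tokens grouped by content rather than by position.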
Related papers
- No-Reference Image Quality Assessment with Global-Local Progressive Integration and Semantic-Aligned Quality Transfer [6.095342999639137]
We develop a dual-measurement framework that combines a vision Transformer (ViT)-based global feature extractor with a convolutional neural network (CNN)-based local feature extractor. We introduce a semantic-aligned quality transfer method that extends the training data by automatically labeling the quality scores of diverse image content with subjective opinion scores.
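As a rough illustration of such a dual-branch design, the sketch below pairs a ViT branch (global features) with a CNN branch (local features) and regresses a single quality score from their concatenation. The torchvision backbones, feature dimensions, and fusion head are stand-in assumptions, not the paper's actual architecture.

```python
# Hypothetical dual-branch quality regressor: ViT for global context, CNN for
# local detail, fused into one score. Backbones and head sizes are assumptions.
import torch
import torch.nn as nn
import torchvision.models as tvm


class DualBranchIQA(nn.Module):
    def __init__(self):
        super().__init__()
        self.global_branch = tvm.vit_b_16(weights=None)
        self.global_branch.heads = nn.Identity()            # expose 768-d CLS features
        cnn = tvm.resnet18(weights=None)
        self.local_branch = nn.Sequential(*list(cnn.children())[:-1])  # drop the FC layer
        self.head = nn.Sequential(nn.Linear(768 + 512, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x):                                    # x: (B, 3, 224, 224)
        g = self.global_branch(x)                            # (B, 768) global features
        l = self.local_branch(x).flatten(1)                  # (B, 512) local features
        return self.head(torch.cat([g, l], dim=1)).squeeze(-1)  # (B,) quality scores
```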
arXiv Detail & Related papers (2024-08-07T16:34:32Z)
- TCFormer: Visual Recognition via Token Clustering Transformer [79.24723479088097]
We propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning.
Our dynamic tokens possess two crucial characteristics: (1) representing image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and representing them using fine tokens.
arXiv Detail & Related papers (2024-07-16T02:26:18Z)
- AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution to different regions of the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z)
- ClusterFormer: Clustering As A Universal Visual Learner [80.79669078819562]
CLUSTERFORMER is a universal vision model based on the CLUSTERing paradigm with TransFORMER.
It is capable of tackling heterogeneous vision tasks with varying levels of clustering granularity.
Given its efficacy, we hope our work can catalyze a paradigm shift in universal models in computer vision.
arXiv Detail & Related papers (2023-09-22T22:12:30Z)
- Domain Adaptive Semantic Segmentation by Optimal Transport [13.133890240271308]
Semantic scene segmentation has received a great deal of attention due to the richness of the semantic information it contains.
Current approaches are mainly based on convolutional neural networks (CNNs), but they rely on a large number of labels.
We propose a domain adaptation (DA) framework based on optimal transport (OT) and attention mechanism to address this issue.
arXiv Detail & Related papers (2023-03-29T03:33:54Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
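A rough sketch of this idea follows, under the assumption of plain k-means over the key tokens (the paper's actual clustering and aggregation may differ): queries attend to a small set of aggregated key/value tokens, so attention cost scales roughly with N x num_clusters rather than N^2.

```python
# Illustrative clustered-token attention: cluster keys with a few k-means
# steps, aggregate keys/values per cluster, then let all queries attend over
# the reduced set. The clustering choice here is an assumption, not ClusTR's.
import torch
import torch.nn.functional as F


def clustered_attention(q, k, v, num_clusters: int = 64, iters: int = 5):
    """q, k, v: (B, N, D) -> output of shape (B, N, D)."""
    B, N, D = k.shape
    centroids = k[:, torch.randperm(N, device=k.device)[:num_clusters]]  # (B, M, D)
    for _ in range(iters):                                    # plain k-means on the keys
        assign = torch.cdist(k, centroids).argmin(dim=-1)     # (B, N) cluster ids
        onehot = F.one_hot(assign, num_clusters).type_as(k)   # (B, N, M)
        counts = onehot.sum(dim=1).clamp_min(1).unsqueeze(-1)  # (B, M, 1)
        centroids = onehot.transpose(1, 2) @ k / counts       # recompute cluster means
    k_c = centroids                                           # aggregated keys
    v_c = onehot.transpose(1, 2) @ v / counts                 # aggregated values
    attn = (q @ k_c.transpose(1, 2) / D ** 0.5).softmax(dim=-1)  # (B, N, M)
    return attn @ v_c                                         # (B, N, D)
```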
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
We adopt transformers in this work and incorporate them into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- Semantic Labeling of High Resolution Images Using EfficientUNets and Transformers [5.177947445379688]
We propose a new segmentation model that combines convolutional neural networks with deep transformers.
Our results demonstrate that the proposed methodology improves segmentation accuracy compared to state-of-the-art techniques.
arXiv Detail & Related papers (2022-06-20T12:03:54Z)
- Transformer-Guided Convolutional Neural Network for Cross-View Geolocalization [20.435023745201878]
We propose a novel Transformer-guided convolutional neural network (TransGCNN) architecture.
Our TransGCNN consists of a CNN backbone extracting feature map from an input image and a Transformer head modeling global context.
Experiments on popular benchmark datasets demonstrate that our model achieves top-1 accuracy of 94.12% and 84.92% on CVUSA and CVACT_val, respectively.
arXiv Detail & Related papers (2022-04-21T08:46:41Z)
- Pyramid Fusion Transformer for Semantic Segmentation [44.57867861592341]
We propose the Pyramid Fusion Transformer (PFT), a transformer-based model for per-mask semantic segmentation with multi-scale features.
We achieve competitive performance on three widely used semantic segmentation datasets.
arXiv Detail & Related papers (2022-01-11T16:09:25Z)
- MlTr: Multi-label Classification with Transformer [35.14232810099418]
We propose a Multi-label Transformer architecture (MlTr) constructed with window partitioning, in-window pixel attention, and cross-window attention.
The proposed MlTr shows state-of-the-art results on various prevalent multi-label datasets such as MS-COCO, Pascal-VOC, and NUS-WIDE.
arXiv Detail & Related papers (2021-06-11T06:53:09Z)
- Conformer: Local Features Coupling Global Representations for Visual Recognition [72.9550481476101]
We propose a hybrid network structure, termed Conformer, to take advantage of convolutional operations and self-attention mechanisms for enhanced representation learning.
Experiments show that Conformer, under comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet.
arXiv Detail & Related papers (2021-05-09T10:00:03Z)
- Landmark-Aware and Part-based Ensemble Transfer Learning Network for Facial Expression Recognition from Static images [0.5156484100374059]
The Part-based Ensemble Transfer Learning network models how humans recognize facial expressions.
It consists of 5 sub-networks, in which each sub-network performs transfer learning from one of the five subsets of facial landmarks.
It requires only $3.28 \times 10^6$ FLOPs, which ensures computational efficiency for real-time deployment.
arXiv Detail & Related papers (2021-04-22T18:38:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.