Related papers: Improve Contrastive Clustering Performance by Multiple Fusing-Augmenting ViT Blocks

Improve Contrastive Clustering Performance by Multiple Fusing-Augmenting ViT Blocks

URL: http://arxiv.org/abs/2511.08883v1
Date: Thu, 13 Nov 2025 01:13:54 GMT
Title: Improve Contrastive Clustering Performance by Multiple Fusing-Augmenting ViT Blocks
Authors: Cheng Wang, Shuisheng Zhou, Fengjiao Peng, Jin Sheng, Feng Ye, Yinli Dong,
Abstract summary: We design a novel fusing-augmenting ViT blocks (MFAVBs) based on the excellent feature learning ability of Vision Transformers (ViT)<n>MFAVBs serving as the backbone for contrastive clustering outperforms the state-of-the-art techniques in terms of clustering performance.
Score: 9.66175383697435
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In the field of image clustering, the widely used contrastive learning networks improve clustering performance by maximizing the similarity between positive pairs and the dissimilarity of negative pairs of the inputs. Extant contrastive learning networks, whose two encoders often implicitly interact with each other by parameter sharing or momentum updating, may not fully exploit the complementarity and similarity of the positive pairs to extract clustering features from input data. To explicitly fuse the learned features of positive pairs, we design a novel multiple fusing-augmenting ViT blocks (MFAVBs) based on the excellent feature learning ability of Vision Transformers (ViT). Firstly, two preprocessed augmentions as positive pairs are separately fed into two shared-weight ViTs, then their output features are fused to input into a larger ViT. Secondly, the learned features are split into a pair of new augmented positive samples and passed to the next FAVBs, enabling multiple fusion and augmention through MFAVBs operations. Finally, the learned features are projected into both instance-level and clustering-level spaces to calculate the cross-entropy loss, followed by parameter updates by backpropagation to finalize the training process. To further enhance ability of the model to distinguish between similar images, our input data for the network we propose is preprocessed augmentions with features extracted from the CLIP pretrained model. Our experiments on seven public datasets demonstrate that MFAVBs serving as the backbone for contrastive clustering outperforms the state-of-the-art techniques in terms of clustering performance.

Related papers

BHViT: Binarized Hybrid Vision Transformer [53.38894971164072]
Model binarization has made significant progress in enabling real-time and energy-efficient computation for convolutional neural networks (CNN)<n>We propose BHViT, a binarization-friendly hybrid ViT architecture and its full binarization model with the guidance of three important observations.<n>Our proposed algorithm achieves SOTA performance among binary ViT methods.
arXiv Detail & Related papers (2025-03-04T08:35:01Z)
Local Consensus Enhanced Siamese Network with Reciprocal Loss for Two-view Correspondence Learning [35.5851523517487]
Two-view correspondence learning usually establish an end-to-end network to jointly predict correspondence reliability and relative pose. We propose a Local Feature Consensus (LFC) plugin block to augment the features of existing models. We extend existing models to a Siamese network with a reciprocal loss that exploits the supervision of mutual projection.
arXiv Detail & Related papers (2023-08-06T22:20:09Z)
A Simplified Framework for Contrastive Learning for Node Representations [2.277447144331876]
We investigate the potential of deploying contrastive learning in combination with Graph Neural Networks for embedding nodes in a graph. We show that the quality of the resulting embeddings and training time can be significantly improved by a simple column-wise postprocessing of the embedding matrix. This modification yields improvements in downstream classification tasks of up to 1.5% and even beats existing state-of-the-art approaches on 6 out of 8 different benchmarks.
arXiv Detail & Related papers (2023-05-01T02:04:36Z)
AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation [80.33846577924363]
We present All-Pairs Multi-Field Transforms (AMT), a new network architecture for video framegithub. It is based on two essential designs. First, we build bidirectional volumes for all pairs of pixels, and use the predicted bilateral flows to retrieve correlations. Second, we derive multiple groups of fine-grained flow fields from one pair of updated coarse flows for performing backward warping on the input frames separately.
arXiv Detail & Related papers (2023-04-19T16:18:47Z)
Contrastive variational information bottleneck for aspect-based sentiment analysis [36.83876224466177]
We propose to reduce spurious correlations for aspect-based sentiment analysis (ABSA) via a novel Contrastive Variational Information Bottleneck framework (called CVIB) The proposed CVIB framework is composed of an original network and a self-pruned network, and these two networks are optimized simultaneously via contrastive learning. Our approach achieves better performance than the strong competitors in terms of overall prediction performance, robustness, and generalization.
arXiv Detail & Related papers (2023-03-06T02:52:37Z)
Cluster-guided Contrastive Graph Clustering Network [53.16233290797777]
We propose a Cluster-guided Contrastive deep Graph Clustering network (CCGC) We construct two views of the graph by designing special Siamese encoders whose weights are not shared between the sibling sub-networks. To construct semantic meaningful negative sample pairs, we regard the centers of different high-confidence clusters as negative samples.
arXiv Detail & Related papers (2023-01-03T13:42:38Z)
GraphLearner: Graph Node Clustering with Fully Learnable Augmentation [76.63963385662426]
Contrastive deep graph clustering (CDGC) leverages the power of contrastive learning to group nodes into different clusters. We propose a Graph Node Clustering with Fully Learnable Augmentation, termed GraphLearner. It introduces learnable augmentors to generate high-quality and task-specific augmented samples for CDGC.
arXiv Detail & Related papers (2022-12-07T10:19:39Z)
Vision Transformer for Contrastive Clustering [48.476602271481674]
Vision Transformer (ViT) has shown its advantages over the convolutional neural network (CNN) This paper presents an end-to-end deep image clustering approach termed Vision Transformer for Contrastive Clustering (VTCC)
arXiv Detail & Related papers (2022-06-26T17:00:35Z)
Semantics-Depth-Symbiosis: Deeply Coupled Semi-Supervised Learning of Semantics and Depth [83.94528876742096]
We tackle the MTL problem of two dense tasks, ie, semantic segmentation and depth estimation, and present a novel attention module called Cross-Channel Attention Module (CCAM) In a true symbiotic spirit, we then formulate a novel data augmentation for the semantic segmentation task using predicted depth called AffineMix, and a simple depth augmentation using predicted semantics called ColorAug. Finally, we validate the performance gain of the proposed method on the Cityscapes dataset, which helps us achieve state-of-the-art results for a semi-supervised joint model based on depth and semantic
arXiv Detail & Related papers (2022-06-21T17:40:55Z)
Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models. Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings. We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z)
Mixing Consistent Deep Clustering [3.5786621294068373]
Good latent representations produce semantically mixed outputs when decoding linears of two latent representations. We propose the Mixing Consistent Deep Clustering method which encourages representations to appear realistic. We show that the proposed method can be added to existing autoencoders to further improve clustering performance.
arXiv Detail & Related papers (2020-11-03T19:47:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.