Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens
- URL: http://arxiv.org/abs/2405.13337v2
- Date: Wed, 20 Nov 2024 05:17:51 GMT
- Title: Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens
- Authors: Qihang Fan, Huaibo Huang, Mingrui Chen, Ran He
- Abstract summary: We introduce a fast and balanced clustering method, named Semantic Equitable Clustering (SEC).
SEC clusters tokens based on their global semantic relevance in an efficient, straightforward manner.
Building on SEC, we propose a versatile vision backbone, SECViT; SEC can also serve as a vision-language connector for multimodal LLMs.
- Score: 57.37893387775829
- Abstract: The Vision Transformer (ViT) has gained prominence for its superior relational modeling prowess. However, its global attention mechanism's quadratic complexity poses substantial computational burdens. A common remedy spatially groups tokens for self-attention, reducing computational requirements. Nonetheless, this strategy neglects semantic information in tokens, possibly scattering semantically linked tokens across distinct groups, thus compromising the efficacy of self-attention intended for modeling inter-token dependencies. Motivated by these insights, we introduce a fast and balanced clustering method, named Semantic Equitable Clustering (SEC). SEC clusters tokens based on their global semantic relevance in an efficient, straightforward manner. In contrast to traditional clustering methods requiring multiple iterations, our method achieves token clustering in a single pass. Additionally, SEC regulates the number of tokens per cluster, ensuring a balanced distribution for effective parallel processing on current computational platforms without necessitating further optimization. Capitalizing on SEC, we propose a versatile vision backbone, SECViT. Comprehensive experiments in image classification, object detection, instance segmentation, and semantic segmentation validate the effectiveness of SECViT. Moreover, SEC can be conveniently and swiftly applied to multimodal large language models (MLLMs), such as LLaVA, to serve as a vision-language connector, effectively improving the model's efficiency while maintaining or even improving performance.
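The abstract suggests the clustering step reduces to a single sort plus an equal-size split. A minimal PyTorch sketch of that idea, assuming global semantic relevance is measured as cosine similarity to the mean token (an assumption for illustration; the paper's exact relevance measure may differ):

```python
import torch
import torch.nn.functional as F

def semantic_equitable_clustering(x: torch.Tensor, num_clusters: int):
    """Single-pass, size-balanced token clustering (sketch).

    x: [B, N, D] vision tokens; N must be divisible by num_clusters.
    'Global semantic relevance' is approximated by cosine similarity
    to the mean token -- an assumption, not the paper's definition.
    """
    B, N, D = x.shape
    global_tok = x.mean(dim=1, keepdim=True)           # [B, 1, D]
    sim = F.cosine_similarity(x, global_tok, dim=-1)   # [B, N]
    order = sim.argsort(dim=1, descending=True)        # one sort, no iterations
    x_sorted = x.gather(1, order.unsqueeze(-1).expand(-1, -1, D))
    # Equal-sized chunks: every cluster holds N // num_clusters tokens,
    # so per-cluster attention batches cleanly on current hardware.
    return x_sorted.view(B, num_clusters, N // num_clusters, D), order
```

Per-cluster self-attention can then run in parallel by folding the cluster axis into the batch, e.g. `clusters.flatten(0, 1)`.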
Related papers
- ResCLIP: Residual Attention for Training-free Dense Vision-language Inference [27.551367463011008]
Cross-correlation of self-attention in CLIP's non-final layers also exhibits localization properties.
We propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block.
The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference.
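A rough sketch of the remolding idea only (the RCS module's actual formulation may differ): aggregate intermediate-layer attention maps and re-apply them in place of the final block's own attention.

```python
import torch

def remold_final_attention(mid_attns, v_final):
    """mid_attns: list of [B, N, N] self-attention maps from
    intermediate blocks; v_final: [B, N, D] final-block values.
    Illustrative only: averages intermediate attention, re-applies it."""
    attn = torch.stack(mid_attns).mean(dim=0)        # [B, N, N]
    attn = attn / attn.sum(dim=-1, keepdim=True)     # keep rows normalized
    return attn @ v_final                            # [B, N, D]
```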
arXiv Detail & Related papers (2024-11-24T14:14:14Z)
- Effective SAM Combination for Open-Vocabulary Semantic Segmentation [24.126307031048203]
Open-vocabulary semantic segmentation aims to assign pixel-level labels to images across an unlimited range of classes.
ESC-Net is a novel one-stage open-vocabulary segmentation model that leverages the SAM decoder blocks for class-agnostic segmentation.
ESC-Net achieves superior performance on standard benchmarks, including ADE20K, PASCAL-VOC, and PASCAL-Context.
arXiv Detail & Related papers (2024-11-22T04:36:12Z)
- Agglomerative Token Clustering [61.0477253613511]
Agglomerative Token Clustering (ATC) is a novel token merging method that consistently outperforms previous methods.
We find that ATC achieves state-of-the-art performance across all tasks, and can even perform on par with prior state-of-the-art when applied off-the-shelf.
arXiv Detail & Related papers (2024-09-18T12:37:58Z)
- MST: Adaptive Multi-Scale Tokens Guided Interactive Segmentation [8.46894039954642]
We propose a novel multi-scale token adaptation algorithm for interactive segmentation.
By performing top-k operations across multi-scale tokens, the computational complexity is greatly simplified.
We also propose a token learning algorithm based on contrastive loss.
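The top-k step itself is simple; a generic sketch (the scoring function and multi-scale details are the paper's own and not reproduced here):

```python
import torch

def select_topk_tokens(tokens: torch.Tensor, scores: torch.Tensor, k: int):
    """tokens: [B, N, D]; scores: [B, N]. Keeps only the k
    highest-scoring tokens per sample, shrinking later attention cost."""
    idx = scores.topk(k, dim=1).indices                        # [B, k]
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))    # [B, k, D]
    return tokens.gather(1, idx)
```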
arXiv Detail & Related papers (2024-01-09T07:59:42Z)
- Rethinking Clustering-Based Pseudo-Labeling for Unsupervised Meta-Learning [146.11600461034746]
CACTUs, a method for unsupervised meta-learning, is a clustering-based approach with pseudo-labeling.
This approach is model-agnostic and can be combined with supervised algorithms to learn from unlabeled data.
We prove that the core reason for this limitation is the lack of a clustering-friendly property in the embedding space.
arXiv Detail & Related papers (2022-09-27T19:04:36Z)
- Learning the Precise Feature for Cluster Assignment [39.320210567860485]
We propose a framework which integrates representation learning and clustering into a single pipeline for the first time.
The proposed framework exploits the powerful ability of recently developed generative models for learning intrinsic features.
Experimental results show that the performance of the proposed method is superior, or at least comparable to, the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-11T04:08:54Z)
- You Never Cluster Alone [150.94921340034688]
We extend the mainstream contrastive learning paradigm to a cluster-level scheme, where all the data subjected to the same cluster contribute to a unified representation.
We define a set of categorical variables as clustering assignment confidence, which links the instance-level learning track with the cluster-level one.
By reparametrizing the assignment variables, TCC is trained end-to-end, requiring no alternating steps.
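One common way to reparametrize categorical assignment variables so they stay differentiable is the Gumbel-softmax trick; a sketch of how end-to-end training could then look (placeholder shapes, not TCC's exact formulation):

```python
import torch
import torch.nn.functional as F

B, K, D = 256, 10, 128
logits = torch.randn(B, K, requires_grad=True)   # assignment confidences (placeholder)
features = torch.randn(B, D)                     # instance embeddings (placeholder)

# Differentiable ~one-hot assignments: gradients flow through sampling,
# so no alternating optimization steps are needed.
assign = F.gumbel_softmax(logits, tau=0.5, hard=True)    # [B, K]
counts = assign.sum(dim=0).clamp(min=1.0).unsqueeze(-1)  # members per cluster
cluster_repr = (assign.t() @ features) / counts          # [K, D] unified cluster reps
```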
arXiv Detail & Related papers (2021-06-03T14:59:59Z)
- Unsupervised Visual Representation Learning by Online Constrained K-Means [44.38989920488318]
Cluster discrimination is an effective pretext task for unsupervised representation learning.
We propose a novel clustering-based pretext task with online Constrained K-means (CoKe).
Our online assignment method has a theoretical guarantee to approach the global optimum.
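A greedy capacity-constrained assignment conveys the flavor of constrained k-means (an illustrative sketch only; CoKe's online algorithm and its optimality guarantee are more involved):

```python
import torch

def constrained_assign(x: torch.Tensor, centroids: torch.Tensor, capacity: int):
    """x: [N, D] features; centroids: [K, D]; each cluster accepts
    at most `capacity` points (assumes N <= K * capacity).
    Points closest to some centroid are assigned first."""
    dist = torch.cdist(x, centroids)                 # [N, K]
    order = dist.min(dim=1).values.argsort()         # most confident first
    counts = torch.zeros(centroids.size(0), dtype=torch.long)
    assign = torch.full((x.size(0),), -1, dtype=torch.long)
    for i in order.tolist():
        for k in dist[i].argsort().tolist():         # closest feasible cluster
            if counts[k] < capacity:
                assign[i], counts[k] = k, counts[k] + 1
                break
    return assign
```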
arXiv Detail & Related papers (2021-05-24T20:38:32Z)
- Effective Unsupervised Domain Adaptation with Adversarially Trained Language Models [54.569004548170824]
We show that careful masking strategies can bridge the knowledge gap of masked language models.
We propose an effective training strategy that adversarially masks out those tokens which are harder for the underlying masked language model to reconstruct.
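A single-pass approximation of "mask the hardest tokens" (the paper trains the masking adversarially; this sketch only ranks tokens by their current loss):

```python
import torch

def hardest_token_mask(per_token_loss: torch.Tensor, mask_ratio: float = 0.15):
    """per_token_loss: [B, L] per-token MLM loss. Returns a boolean
    mask selecting the hardest mask_ratio fraction of tokens."""
    k = max(1, int(mask_ratio * per_token_loss.size(1)))
    idx = per_token_loss.topk(k, dim=1).indices      # hardest-to-predict tokens
    mask = torch.zeros_like(per_token_loss, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask
```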
arXiv Detail & Related papers (2020-10-05T01:49:47Z)
- Learning to Cluster Faces via Confidence and Connectivity Estimation [136.5291151775236]
We propose a fully learnable clustering framework without requiring a large number of overlapped subgraphs.
Our method significantly improves clustering accuracy and thus performance of the recognition models trained on top, yet it is an order of magnitude more efficient than existing supervised methods.
arXiv Detail & Related papers (2020-04-01T13:39:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.