Random Wins All: Rethinking Grouping Strategies for Vision Tokens
- URL: http://arxiv.org/abs/2603.00486v1
- Date: Sat, 28 Feb 2026 05:59:25 GMT
- Title: Random Wins All: Rethinking Grouping Strategies for Vision Tokens
- Authors: Qihang Fan, Yuang Ai, Huaibo Huang, Ran He
- Abstract summary: A representative approach involves grouping tokens and performing self-attention within each group, or pooling the tokens within each group into a single token. We propose a simple and fast random grouping strategy for vision tokens.
- Score: 42.61073068532527
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since Transformers were introduced into vision architectures, their quadratic complexity has been a significant issue that many research efforts aim to address. A representative approach involves grouping tokens and either performing self-attention within each group or pooling the tokens within each group into a single token. To this end, various carefully designed grouping strategies have been proposed to enhance the performance of Vision Transformers. Here, we pose the following questions: \textbf{Are these carefully designed grouping methods truly necessary? Is there a simpler and more unified token grouping method that can replace these diverse methods?} Motivated by these questions, we propose a simple and fast random grouping strategy for vision tokens. We validate this approach on multiple baselines, and experiments show that random grouping outperforms almost all other grouping methods. When transferred to downstream tasks, such as object detection, random grouping demonstrates even more pronounced advantages. In response to this phenomenon, we conduct a detailed analysis of the advantages of random grouping from multiple perspectives and identify several crucial elements for the design of grouping strategies: positional information, head feature diversity, global receptive field, and fixed grouping pattern. We demonstrate that as long as these four conditions are met, vision tokens require only an extremely simple grouping strategy to efficiently and effectively handle various visual tasks. We also validate the effectiveness of our proposed random method across multiple modalities, including visual tasks, point cloud processing, and vision-language models. Code will be available at https://github.com/qhfan/random.
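The core idea described in the abstract (shuffle tokens, attend within fixed-size groups, restore the original order) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the function name, single-head attention, and fixed-seed shuffle (standing in for the "fixed grouping pattern" the authors highlight) are illustrative assumptions.

```python
import numpy as np

def random_group_attention(x, group_size, seed=0):
    """Illustrative sketch: randomly partition tokens into fixed-size
    groups, apply softmax self-attention within each group, then restore
    the original token order. `x` has shape (num_tokens, dim); num_tokens
    must be divisible by group_size."""
    n, d = x.shape
    rng = np.random.default_rng(seed)      # fixed seed -> fixed grouping pattern
    perm = rng.permutation(n)
    inv = np.argsort(perm)                 # inverse permutation to undo the shuffle
    groups = x[perm].reshape(n // group_size, group_size, d)

    # Plain scaled dot-product self-attention inside each group.
    scores = groups @ groups.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    out = attn @ groups

    return out.reshape(n, d)[inv]          # back to original token order

tokens = np.random.default_rng(1).standard_normal((16, 8))
out = random_group_attention(tokens, group_size=4)
```

With groups of size `g`, each group's attention costs O(g^2) rather than the O(n^2) of full self-attention, which is the efficiency motivation shared by all the grouping strategies the paper compares.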
Related papers
- GroupCoOp: Group-robust Fine-tuning via Group Prompt Learning [57.888537648437115]
Group Context Optimization (GroupCoOp) is a simple and effective debiased fine-tuning algorithm. It enhances the group robustness of fine-tuned vision-language models (VLMs). GroupCoOp achieved the best results on five benchmarks across five CLIP architectures.
arXiv Detail & Related papers (2025-09-28T09:54:30Z) - VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning [95.89543460132413]
Vision-language models (VLMs) have improved performance by increasing the number of visual tokens. However, most real-world scenarios do not require such an extensive number of visual tokens. We present a new paradigm for visual token compression, namely, VisionThink.
arXiv Detail & Related papers (2025-07-17T17:59:55Z) - Importance-Based Token Merging for Efficient Image and Video Generation [41.94334394794811]
We show that preserving high-information tokens during merging significantly improves sample quality. We propose an importance-based token merging method that prioritizes the most critical tokens in computational resource allocation.
arXiv Detail & Related papers (2024-11-23T02:01:49Z) - Vision Transformer based Random Walk for Group Re-Identification [15.63292108454152]
Group re-identification (re-ID) aims to match groups with the same people under different cameras.
We propose a novel vision transformer based random walk framework for group re-ID.
arXiv Detail & Related papers (2024-10-08T08:41:14Z) - The Research of Group Re-identification from Multiple Cameras [0.4955551943523977]
Group re-identification is very challenging since it is affected by the view-point and human pose variations of traditional re-identification tasks, among other sources of interference.
This paper introduces a novel approach which leverages the multi-granularity information inside groups to facilitate group re-identification.
arXiv Detail & Related papers (2024-07-19T18:28:13Z) - Semantic Equitable Clustering: A Simple and Effective Strategy for Clustering Vision Tokens [57.37893387775829]
We introduce a fast and balanced clustering method, named Semantic Equitable Clustering (SEC). SEC clusters tokens based on their global semantic relevance in an efficient, straightforward manner. We propose a versatile vision backbone, SECViT, to serve as a vision language connector.
arXiv Detail & Related papers (2024-05-22T04:49:00Z) - Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation [59.37587762543934]
This paper studies the problem of weakly open-vocabulary semantic segmentation (WOVSS).
Existing methods suffer from a granularity inconsistency regarding the usage of group tokens.
We propose the prototypical guidance network (PGSeg) that incorporates multi-modal regularization.
arXiv Detail & Related papers (2023-10-29T13:18:00Z) - Rethinking Sampling Strategies for Unsupervised Person Re-identification [59.47536050785886]
We analyze the reasons for the performance differences between various sampling strategies under the same framework and loss function. Group sampling is proposed, which gathers samples from the same class into groups. Experiments on Market-1501, DukeMTMC-reID and MSMT17 show that group sampling achieves performance comparable to state-of-the-art methods.
arXiv Detail & Related papers (2021-07-07T05:39:58Z) - Portfolio Search and Optimization for General Strategy Game-Playing [58.896302717975445]
We propose a new algorithm for optimization and action-selection based on the Rolling Horizon Evolutionary Algorithm.
For the optimization of the agents' parameters and portfolio sets we study the use of the N-tuple Bandit Evolutionary Algorithm.
An analysis of the agents' performance shows that the proposed algorithm generalizes well to all game-modes and is able to outperform other portfolio methods.
arXiv Detail & Related papers (2021-04-21T09:28:28Z) - Few-shot Knowledge Transfer for Fine-grained Cartoon Face Generation [11.951522183013811]
We propose a two-stage training process to generate cartoon faces for various groups.
First, a basic translation model for the basic group (which consists of sufficient data) is trained.
Then, given new samples of other groups, we extend the basic model by creating group-specific branches for each new group.
arXiv Detail & Related papers (2020-07-27T07:13:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.