Which Tokens to Use? Investigating Token Reduction in Vision
Transformers
- URL: http://arxiv.org/abs/2308.04657v1
- Date: Wed, 9 Aug 2023 01:51:07 GMT
- Title: Which Tokens to Use? Investigating Token Reduction in Vision
Transformers
- Authors: Joakim Bruslund Haurum, Sergio Escalera, Graham W. Taylor, Thomas B.
Moeslund
- Abstract summary: We study the reduction patterns of 10 different token reduction methods using four image classification datasets.
We find that the Top-K pruning method is a surprisingly strong baseline.
The similarity of reduction patterns is a moderate-to-strong proxy for model performance.
- Score: 64.99704164972513
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since the introduction of the Vision Transformer (ViT), researchers have
sought to make ViTs more efficient by removing redundant information in the
processed tokens. While different methods have been explored to achieve this
goal, we still lack understanding of the resulting reduction patterns and how
those patterns differ across token reduction methods and datasets. To close
this gap, we set out to understand the reduction patterns of 10 different token
reduction methods using four image classification datasets. By systematically
comparing these methods on the different classification tasks, we find that the
Top-K pruning method is a surprisingly strong baseline. Through in-depth
analysis of the different methods, we determine that: the reduction patterns
are generally not consistent when varying the capacity of the backbone model,
the reduction patterns of pruning-based methods significantly differ from fixed
radial patterns, and the reduction patterns of pruning-based methods are
correlated across classification datasets. Finally, we report that the
similarity of reduction patterns is a moderate-to-strong proxy for model
performance. Project page at https://vap.aau.dk/tokens.
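To make the Top-K baseline concrete, here is a minimal sketch of attention-based Top-K token pruning in PyTorch. This is an illustrative reimplementation, not the authors' code: scoring patch tokens by the CLS token's head-averaged attention, and the choice of keep rate, are assumptions. A small `pattern_iou` helper, an assumed stand-in for the reduction-pattern similarity the abstract refers to, is included.

```python
import torch

def topk_prune(tokens: torch.Tensor, cls_attn: torch.Tensor, keep: int) -> torch.Tensor:
    """Keep the `keep` highest-scoring patch tokens (the CLS token is always kept).

    tokens:   (B, 1 + N, D) -- CLS token followed by N patch tokens.
    cls_attn: (B, N)        -- CLS-to-patch attention, averaged over heads
                               (one common scoring choice, assumed here).
    """
    idx = cls_attn.topk(keep, dim=1).indices                 # (B, keep)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))  # (B, keep, D)
    kept = tokens[:, 1:, :].gather(1, idx)                   # gather kept patches
    return torch.cat([tokens[:, :1, :], kept], dim=1)        # (B, 1 + keep, D)

def pattern_iou(idx_a: torch.Tensor, idx_b: torch.Tensor, n_tokens: int) -> torch.Tensor:
    """IoU of two kept-token index sets for one image -- an illustrative
    stand-in for the reduction-pattern similarity studied in the paper."""
    a = torch.zeros(n_tokens, dtype=torch.bool)
    b = torch.zeros(n_tokens, dtype=torch.bool)
    a[idx_a] = True
    b[idx_b] = True
    return (a & b).sum() / (a | b).sum()
```

For a ViT-B/16 on 224x224 inputs (N = 196 patch tokens), calling `topk_prune` with keep = 98 halves the token count after the chosen block; comparing the kept indices of two methods with `pattern_iou` yields the kind of similarity score the abstract relates to downstream performance.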
Related papers
- SINDER: Repairing the Singular Defects of DINOv2 [61.98878352956125]
Vision Transformer models trained on large-scale datasets often exhibit artifacts in the patch tokens they extract.
We propose a novel fine-tuning smooth regularization that rectifies these structural deficiencies using only a small dataset.
arXiv Detail & Related papers (2024-07-23T20:34:23Z)
- Learning to Rank Patches for Unbiased Image Redundancy Reduction [80.93989115541966]
Images suffer from heavy spatial redundancy because pixels in neighboring regions are spatially correlated.
Existing approaches strive to overcome this limitation by reducing less meaningful image regions.
We propose a self-supervised framework for image redundancy reduction called Learning to Rank Patches.
arXiv Detail & Related papers (2024-03-31T13:12:41Z)
- Decoupled Prototype Learning for Reliable Test-Time Adaptation [50.779896759106784]
Test-time adaptation (TTA) is a task that continually adapts a pre-trained source model to the target domain during inference.
One popular approach involves fine-tuning the model with a cross-entropy loss according to estimated pseudo-labels.
This study reveals that minimizing the classification error of each sample makes the cross-entropy loss vulnerable to label noise.
We propose a novel Decoupled Prototype Learning (DPL) method that features prototype-centric loss computation.
arXiv Detail & Related papers (2024-01-15T03:33:39Z)
- Simplified Concrete Dropout -- Improving the Generation of Attribution Masks for Fine-grained Classification [8.330791157878137]
Fine-grained classification models are often deployed to determine animal species or individuals in automated animal monitoring systems.
Attention- or gradient-based methods are commonly used to identify the image regions that contribute most to the classification decision.
This paper circumvents the computational instabilities of Concrete Dropout (CD)-based attribution masks by simplifying the CD sampling and reducing reliance on large mini-batch sizes.
arXiv Detail & Related papers (2023-07-27T13:01:49Z)
- CEnt: An Entropy-based Model-agnostic Explainability Framework to Contrast Classifiers' Decisions [2.543865489517869]
We present a novel approach to locally contrast the prediction of any classifier.
Our Contrastive Entropy-based explanation method, CEnt, approximates a model locally by a decision tree to compute entropy information of different feature splits.
CEnt is the first non-gradient-based contrastive method to generate diverse counterfactuals that need not exist in the training data, while satisfying immutability (e.g., race) and semi-immutability (e.g., age can only increase) constraints.
arXiv Detail & Related papers (2023-01-19T08:23:34Z)
- TiCo: Transformation Invariance and Covariance Contrast for Self-Supervised Visual Representation Learning [9.507070656654632]
We present Transformation Invariance and Covariance Contrast (TiCo) for self-supervised visual representation learning.
Our method is based on maximizing the agreement among embeddings of different distorted versions of the same image.
We show that TiCo can be viewed as a variant of MoCo with an implicit memory bank of unlimited size at no extra memory cost.
arXiv Detail & Related papers (2022-06-21T19:44:01Z)
- Deblurring via Stochastic Refinement [85.42730934561101]
We present an alternative framework for blind deblurring based on conditional diffusion models.
Our method is competitive in terms of distortion metrics such as PSNR.
arXiv Detail & Related papers (2021-12-05T04:36:09Z)
- Learning explanations that are hard to vary [75.30552491694066]
We show that averaging gradients across examples can favor memorization and 'patchwork' solutions that sew together different strategies.
We then propose and experimentally validate a simple alternative algorithm based on a logical AND (a minimal sketch follows this list).
arXiv Detail & Related papers (2020-09-01T10:17:48Z)
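As an illustration of the logical-AND idea in the last entry, here is a minimal sketch (an assumed reimplementation, not the authors' code): a component of the averaged gradient is kept only where its sign agrees across all environments.

```python
import torch

def and_mask_grad(env_grads: torch.Tensor) -> torch.Tensor:
    """Combine per-environment gradients with a logical-AND consistency mask.

    env_grads: (E, P) -- one flattened gradient vector per environment.
    A component survives only if its sign is unanimous across all E
    environments; this strict all-agree rule is a simplification, since the
    paper also allows an agreement threshold.
    """
    signs = torch.sign(env_grads)           # (E, P) component-wise signs
    agree = (signs == signs[0]).all(dim=0)  # (P,) unanimous-sign mask
    return env_grads.mean(dim=0) * agree    # zero out disagreeing components
```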