Token Fusion: Bridging the Gap between Token Pruning and Token Merging
- URL: http://arxiv.org/abs/2312.01026v1
- Date: Sat, 2 Dec 2023 04:29:19 GMT
- Title: Token Fusion: Bridging the Gap between Token Pruning and Token Merging
- Authors: Minchul Kim, Shangqian Gao, Yen-Chang Hsu, Yilin Shen, Hongxia Jin
- Abstract summary: Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs.
However, their computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging.
We introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging.
- Score: 71.84591084401458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have emerged as powerful backbones in computer
vision, outperforming many traditional CNNs. However, their computational
overhead, largely attributed to the self-attention mechanism, makes deployment
on resource-constrained edge devices challenging. Multiple solutions rely on
token pruning or token merging. In this paper, we introduce "Token Fusion"
(ToFu), a method that amalgamates the benefits of both token pruning and token
merging. Token pruning proves advantageous when the model exhibits sensitivity
to input interpolations, while token merging is effective when the model
manifests close-to-linear responses to inputs. We combine these observations into a new
scheme called Token Fusion. Moreover, we address a limitation of average
merging, which does not preserve the intrinsic feature norm and thereby causes
distributional shifts. To mitigate this, we introduce MLERP merging, a variant
of the SLERP technique, tailored to merge multiple tokens while maintaining the
norm distribution. ToFu is versatile, applicable to ViTs with or without
additional training. Our empirical evaluations indicate that ToFu establishes
new benchmarks in both classification and image generation tasks concerning
computational efficiency and model accuracy.
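The most concrete technical point in the abstract is that plain average merging shrinks token norms and shifts the feature distribution, which MLERP merging is meant to avoid. The short NumPy sketch below illustrates that effect against a norm-preserving merge; the rule used here (average the token directions, then rescale to the mean input norm) is an assumption for illustration and may differ from the paper's exact MLERP formulation.

```python
import numpy as np

def average_merge(tokens):
    """Plain average of a group of token embeddings (tokens: [k, d])."""
    return tokens.mean(axis=0)

def norm_preserving_merge(tokens):
    """Illustrative stand-in for MLERP: average the token *directions*,
    then rescale the result to the mean input norm. The paper's exact
    MLERP rule may differ."""
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)       # [k, 1]
    mean_dir = (tokens / norms).mean(axis=0)                    # average unit vector
    mean_dir /= np.linalg.norm(mean_dir) + 1e-8                 # renormalize
    return mean_dir * norms.mean()                              # restore the norm scale

rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 16))                               # two tokens to merge
print(np.linalg.norm(tokens, axis=1).mean())                    # mean input norm
print(np.linalg.norm(average_merge(tokens)))                    # typically smaller
print(np.linalg.norm(norm_preserving_merge(tokens)))            # matches the mean input norm
```

In ToFu itself, the per-layer choice between pruning and merging is driven by how linearly the model responds to interpolated inputs; the sketch above only covers the merging side.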
Related papers
- Video Token Merging for Long-form Video Understanding [17.59960070514554]
We propose a learnable video token merging algorithm that dynamically merges tokens based on their saliency.
Our approach significantly reduces memory costs by 84% and boosts throughput by approximately 6.89 times compared to baseline algorithms.
arXiv Detail & Related papers (2024-10-31T09:55:32Z)
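As a rough illustration of saliency-driven token merging (not the paper's algorithm; the saliency scores and the fold-into-the-nearest-kept-token rule below are assumptions):

```python
import numpy as np

def merge_by_saliency(tokens, saliency, keep):
    """Generic saliency-driven token reduction (illustrative only).
    tokens: [n, d]; saliency: [n]; keep: number of tokens to retain.
    Low-saliency tokens are folded into their most similar kept token."""
    order = np.argsort(-saliency)
    kept_idx, drop_idx = order[:keep], order[keep:]
    kept = tokens[kept_idx].copy()
    counts = np.ones(keep)
    for i in drop_idx:
        j = int(np.argmax(kept @ tokens[i]))                           # most similar kept token
        kept[j] = (kept[j] * counts[j] + tokens[i]) / (counts[j] + 1)  # running mean
        counts[j] += 1
    return kept

tokens = np.random.randn(8, 4)
saliency = np.random.rand(8)                  # stand-in for a learned saliency score
print(merge_by_saliency(tokens, saliency, keep=4).shape)   # (4, 4)
```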
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- Efficient Time Series Processing for Transformers and State-Space Models through Token Merging [44.27818172708914]
Token merging has been shown to considerably improve the throughput of vision transformer architectures.
We introduce local merging, a domain-specific token merging algorithm that selectively combines tokens within a local neighborhood.
On the recently proposed Chronos foundation model, we achieve accelerations up to 5400% with only minor accuracy degradations.
arXiv Detail & Related papers (2024-05-28T08:28:18Z)
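A minimal sketch of the general idea of local merging, i.e., combining tokens only within a local neighborhood; the window size, similarity test, and all-or-nothing averaging below are assumptions rather than the paper's rule:

```python
import numpy as np

def local_merge(tokens, window=4, sim_threshold=0.9):
    """Generic local token merging (illustrative only): within each
    non-overlapping window, average the tokens if they are mutually
    similar, otherwise keep them unmerged. tokens: [n, d]."""
    out = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        unit = chunk / (np.linalg.norm(chunk, axis=1, keepdims=True) + 1e-8)
        min_cos = (unit @ unit.T).min()                     # worst pairwise cosine in the window
        if min_cos >= sim_threshold:
            out.append(chunk.mean(axis=0, keepdims=True))   # merge the whole window
        else:
            out.append(chunk)                               # keep the window unmerged
    return np.concatenate(out, axis=0)

series_tokens = np.random.randn(16, 8)
print(local_merge(series_tokens).shape)    # at most (16, 8); fewer rows where windows merged
```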
- TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction [61.295716741720284]
TokenUnify is a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction.
In conjunction with TokenUnify, we have assembled a large-scale electron microscopy (EM) image dataset with ultra-high resolution.
This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date.
arXiv Detail & Related papers (2024-05-27T05:45:51Z)
- Learned Thresholds Token Merging and Pruning for Vision Transformers [5.141687309207561]
This paper introduces Learned Thresholds Token Merging and Pruning (LTMP), a novel approach that leverages the strengths of both token merging and token pruning.
We demonstrate our approach with extensive experiments on vision transformers on the ImageNet classification task.
arXiv Detail & Related papers (2023-07-20T11:30:12Z)
- Token-Label Alignment for Vision Transformers [93.58540411138164]
Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs).
We identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies.
We propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token.
arXiv Detail & Related papers (2022-10-12T17:54:32Z)
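The underlying idea, keeping a label per token and transporting it with the same transformation applied to the tokens, can be sketched generically; the row-stochastic mixing matrix below is an assumption standing in for whatever token-mixing operations the network applies, not TL-Align's exact procedure:

```python
import numpy as np

# Illustrative token-label propagation: when tokens are linearly mixed,
# propagate the per-token label distributions with the same mixing weights,
# so every token keeps an aligned label.
n, d, num_classes = 6, 8, 3
tokens = np.random.randn(n, d)
token_labels = np.eye(num_classes)[np.array([0, 0, 0, 1, 1, 2])]   # one-hot label per token

mix = np.random.rand(n, n)
mix /= mix.sum(axis=1, keepdims=True)     # row-stochastic mixing weights (assumed)

mixed_tokens = mix @ tokens
mixed_labels = mix @ token_labels         # labels follow the same mixing as the tokens
print(mixed_labels.sum(axis=1))           # each row still sums to 1 (a valid label distribution)
```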
- Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP [52.25478388220691]
Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks.
They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers.
We propose a new positional spatial gating unit (PoSGU) to efficiently encode the cross-token relations for token mixing.
arXiv Detail & Related papers (2022-07-15T04:18:06Z)
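A minimal sketch of the general idea of parameterizing cross-token mixing weights by relative position only; the single scalar weight per offset below is an assumption and far simpler than the actual PoSGU unit:

```python
import numpy as np

def relative_position_mixing(tokens, rel_table):
    """Illustrative token mixing whose weight depends only on the relative
    position of two tokens: weight(i, j) = rel_table[i - j]. This sketches
    the parameterization idea, not the PoSGU module itself.
    tokens: [n, d]; rel_table: [2n - 1] learnable weights."""
    n = tokens.shape[0]
    idx = np.arange(n)
    mix = rel_table[idx[:, None] - idx[None, :] + n - 1]   # [n, n] mixing matrix
    return mix @ tokens

n, d = 16, 8
tokens = np.random.randn(n, d)
rel_table = np.random.randn(2 * n - 1)     # one weight per relative offset (assumed)
print(relative_position_mixing(tokens, rel_table).shape)   # (16, 8)
```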
- Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers [55.90468016961356]
We propose an efficient token mixer that learns to mix in the Fourier domain.
AFNO is based on a principled foundation of operator learning.
It can handle a sequence size of 65k and outperforms other efficient self-attention mechanisms.
arXiv Detail & Related papers (2021-11-24T05:44:31Z)
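The core idea, mixing tokens in the frequency domain so the cost scales as O(n log n) in the number of tokens rather than O(n^2), can be sketched as below; the per-frequency diagonal weighting is a simplification (AFNO's actual operator is more structured), so treat this as an illustration only:

```python
import numpy as np

def fourier_token_mixing(tokens, freq_weights):
    """Illustrative Fourier-domain token mixing: FFT along the token axis,
    scale each frequency by a (learnable) complex weight, inverse FFT.
    Cost is dominated by the FFTs, i.e., O(n log n) in the token count."""
    spec = np.fft.rfft(tokens, axis=0)            # [n//2 + 1, d], complex spectrum
    spec = spec * freq_weights                    # per-frequency, per-channel scaling
    return np.fft.irfft(spec, n=tokens.shape[0], axis=0)

n, d = 64, 16
tokens = np.random.randn(n, d)
freq_weights = np.random.randn(n // 2 + 1, d) + 1j * np.random.randn(n // 2 + 1, d)
print(fourier_token_mixing(tokens, freq_weights).shape)    # (64, 16)
```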
- CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [17.709880544501758]
We propose a dual-branch transformer to combine image patches of different sizes to produce stronger image features.
Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity.
Our proposed cross-attention requires only linear (rather than quadratic) computational and memory complexity.
arXiv Detail & Related papers (2021-03-27T13:03:17Z)
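The linear-complexity claim follows from using a single fusion token as the query: attention with one query against n keys costs O(n·d) rather than O(n^2·d). A minimal sketch under that assumption (no projections, heads, or residual paths):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def single_query_cross_attention(query_token, other_branch_tokens):
    """One query token attends to the other branch's patch tokens.
    With a single query the cost is linear in the number of keys."""
    d = query_token.shape[0]
    scores = other_branch_tokens @ query_token / np.sqrt(d)   # [n]
    attn = softmax(scores)                                    # [n]
    return attn @ other_branch_tokens                         # [d]

d = 32
cls_small = np.random.randn(d)             # e.g., a CLS token from one branch
tokens_large = np.random.randn(100, d)     # patch tokens from the other branch
print(single_query_cross_attention(cls_small, tokens_large).shape)   # (32,)
```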