Token Fusion: Bridging the Gap between Token Pruning and Token Merging
- URL: http://arxiv.org/abs/2312.01026v1
- Date: Sat, 2 Dec 2023 04:29:19 GMT
- Title: Token Fusion: Bridging the Gap between Token Pruning and Token Merging
- Authors: Minchul Kim, Shangqian Gao, Yen-Chang Hsu, Yilin Shen, Hongxia Jin
- Abstract summary: Vision Transformers (ViTs) have emerged as powerful backbones in computer vision, outperforming many traditional CNNs.
However, their computational overhead, largely attributed to the self-attention mechanism, makes deployment on resource-constrained edge devices challenging.
We introduce "Token Fusion" (ToFu), a method that amalgamates the benefits of both token pruning and token merging.
- Score: 71.84591084401458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have emerged as powerful backbones in computer
vision, outperforming many traditional CNNs. However, their computational
overhead, largely attributed to the self-attention mechanism, makes deployment
on resource-constrained edge devices challenging. Multiple solutions rely on
token pruning or token merging. In this paper, we introduce "Token Fusion"
(ToFu), a method that amalgamates the benefits of both token pruning and token
merging. Token pruning proves advantageous when the model exhibits sensitivity
to input interpolations, while token merging is effective when the model
manifests close-to-linear responses to inputs. We combine these observations into a new
scheme called Token Fusion. Moreover, we address a limitation of average
merging, which does not preserve the intrinsic feature norm and thereby causes
distributional shifts. To mitigate this, we introduce MLERP merging, a variant
of the SLERP technique, tailored to merge multiple tokens while maintaining the
norm distribution. ToFu is versatile, applicable to ViTs with or without
additional training. Our empirical evaluations indicate that ToFu establishes
new benchmarks in both classification and image generation tasks concerning
computational efficiency and model accuracy.
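The most concrete technical point in the abstract is that plain average merging shrinks token norms and shifts the feature distribution, which MLERP merging is meant to avoid. The short NumPy sketch below illustrates that effect against a norm-preserving merge; the rule used here (average the token directions, then rescale to the mean input norm) is an assumption for illustration and may differ from the paper's exact MLERP formulation.

```python
import numpy as np

def average_merge(tokens):
    """Plain average of a group of token embeddings (tokens: [k, d])."""
    return tokens.mean(axis=0)

def norm_preserving_merge(tokens):
    """Illustrative stand-in for MLERP: average the token *directions*,
    then rescale the result to the mean input norm. The paper's exact
    MLERP rule may differ."""
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)       # [k, 1]
    mean_dir = (tokens / norms).mean(axis=0)                    # average unit vector
    mean_dir /= np.linalg.norm(mean_dir) + 1e-8                 # renormalize
    return mean_dir * norms.mean()                              # restore the norm scale

rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 16))                               # two tokens to merge
print(np.linalg.norm(tokens, axis=1).mean())                    # mean input norm
print(np.linalg.norm(average_merge(tokens)))                    # typically smaller
print(np.linalg.norm(norm_preserving_merge(tokens)))            # matches the mean input norm
```

In ToFu itself, the per-layer choice between pruning and merging is driven by how linearly the model responds to interpolated inputs; the sketch above only covers the merging side.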
Related papers
- Video Token Merging for Long-form Video Understanding [17.59960070514554]
We propose a learnable video token merging algorithm that dynamically merges tokens based on their saliency.
Our approach significantly reduces memory costs by 84% and boosts throughput by approximately 6.89 times compared to baseline algorithms.
arXiv Detail & Related papers (2024-10-31T09:55:32Z)
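As a rough illustration of saliency-driven token merging (not the paper's algorithm; the saliency scores and the fold-into-the-nearest-kept-token rule below are assumptions):

```python
import numpy as np

def merge_by_saliency(tokens, saliency, keep):
    """Generic saliency-driven token reduction (illustrative only).
    tokens: [n, d]; saliency: [n]; keep: number of tokens to retain.
    Low-saliency tokens are folded into their most similar kept token."""
    order = np.argsort(-saliency)
    kept_idx, drop_idx = order[:keep], order[keep:]
    kept = tokens[kept_idx].copy()
    counts = np.ones(keep)
    for i in drop_idx:
        j = int(np.argmax(kept @ tokens[i]))                           # most similar kept token
        kept[j] = (kept[j] * counts[j] + tokens[i]) / (counts[j] + 1)  # running mean
        counts[j] += 1
    return kept

tokens = np.random.randn(8, 4)
saliency = np.random.rand(8)                  # stand-in for a learned saliency score
print(merge_by_saliency(tokens, saliency, keep=4).shape)   # (4, 4)
```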
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- Efficient Time Series Processing for Transformers and State-Space Models through Token Merging [44.27818172708914]
Token merging has been shown to considerably improve the throughput of vision transformer architectures.
We introduce local merging, a domain-specific token merging algorithm that selectively combines tokens within a local neighborhood.
On the recently proposed Chronos foundation model, we achieve accelerations up to 5400% with only minor accuracy degradations.
arXiv Detail & Related papers (2024-05-28T08:28:18Z)
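A minimal sketch of the general idea of local merging, i.e., combining tokens only within a local neighborhood; the window size, similarity test, and all-or-nothing averaging below are assumptions rather than the paper's rule:

```python
import numpy as np

def local_merge(tokens, window=4, sim_threshold=0.9):
    """Generic local token merging (illustrative only): within each
    non-overlapping window, average the tokens if they are mutually
    similar, otherwise keep them unmerged. tokens: [n, d]."""
    out = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        unit = chunk / (np.linalg.norm(chunk, axis=1, keepdims=True) + 1e-8)
        min_cos = (unit @ unit.T).min()                     # worst pairwise cosine in the window
        if min_cos >= sim_threshold:
            out.append(chunk.mean(axis=0, keepdims=True))   # merge the whole window
        else:
            out.append(chunk)                               # keep the window unmerged
    return np.concatenate(out, axis=0)

series_tokens = np.random.randn(16, 8)
print(local_merge(series_tokens).shape)    # at most (16, 8); fewer rows where windows merged
```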
- TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction [61.295716741720284]
TokenUnify is a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction.
In conjunction with TokenUnify, we have assembled a large-scale electron microscopy (EM) image dataset with ultra-high resolution.
This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date.
arXiv Detail & Related papers (2024-05-27T05:45:51Z)
- Learned Thresholds Token Merging and Pruning for Vision Transformers [5.141687309207561]
This paper introduces Learned Thresholds Token Merging and Pruning (LTMP), a novel approach that leverages the strengths of both token merging and token pruning.
We demonstrate our approach with extensive experiments on vision transformers on the ImageNet classification task.
arXiv Detail & Related papers (2023-07-20T11:30:12Z)
- Token-Label Alignment for Vision Transformers [93.58540411138164]
Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs).
We identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies.
We propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token.
arXiv Detail & Related papers (2022-10-12T17:54:32Z)
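The underlying idea, keeping a label per token and transporting it with the same transformation applied to the tokens, can be sketched generically; the row-stochastic mixing matrix below is an assumption standing in for whatever token-mixing operations the network applies, not TL-Align's exact procedure:

```python
import numpy as np

# Illustrative token-label propagation: when tokens are linearly mixed,
# propagate the per-token label distributions with the same mixing weights,
# so every token keeps an aligned label.
n, d, num_classes = 6, 8, 3
tokens = np.random.randn(n, d)
token_labels = np.eye(num_classes)[np.array([0, 0, 0, 1, 1, 2])]   # one-hot label per token

mix = np.random.rand(n, n)
mix /= mix.sum(axis=1, keepdims=True)     # row-stochastic mixing weights (assumed)

mixed_tokens = mix @ tokens
mixed_labels = mix @ token_labels         # labels follow the same mixing as the tokens
print(mixed_labels.sum(axis=1))           # each row still sums to 1 (a valid label distribution)
```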
- Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP [52.25478388220691]
Vision multi-layer perceptrons (MLPs) have shown promising performance in computer vision tasks.
They use token-mixing layers to capture cross-token interactions, as opposed to the multi-head self-attention mechanism used by Transformers.
We propose a new positional spatial gating unit (PoSGU) to efficiently encode the cross-token relations for token mixing.
arXiv Detail & Related papers (2022-07-15T04:18:06Z)
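A minimal sketch of the general idea of parameterizing cross-token mixing weights by relative position only; the single scalar weight per offset below is an assumption and far simpler than the actual PoSGU unit:

```python
import numpy as np

def relative_position_mixing(tokens, rel_table):
    """Illustrative token mixing whose weight depends only on the relative
    position of two tokens: weight(i, j) = rel_table[i - j]. This sketches
    the parameterization idea, not the PoSGU module itself.
    tokens: [n, d]; rel_table: [2n - 1] learnable weights."""
    n = tokens.shape[0]
    idx = np.arange(n)
    mix = rel_table[idx[:, None] - idx[None, :] + n - 1]   # [n, n] mixing matrix
    return mix @ tokens

n, d = 16, 8
tokens = np.random.randn(n, d)
rel_table = np.random.randn(2 * n - 1)     # one weight per relative offset (assumed)
print(relative_position_mixing(tokens, rel_table).shape)   # (16, 8)
```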
- Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers [55.90468016961356]
We propose an efficient token mixer that learns to mix in the Fourier domain.
AFNO is based on a principled foundation of operator learning.
It can handle a sequence size of 65k and outperforms other efficient self-attention mechanisms.
arXiv Detail & Related papers (2021-11-24T05:44:31Z)
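The core idea, mixing tokens in the frequency domain so the cost scales as O(n log n) in the number of tokens rather than O(n^2), can be sketched as below; the per-frequency diagonal weighting is a simplification (AFNO's actual operator is more structured), so treat this as an illustration only:

```python
import numpy as np

def fourier_token_mixing(tokens, freq_weights):
    """Illustrative Fourier-domain token mixing: FFT along the token axis,
    scale each frequency by a (learnable) complex weight, inverse FFT.
    Cost is dominated by the FFTs, i.e., O(n log n) in the token count."""
    spec = np.fft.rfft(tokens, axis=0)            # [n//2 + 1, d], complex spectrum
    spec = spec * freq_weights                    # per-frequency, per-channel scaling
    return np.fft.irfft(spec, n=tokens.shape[0], axis=0)

n, d = 64, 16
tokens = np.random.randn(n, d)
freq_weights = np.random.randn(n // 2 + 1, d) + 1j * np.random.randn(n // 2 + 1, d)
print(fourier_token_mixing(tokens, freq_weights).shape)    # (64, 16)
```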
- CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [17.709880544501758]
We propose a dual-branch transformer to combine image patches of different sizes to produce stronger image features.
Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity.
Our proposed cross-attention requires only linear (rather than quadratic) computational and memory complexity.
arXiv Detail & Related papers (2021-03-27T13:03:17Z)
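The linear-complexity claim follows from using a single fusion token as the query: attention with one query against n keys costs O(n·d) rather than O(n^2·d). A minimal sketch under that assumption (no projections, heads, or residual paths):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def single_query_cross_attention(query_token, other_branch_tokens):
    """One query token attends to the other branch's patch tokens.
    With a single query the cost is linear in the number of keys."""
    d = query_token.shape[0]
    scores = other_branch_tokens @ query_token / np.sqrt(d)   # [n]
    attn = softmax(scores)                                    # [n]
    return attn @ other_branch_tokens                         # [d]

d = 32
cls_small = np.random.randn(d)             # e.g., a CLS token from one branch
tokens_large = np.random.randn(100, d)     # patch tokens from the other branch
print(single_query_cross_attention(cls_small, tokens_large).shape)   # (32,)
```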