Tree Cross Attention
- URL: http://arxiv.org/abs/2309.17388v2
- Date: Fri, 1 Mar 2024 05:15:38 GMT
- Title: Tree Cross Attention
- Authors: Leo Feng, Frederick Tung, Hossein Hajimirsadeghi, Yoshua Bengio,
Mohamed Osama Ahmed
- Abstract summary: Tree Cross Attention (TCA) is a module based on Cross Attention that only retrieves information from a logarithmic $\mathcal{O}(\log(N))$ number of tokens for performing inference.
We show that TCA performs comparably to Cross Attention across various classification and uncertainty regression tasks while being significantly more token-efficient.
- Score: 59.8891512435847
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross Attention is a popular method for retrieving information from a set of
context tokens for making predictions. At inference time, for each prediction,
Cross Attention scans the full set of $\mathcal{O}(N)$ tokens. In practice,
however, often only a small subset of the tokens is required for good performance.
Methods such as Perceiver IO are cheap at inference as they distill the
information into a smaller set of $L < N$ latent tokens on which cross
attention is then applied, resulting in only $\mathcal{O}(L)$ complexity.
However, in practice, as the number of input tokens and the amount of
information to distill increases, the number of latent tokens needed also
increases significantly. In this work, we propose Tree Cross Attention (TCA) -
a module based on Cross Attention that only retrieves information from a
logarithmic $\mathcal{O}(\log(N))$ number of tokens for performing inference.
TCA organizes the data in a tree structure and performs a tree search at
inference time to retrieve the relevant tokens for prediction. Leveraging TCA,
we introduce ReTreever, a flexible architecture for token-efficient inference.
We show empirically that Tree Cross Attention (TCA) performs comparably to
Cross Attention across various classification and uncertainty regression tasks
while being significantly more token-efficient. Furthermore, we compare
ReTreever against Perceiver IO, showing significant gains while using the same
number of tokens for inference.
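The retrieval pattern described above can be sketched in a few lines. The code below is a minimal illustration only, not the paper's method: the mean-pooled node summaries, the greedy dot-product descent, and attending over the vectors collected along the search path are simplifying assumptions made here, whereas the actual TCA/ReTreever learns how to organize and traverse the tree.

```python
# Minimal sketch of the idea behind Tree Cross Attention: instead of attending
# to all N context tokens, organise them in a balanced binary tree and retrieve
# only the O(log N) vectors visited along a root-to-leaf search path.
# Assumptions (not from the paper): internal nodes summarise their subtree by a
# mean of token embeddings, and the search descends greedily toward the child
# whose summary has the larger dot product with the query.
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, keys, values):
    """Standard cross attention: the query scans all provided tokens."""
    scores = query @ keys.T / np.sqrt(query.shape[-1])   # (num_tokens,)
    weights = softmax(scores)
    return weights @ values


def build_tree(tokens):
    """Recursively split tokens in half; each node stores a mean summary."""
    if len(tokens) == 1:
        return {"summary": tokens[0], "children": None}
    mid = len(tokens) // 2
    left, right = build_tree(tokens[:mid]), build_tree(tokens[mid:])
    return {"summary": tokens.mean(axis=0), "children": (left, right)}


def tree_retrieve(query, node):
    """Greedy root-to-leaf descent; collect the summary of each sibling not
    taken plus the final leaf token, i.e. O(log N) retrieved vectors."""
    retrieved = []
    while node["children"] is not None:
        left, right = node["children"]
        go_left = query @ left["summary"] >= query @ right["summary"]
        taken, skipped = (left, right) if go_left else (right, left)
        retrieved.append(skipped["summary"])   # summary of the pruned branch
        node = taken
    retrieved.append(node["summary"])          # the selected leaf token
    return np.stack(retrieved)


rng = np.random.default_rng(0)
d, N = 16, 64
context = rng.normal(size=(N, d))
query = rng.normal(size=(d,))

tree = build_tree(context)
subset = tree_retrieve(query, tree)            # ~log2(N) + 1 vectors instead of N
full = cross_attention(query, context, context)
approx = cross_attention(query, subset, subset)
print(subset.shape, full.shape, approx.shape)  # (7, 16) (16,) (16,)
```

With N = 64 context tokens the search touches only 7 vectors (one per level plus the leaf), consistent with the claimed O(log N) scaling, whereas standard cross attention scans all 64.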
Related papers
- Tokens on Demand: Token Condensation as Training-free Test-time Adaptation [43.09801987385207]
Token Condensation as Adaptation (TCA) is a training-free approach designed to mitigate distribution shifts encountered by vision-language models (VLMs) during test-time inference.
As the first method to explore token efficiency in test-time adaptation, TCA consistently demonstrates superior performance across cross-dataset and out-of-distribution adaptation tasks.
arXiv Detail & Related papers (2024-10-16T07:13:35Z) - ToSA: Token Selective Attention for Efficient Vision Transformers [50.13756218204456]
ToSA is a token selective attention approach that can identify tokens that need to be attended to as well as those that can skip a transformer layer.
We show that ToSA can significantly reduce computation costs while maintaining accuracy on the ImageNet classification benchmark.
arXiv Detail & Related papers (2024-06-13T05:17:21Z) - Semantic Equitable Clustering: A Simple, Fast and Effective Strategy for Vision Transformer [57.37893387775829]
We introduce a fast and balanced clustering method, named Semantic Equitable Clustering (SEC).
SEC clusters tokens based on their global semantic relevance in an efficient, straightforward manner.
We propose a versatile vision backbone, SecViT, which attains an impressive 84.2% image classification accuracy with only 27M parameters and 4.4G FLOPs.
arXiv Detail & Related papers (2024-05-22T04:49:00Z) - LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation [37.72775203647514]
This paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information and improve inference speed.
By employing Dual Cross-Attention (DCA) in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT with various sizes.
Experimental results in classification and dense prediction tasks show that LeMeViT has a significant $1.7\times$ speedup, fewer parameters, and competitive performance compared to the baseline models.
arXiv Detail & Related papers (2024-05-16T03:26:06Z) - Let's Think Dot by Dot: Hidden Computation in Transformer Language Models [30.972412126012884]
Chain-of-thought responses from language models improve performance across most benchmarks.
We show that transformers can use meaningless filler tokens in place of a chain of thought to solve two hard algorithmic tasks.
We find that learning to use filler tokens is difficult and requires specific, dense supervision to converge.
arXiv Detail & Related papers (2024-04-24T09:30:00Z) - Object Recognition as Next Token Prediction [99.40793702627396]
We present an approach to pose object recognition as next token prediction.
The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels.
arXiv Detail & Related papers (2023-12-04T18:58:40Z) - AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z) - Linear-Time Modeling of Linguistic Structure: An Order-Theoretic Perspective [97.57162770792182]
Tasks that model the relation between pairs of tokens in a string are a vital part of understanding natural language.
We show that these exhaustive comparisons can be avoided, and, moreover, the complexity can be reduced to linear by casting the relation between tokens as a partial order over the string.
Our method predicts real numbers for each token in a string in parallel and sorts the tokens accordingly, resulting in total orders of the tokens in the string.
arXiv Detail & Related papers (2023-05-24T11:47:35Z) - Token Sparsification for Faster Medical Image Segmentation [37.25161294917211]
We reformulate segmentation as a sparse encoding -> token completion -> dense decoding (SCD) pipeline.
STP predicts importance scores with a lightweight sub-network and samples the top-K tokens (a minimal sketch of this score-and-select pattern follows the list).
MTA restores a full token sequence by assembling both sparse output tokens and pruned multi-layer intermediate ones.
arXiv Detail & Related papers (2023-03-11T23:59:13Z)
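Several entries above (ToSA, and the STP stage of the token sparsification pipeline) share a score-and-select pattern: a lightweight scorer ranks tokens and only the top-K are kept for the expensive attention layers. The sketch below illustrates that pattern under stated assumptions; the linear scorer, the token count, and the value of K are placeholders, not the papers' actual sub-networks or settings.

```python
# Minimal sketch of score-and-select token sparsification: a lightweight scorer
# assigns an importance score to every token and only the top-K tokens are kept
# for the expensive attention layers. The linear scorer and the choice of K are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, N, K = 32, 196, 49                        # token dim, token count, tokens kept

tokens = rng.normal(size=(N, d))             # e.g. flattened image patch embeddings
scorer = rng.normal(size=(d,)) / np.sqrt(d)  # stand-in for a learned scoring sub-network

scores = tokens @ scorer                     # one importance score per token
keep = np.argsort(scores)[-K:]               # indices of the top-K tokens
sparse_tokens = tokens[np.sort(keep)]        # keep selected tokens in original order

print(sparse_tokens.shape)                   # (49, 32): only K of N tokens proceed
```

The kept indices are re-sorted so the selected tokens preserve their original sequence order, which matters when positional information is implicit in that order.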