Related papers: TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

URL: http://arxiv.org/abs/2106.11297v1
Date: Mon, 21 Jun 2021 17:55:59 GMT
Title: TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?
Authors: Michael S. Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova
Abstract summary: We introduce a novel visual representation learning which relies on a handful of adaptively learned tokens. Our experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition tasks.
Score: 89.17394772676819
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we introduce a novel visual representation learning which relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks. Instead of relying on hand-designed splitting strategies to obtain visual tokens and processing a large number of densely sampled patches for attention, our approach learns to mine important tokens in visual data. This results in efficiently and effectively finding a few important visual tokens and enables modeling of pairwise attention between such tokens, over a longer temporal horizon for videos, or the spatial content in images. Our experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition tasks. Importantly, due to our tokens being adaptive, we accomplish competitive results at significantly reduced compute amount.

Related papers

Vision Remember: Alleviating Visual Forgetting in Efficient MLLM with Vision Feature Resample [30.263537686102016]
Redundant vision tokens consume a significant amount of computational memory and resources.<n>We propose Vision Remember, which is inserted between the LLM decoder layers to allow vision tokens to re-memorize vision features.<n>During the resampling process, each vision token only attends to a local region in vision features, which is referred to as saliency-enhancing local attention.
arXiv Detail & Related papers (2025-06-04T13:22:35Z)
ToDRE: Visual Token Pruning via Diversity and Task Awareness for Efficient Large Vision-Language Models [59.47738955960352]
ToDRE is a two-stage and training-free token compression framework.<n>It achieves superior performance by pruning tokens based on token Diversity and token-task RElevance.
arXiv Detail & Related papers (2025-05-24T15:47:49Z)
Beyond Intermediate States: Explaining Visual Redundancy through Language [7.275188652473603]
Multi-modal Large Langue Models (MLLMs) often process thousands of visual tokens. Visual tokens with low ViT-[cls] association and low text-to-image attention scores can contain recognizable information. We develop a more reliable method for identifying and pruning redundant visual tokens.
arXiv Detail & Related papers (2025-03-26T13:38:10Z)
[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs [66.5266435598799]
Multi-language Large Language Models (MLLMs) have recently demonstrated strong performance across a wide range of vision tasks. However, their efficient deployment remains a substantial challenge due to high computational costs and memory requirements. We introduce a simple yet effective method for train-free visual compression, called VTC- compression.
arXiv Detail & Related papers (2024-12-08T05:29:39Z)
Importance-Based Token Merging for Efficient Image and Video Generation [41.94334394794811]
We show that preserving high-information tokens during merging significantly improves sample quality. We propose an importance-based token merging method that prioritizes the most critical tokens in computational resource allocation.
arXiv Detail & Related papers (2024-11-23T02:01:49Z)
FoPru: Focal Pruning for Efficient Large Vision-Language Models [11.36025001578531]
We propose Focal Pruning (FoPru), a training-free method that prunes visual tokens based on the attention-based token significance derived from the vision encoder. Our method can prune a large number of redundant tokens while maintaining high accuracy, leading to significant improvements in inference efficiency.
arXiv Detail & Related papers (2024-11-21T14:22:38Z)
ElasticTok: Adaptive Tokenization for Image and Video [109.75935878130582]
We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens. During inference, ElasticTok can dynamically allocate tokens when needed. Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
arXiv Detail & Related papers (2024-10-10T20:54:15Z)
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach to reduce vision compute by leveraging redundant vision tokens "skipping layers" rather than decreasing the number of vision tokens. Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
arXiv Detail & Related papers (2024-08-29T17:21:58Z)
Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning [41.81009725976217]
We provide semantically-meaningful visual tokens to transformer encoders within a vision-language pre-training framework. We demonstrate notable improvements over ViTs in learned representation quality across text-to-image and image-to-text retrieval tasks.
arXiv Detail & Related papers (2024-05-26T01:46:22Z)
LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation [37.72775203647514]
This paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information and improve inference speed. By employing Dual Cross-Attention (DCA) in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT with various sizes. Experimental results in classification and dense prediction tasks show that LeMeViT has a significant $1.7 times$ speedup, fewer parameters, and competitive performance compared to the baseline models.
arXiv Detail & Related papers (2024-05-16T03:26:06Z)
Subobject-level Image Tokenization [60.80949852899857]
Patch-based image tokenization ignores the morphology of the visual world. Inspired by subword tokenization, we introduce subobject-level adaptive token segmentation. We show that subobject tokenization enables faster convergence and better generalization while using fewer visual tokens.
arXiv Detail & Related papers (2024-02-22T06:47:44Z)
Token-Label Alignment for Vision Transformers [93.58540411138164]
Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs) We identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies. We propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token.
arXiv Detail & Related papers (2022-10-12T17:54:32Z)
Self-attention on Multi-Shifted Windows for Scene Segmentation [14.47974086177051]
We explore the effective use of self-attention within multi-scale image windows to learn descriptive visual features. We propose three different strategies to aggregate these feature maps to decode the feature representation for dense prediction. Our models achieve very promising performance on four public scene segmentation datasets.
arXiv Detail & Related papers (2022-07-10T07:36:36Z)
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to representations BEfore Fusing (ALBEF) through cross-modal attention. We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model. ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.