ElasticTok: Adaptive Tokenization for Image and Video
- URL: http://arxiv.org/abs/2410.08368v2
- Date: Sun, 02 Feb 2025 19:28:27 GMT
- Title: ElasticTok: Adaptive Tokenization for Image and Video
- Authors: Wilson Yan, Volodymyr Mnih, Aleksandra Faust, Matei Zaharia, Pieter Abbeel, Hao Liu
- Abstract summary: We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens.
During inference, ElasticTok can dynamically allocate tokens when needed.
Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
- Score: 109.75935878130582
- Abstract: Efficient video tokenization remains a key bottleneck in learning general purpose vision models that are capable of processing long video sequences. Prevailing approaches are restricted to encoding videos to a fixed number of tokens, where too few tokens will result in overly lossy encodings, and too many tokens will result in prohibitively long sequence lengths. In this work, we introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens. To enable this in a computationally scalable way, we propose a masking technique that drops a random number of tokens at the end of each frame's token encoding. During inference, ElasticTok can dynamically allocate tokens when needed -- more complex data can leverage more tokens, while simpler data only needs a few tokens. Our empirical evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage, paving the way for future development of more powerful multimodal models, world models, and agents.
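As a concrete reading of the abstract, here is a minimal sketch of the two mechanisms it describes: training-time tail masking and inference-time token allocation. `encode`, `decode`, the budget schedule, and the MSE threshold are illustrative assumptions, not the paper's implementation.

```python
import torch

def random_tail_mask(tokens: torch.Tensor, min_keep: int = 1) -> torch.Tensor:
    """Training-time masking: for each frame, keep a random-length prefix of
    its token encoding and drop the tail, as the abstract describes.

    tokens: (batch, num_tokens, dim) encoding of one frame.
    Returns a boolean keep-mask of shape (batch, num_tokens).
    """
    b, n, _ = tokens.shape
    keep = torch.randint(min_keep, n + 1, (b,))       # tokens to keep per frame
    positions = torch.arange(n).unsqueeze(0)          # (1, n)
    return positions < keep.unsqueeze(1)              # (b, n) prefix mask

def allocate_tokens(encode, decode, frame, budgets, tol):
    """Inference-time allocation: return the smallest token budget whose
    reconstruction error falls below `tol`. The linear scan and the MSE
    criterion are assumptions; the paper may search differently."""
    tokens = encode(frame)
    for k in budgets:                                  # e.g. [32, 64, 128, ...]
        recon = decode(tokens[:, :k])
        if torch.mean((recon - frame) ** 2).item() < tol:
            return k                                   # simple data -> few tokens
    return tokens.shape[1]                             # complex data -> full length
```

Because training always drops a suffix, every prefix of an encoding is itself a valid encoding, which is what makes the inference-time scan over budgets cheap.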
Related papers
- FlexTok: Resampling Images into 1D Token Sequences of Flexible Length [16.76602756308683]
We introduce FlexTok, a tokenizer that projects 2D images into variable-length, ordered 1D token sequences.
We evaluate our approach in an autoregressive generation setting using a simple GPT-style Transformer.
arXiv Detail & Related papers (2025-02-19T18:59:44Z)
- Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction [93.69757398746017]
CoordTok is a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos.
CoordTok encodes a video into factorized triplane representations and reconstructs patches that correspond to randomly sampled $(x,y,t)$ coordinates.
arXiv Detail & Related papers (2024-11-22T06:50:44Z)
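A rough sketch of the coordinate-based lookup that the CoordTok summary describes: query three factorized feature planes at sampled $(x,y,t)$ points and fuse the results for patch decoding. The plane layout, the sum fusion, and the axis conventions are assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def query_triplane(planes: dict, coords: torch.Tensor) -> torch.Tensor:
    """Look up features for N points with normalized (x, y, t) in [-1, 1].

    planes: {"xy", "xt", "yt"} -> feature maps of shape (1, C, H, W), where
    the key names the two coordinates indexing the plane (first along width,
    second along height, matching grid_sample). Layout is illustrative.
    """
    x, y, t = coords[:, 0], coords[:, 1], coords[:, 2]
    pairs = {"xy": (x, y), "xt": (x, t), "yt": (y, t)}
    feats = []
    for key, (u, v) in pairs.items():
        grid = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)      # (1, N, 1, 2)
        sampled = F.grid_sample(planes[key], grid, align_corners=True)
        feats.append(sampled.squeeze(0).squeeze(-1).T)            # (N, C)
    return sum(feats)  # fused point features, to be decoded into a patch
```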
- Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model [45.01871133425388]
We propose Multi-stage Token Dropping (MustDrop), which measures the importance of each token over its whole lifecycle.
MustDrop reduces FLOPs on LLaVA by about 88.5% at a compression ratio of 92.2% while maintaining comparable accuracy.
arXiv Detail & Related papers (2024-11-16T13:45:33Z)
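As a hedged illustration of the token dropping MustDrop builds on: keep only the highest-scoring visual tokens. The attention-mass score and single-stage form are assumptions; the paper's actual multi-stage criteria are not reproduced here.

```python
import torch

def drop_unimportant_tokens(tokens, scores, keep_ratio=0.078):
    """Keep only the highest-scoring visual tokens, preserving their order.

    tokens: (B, N, D); scores: (B, N) per-token importance (e.g. attention
    mass received, an assumption). keep_ratio ~= 0.078 corresponds to the
    92.2% compression ratio quoted in the summary above.
    """
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices.sort(dim=1).values       # original order
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
```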
- Video Token Merging for Long-form Video Understanding [17.59960070514554]
We propose a learnable video token merging algorithm that dynamically merges tokens based on their saliency.
Our approach significantly reduces memory costs by 84% and boosts throughput by approximately 6.89 times compared to baseline algorithms.
arXiv Detail & Related papers (2024-10-31T09:55:32Z)
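A simplified stand-in for the saliency-driven merging in the summary above: pool fixed windows of tokens with saliency-softmax weights. The paper's algorithm is learnable and merges dynamically rather than on a fixed grid.

```python
import torch

def saliency_weighted_merge(tokens, saliency, window=2):
    """Merge each window of `window` consecutive tokens into one token,
    weighting by saliency so salient content dominates the merged token.

    tokens: (N, D); saliency: (N,); assumes N is divisible by `window`.
    """
    n, d = tokens.shape
    t = tokens.view(n // window, window, d)
    w = torch.softmax(saliency.view(n // window, window), dim=1)  # per-window weights
    return (w.unsqueeze(-1) * t).sum(dim=1)                       # (N // window, D)
```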
- Object Recognition as Next Token Prediction [99.40793702627396]
We present an approach to pose object recognition as next token prediction.
The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels.
arXiv Detail & Related papers (2023-12-04T18:58:40Z)
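The mechanism in the summary above reduces to next-token decoding of label text conditioned on image embeddings. The `decoder` interface below is hypothetical, and this single-label greedy sketch omits the paper's customized decoding for ranking multiple labels.

```python
import torch

@torch.no_grad()
def predict_label_tokens(decoder, image_embeds, bos_id, eos_id, max_len=8):
    """Greedily decode a label, one text token at a time, from image embeddings.
    `decoder(image_embeds, token_ids) -> (1, seq, vocab) logits` is assumed.
    """
    ids = torch.tensor([[bos_id]])
    for _ in range(max_len):
        logits = decoder(image_embeds, ids)         # condition on image + prefix
        next_id = logits[0, -1].argmax().item()     # greedy next-token choice
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)
        if next_id == eos_id:
            break
    return ids[0, 1:]                               # label tokens, sans BOS
```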
- How can objects help action recognition? [74.29564964727813]
We investigate how we can use knowledge of objects to design better video models.
First, we propose an object-guided token sampling strategy that enables us to retain a small fraction of the input tokens.
Second, we propose an object-aware attention module that enriches our feature representation with object information.
arXiv Detail & Related papers (2023-06-20T17:56:16Z)
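One plausible form of the object-guided token sampling named above: retain patch tokens whose centers fall inside detected object boxes, plus a small random sample of background. The interface and the background ratio are assumptions, not the paper's strategy.

```python
import torch

def object_guided_sampling(tokens, centers, boxes, background_ratio=0.1):
    """Keep object-covered patch tokens and a random slice of the background.

    tokens: (N, D); centers: (N, 2) patch centers; boxes: (M, 4) as x1,y1,x2,y2.
    """
    cx, cy = centers[:, 0:1], centers[:, 1:2]                        # (N, 1)
    inside = ((cx >= boxes[:, 0]) & (cx <= boxes[:, 2]) &
              (cy >= boxes[:, 1]) & (cy <= boxes[:, 3])).any(dim=1)  # (N,)
    bg = torch.nonzero(~inside).squeeze(1)
    keep_bg = bg[torch.randperm(len(bg))[: int(background_ratio * len(bg))]]
    keep = torch.cat([torch.nonzero(inside).squeeze(1), keep_bg]).sort().values
    return tokens[keep]
```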
- VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning [82.09856883441044]
Video understanding relies on perceiving the global content and modeling its internal connections.
We propose a block-wise strategy where we mask neighboring video tokens in both spatial and temporal domains.
We also add an augmentation-free contrastive learning method to further capture global content.
arXiv Detail & Related papers (2021-06-21T16:48:19Z)
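A sketch of the block-wise masking VIMPAC's summary describes: mask contiguous spatio-temporal blocks of token positions, so a masked token cannot be trivially copied from its immediate neighbors. Block size and mask ratio are illustrative, not VIMPAC's settings.

```python
import torch

def blockwise_mask(t, h, w, block=(2, 4, 4), ratio=0.5):
    """Build a (T, H, W) boolean mask over video token positions by masking
    whole spatio-temporal blocks rather than i.i.d. random tokens.
    Assumes t, h, w are divisible by the block dimensions.
    """
    bt, bh, bw = block
    grid = (t // bt, h // bh, w // bw)
    n_blocks = grid[0] * grid[1] * grid[2]
    chosen = torch.rand(n_blocks) < ratio                  # mask ~ratio of blocks
    mask = chosen.view(grid).repeat_interleave(bt, 0)
    mask = mask.repeat_interleave(bh, 1).repeat_interleave(bw, 2)
    return mask                                            # True = masked token
```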
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
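A skeletal reading of the LASO components named above, with the position-dependent summarizer modeled as cross-attention from learned positional queries so all output tokens are predicted at once. Layer counts, dimensions, and the single-attention PDS are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LASOSketch(nn.Module):
    """Encoder -> position-dependent summarizer -> decoder, non-autoregressive."""
    def __init__(self, d=256, vocab=5000, max_tokens=64):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
        self.queries = nn.Parameter(torch.randn(max_tokens, d))  # one per output slot
        self.pds = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
        self.out = nn.Linear(d, vocab)

    def forward(self, speech_feats):                  # (B, frames, d)
        memory = self.encoder(speech_feats)
        q = self.queries.unsqueeze(0).expand(speech_feats.shape[0], -1, -1)
        summary, _ = self.pds(q, memory, memory)      # position-dependent summaries
        return self.out(self.decoder(summary))        # all tokens spelled at once
```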
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.