ElasticTok: Adaptive Tokenization for Image and Video
- URL: http://arxiv.org/abs/2410.08368v2
- Date: Sun, 02 Feb 2025 19:28:27 GMT
- Title: ElasticTok: Adaptive Tokenization for Image and Video
- Authors: Wilson Yan, Volodymyr Mnih, Aleksandra Faust, Matei Zaharia, Pieter Abbeel, Hao Liu
- Abstract summary: We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens.
During inference, ElasticTok can dynamically allocate tokens when needed.
Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
- Score: 109.75935878130582
- Abstract: Efficient video tokenization remains a key bottleneck in learning general purpose vision models that are capable of processing long video sequences. Prevailing approaches are restricted to encoding videos to a fixed number of tokens, where too few tokens will result in overly lossy encodings, and too many tokens will result in prohibitively long sequence lengths. In this work, we introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens. To enable this in a computationally scalable way, we propose a masking technique that drops a random number of tokens at the end of each frame's token encoding. During inference, ElasticTok can dynamically allocate tokens when needed -- more complex data can leverage more tokens, while simpler data only needs a few tokens. Our empirical evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage, paving the way for future development of more powerful multimodal models, world models, and agents.
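As a concrete reading of the abstract, here is a minimal sketch of the two mechanisms it describes: training-time tail masking and inference-time token allocation. `encode`, `decode`, the budget schedule, and the MSE threshold are illustrative assumptions, not the paper's implementation.

```python
import torch

def random_tail_mask(tokens: torch.Tensor, min_keep: int = 1) -> torch.Tensor:
    """Training-time masking: for each frame, keep a random-length prefix of
    its token encoding and drop the tail, as the abstract describes.

    tokens: (batch, num_tokens, dim) encoding of one frame.
    Returns a boolean keep-mask of shape (batch, num_tokens).
    """
    b, n, _ = tokens.shape
    keep = torch.randint(min_keep, n + 1, (b,))       # tokens to keep per frame
    positions = torch.arange(n).unsqueeze(0)          # (1, n)
    return positions < keep.unsqueeze(1)              # (b, n) prefix mask

def allocate_tokens(encode, decode, frame, budgets, tol):
    """Inference-time allocation: return the smallest token budget whose
    reconstruction error falls below `tol`. The linear scan and the MSE
    criterion are assumptions; the paper may search differently."""
    tokens = encode(frame)
    for k in budgets:                                  # e.g. [32, 64, 128, ...]
        recon = decode(tokens[:, :k])
        if torch.mean((recon - frame) ** 2).item() < tol:
            return k                                   # simple data -> few tokens
    return tokens.shape[1]                             # complex data -> full length
```

Because training always drops a suffix, every prefix of an encoding is itself a valid encoding, which is what makes the inference-time scan over budgets cheap.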
Related papers
- FlexTok: Resampling Images into 1D Token Sequences of Flexible Length [16.76602756308683]
We introduce FlexTok, a tokenizer that projects 2D images into variable-length, ordered 1D token sequences.
We evaluate our approach in an autoregressive generation setting using a simple GPT-style Transformer.
arXiv Detail & Related papers (2025-02-19T18:59:44Z)
- Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction [93.69757398746017]
CoordTok is a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos.
CoordTok encodes a video into factorized triplane representations and reconstructs patches that correspond to randomly sampled $(x,y,t)$ coordinates.
arXiv Detail & Related papers (2024-11-22T06:50:44Z)
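A rough sketch of the coordinate-based lookup that the CoordTok summary describes: query three factorized feature planes at sampled $(x,y,t)$ points and fuse the results for patch decoding. The plane layout, the sum fusion, and the axis conventions are assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def query_triplane(planes: dict, coords: torch.Tensor) -> torch.Tensor:
    """Look up features for N points with normalized (x, y, t) in [-1, 1].

    planes: {"xy", "xt", "yt"} -> feature maps of shape (1, C, H, W), where
    the key names the two coordinates indexing the plane (first along width,
    second along height, matching grid_sample). Layout is illustrative.
    """
    x, y, t = coords[:, 0], coords[:, 1], coords[:, 2]
    pairs = {"xy": (x, y), "xt": (x, t), "yt": (y, t)}
    feats = []
    for key, (u, v) in pairs.items():
        grid = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)      # (1, N, 1, 2)
        sampled = F.grid_sample(planes[key], grid, align_corners=True)
        feats.append(sampled.squeeze(0).squeeze(-1).T)            # (N, C)
    return sum(feats)  # fused point features, to be decoded into a patch
```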
- Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model [45.01871133425388]
We propose Multi-stage Token Dropping (MustDrop), which measures the importance of each token over its whole lifecycle.
MustDrop reduces FLOPs on LLaVA by about 88.5% at a compression ratio of 92.2% while maintaining comparable accuracy.
arXiv Detail & Related papers (2024-11-16T13:45:33Z)
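As a hedged illustration of the token dropping MustDrop builds on: keep only the highest-scoring visual tokens. The attention-mass score and single-stage form are assumptions; the paper's actual multi-stage criteria are not reproduced here.

```python
import torch

def drop_unimportant_tokens(tokens, scores, keep_ratio=0.078):
    """Keep only the highest-scoring visual tokens, preserving their order.

    tokens: (B, N, D); scores: (B, N) per-token importance (e.g. attention
    mass received, an assumption). keep_ratio ~= 0.078 corresponds to the
    92.2% compression ratio quoted in the summary above.
    """
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices.sort(dim=1).values       # original order
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
```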
- Video Token Merging for Long-form Video Understanding [17.59960070514554]
We propose a learnable video token merging algorithm that dynamically merges tokens based on their saliency.
Our approach significantly reduces memory costs by 84% and boosts throughput by approximately 6.89 times compared to baseline algorithms.
arXiv Detail & Related papers (2024-10-31T09:55:32Z)
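A simplified stand-in for the saliency-driven merging in the summary above: pool fixed windows of tokens with saliency-softmax weights. The paper's algorithm is learnable and merges dynamically rather than on a fixed grid.

```python
import torch

def saliency_weighted_merge(tokens, saliency, window=2):
    """Merge each window of `window` consecutive tokens into one token,
    weighting by saliency so salient content dominates the merged token.

    tokens: (N, D); saliency: (N,); assumes N is divisible by `window`.
    """
    n, d = tokens.shape
    t = tokens.view(n // window, window, d)
    w = torch.softmax(saliency.view(n // window, window), dim=1)  # per-window weights
    return (w.unsqueeze(-1) * t).sum(dim=1)                       # (N // window, D)
```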
- Object Recognition as Next Token Prediction [99.40793702627396]
We present an approach to pose object recognition as next token prediction.
The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels.
arXiv Detail & Related papers (2023-12-04T18:58:40Z)
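The mechanism in the summary above reduces to next-token decoding of label text conditioned on image embeddings. The `decoder` interface below is hypothetical, and this single-label greedy sketch omits the paper's customized decoding for ranking multiple labels.

```python
import torch

@torch.no_grad()
def predict_label_tokens(decoder, image_embeds, bos_id, eos_id, max_len=8):
    """Greedily decode a label, one text token at a time, from image embeddings.
    `decoder(image_embeds, token_ids) -> (1, seq, vocab) logits` is assumed.
    """
    ids = torch.tensor([[bos_id]])
    for _ in range(max_len):
        logits = decoder(image_embeds, ids)         # condition on image + prefix
        next_id = logits[0, -1].argmax().item()     # greedy next-token choice
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)
        if next_id == eos_id:
            break
    return ids[0, 1:]                               # label tokens, sans BOS
```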
- How can objects help action recognition? [74.29564964727813]
We investigate how we can use knowledge of objects to design better video models.
First, we propose an object-guided token sampling strategy that enables us to retain a small fraction of the input tokens.
Second, we propose an object-aware attention module that enriches our feature representation with object information.
arXiv Detail & Related papers (2023-06-20T17:56:16Z)
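One plausible form of the object-guided token sampling named above: retain patch tokens whose centers fall inside detected object boxes, plus a small random sample of background. The interface and the background ratio are assumptions, not the paper's strategy.

```python
import torch

def object_guided_sampling(tokens, centers, boxes, background_ratio=0.1):
    """Keep object-covered patch tokens and a random slice of the background.

    tokens: (N, D); centers: (N, 2) patch centers; boxes: (M, 4) as x1,y1,x2,y2.
    """
    cx, cy = centers[:, 0:1], centers[:, 1:2]                        # (N, 1)
    inside = ((cx >= boxes[:, 0]) & (cx <= boxes[:, 2]) &
              (cy >= boxes[:, 1]) & (cy <= boxes[:, 3])).any(dim=1)  # (N,)
    bg = torch.nonzero(~inside).squeeze(1)
    keep_bg = bg[torch.randperm(len(bg))[: int(background_ratio * len(bg))]]
    keep = torch.cat([torch.nonzero(inside).squeeze(1), keep_bg]).sort().values
    return tokens[keep]
```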
- VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning [82.09856883441044]
Video understanding relies on perceiving the global content and modeling its internal connections.
We propose a block-wise strategy where we mask neighboring video tokens in both spatial and temporal domains.
We also add an augmentation-free contrastive learning method to further capture global content.
arXiv Detail & Related papers (2021-06-21T16:48:19Z)
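A sketch of the block-wise masking VIMPAC's summary describes: mask contiguous spatio-temporal blocks of token positions, so a masked token cannot be trivially copied from its immediate neighbors. Block size and mask ratio are illustrative, not VIMPAC's settings.

```python
import torch

def blockwise_mask(t, h, w, block=(2, 4, 4), ratio=0.5):
    """Build a (T, H, W) boolean mask over video token positions by masking
    whole spatio-temporal blocks rather than i.i.d. random tokens.
    Assumes t, h, w are divisible by the block dimensions.
    """
    bt, bh, bw = block
    grid = (t // bt, h // bh, w // bw)
    n_blocks = grid[0] * grid[1] * grid[2]
    chosen = torch.rand(n_blocks) < ratio                  # mask ~ratio of blocks
    mask = chosen.view(grid).repeat_interleave(bt, 0)
    mask = mask.repeat_interleave(bh, 1).repeat_interleave(bw, 2)
    return mask                                            # True = masked token
```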
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
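A skeletal reading of the LASO components named above, with the position-dependent summarizer modeled as cross-attention from learned positional queries so all output tokens are predicted at once. Layer counts, dimensions, and the single-attention PDS are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LASOSketch(nn.Module):
    """Encoder -> position-dependent summarizer -> decoder, non-autoregressive."""
    def __init__(self, d=256, vocab=5000, max_tokens=64):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
        self.queries = nn.Parameter(torch.randn(max_tokens, d))  # one per output slot
        self.pds = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
        self.out = nn.Linear(d, vocab)

    def forward(self, speech_feats):                  # (B, frames, d)
        memory = self.encoder(speech_feats)
        q = self.queries.unsqueeze(0).expand(speech_feats.shape[0], -1, -1)
        summary, _ = self.pds(q, memory, memory)      # position-dependent summaries
        return self.out(self.decoder(summary))        # all tokens spelled at once
```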
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.