ElasticTok: Adaptive Tokenization for Image and Video
- URL: http://arxiv.org/abs/2410.08368v1
- Date: Thu, 10 Oct 2024 20:54:15 GMT
- Title: ElasticTok: Adaptive Tokenization for Image and Video
- Authors: Wilson Yan, Matei Zaharia, Volodymyr Mnih, Pieter Abbeel, Aleksandra Faust, Hao Liu
- Abstract summary: We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens.
During inference, ElasticTok can dynamically allocate tokens when needed.
Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
- Score: 109.75935878130582
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficient video tokenization remains a key bottleneck in learning general purpose vision models that are capable of processing long video sequences. Prevailing approaches are restricted to encoding videos to a fixed number of tokens, where too few tokens will result in overly lossy encodings, and too many tokens will result in prohibitively long sequence lengths. In this work, we introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens. To enable this in a computationally scalable way, we propose a masking technique that drops a random number of tokens at the end of each frame's token encoding. During inference, ElasticTok can dynamically allocate tokens when needed -- more complex data can leverage more tokens, while simpler data only needs a few tokens. Our empirical evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage, paving the way for future development of more powerful multimodal models, world models, and agents.
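The tail-drop masking is simple to sketch. Below is a minimal, illustrative reading of the idea in PyTorch, assuming each frame is encoded to a flat sequence of `max_tokens` latents; the function names and the inference-time search are stand-ins, not the authors' code.

```python
# Minimal sketch of ElasticTok-style tail masking (illustrative, not the
# authors' code). Each frame is encoded to `max_tokens` latent tokens; during
# training we keep only a random-length prefix and zero out the tail, so the
# decoder learns to reconstruct from any token budget.
import torch

def random_tail_mask(tokens: torch.Tensor, min_keep: int = 1) -> torch.Tensor:
    """tokens: (batch, max_tokens, dim). Masks a random-length tail per sample."""
    batch, max_tokens, _ = tokens.shape
    # Sample an independent keep-length for each frame in the batch.
    keep = torch.randint(min_keep, max_tokens + 1, (batch,), device=tokens.device)
    positions = torch.arange(max_tokens, device=tokens.device)
    mask = (positions.unsqueeze(0) < keep.unsqueeze(1)).unsqueeze(-1)  # (B, T, 1)
    return tokens * mask  # tail tokens are dropped (zeroed)

# At inference, one can search for the smallest prefix whose reconstruction
# error falls below a target threshold (`encode_decode` is a hypothetical
# callable standing in for the trained model):
def smallest_sufficient_prefix(encode_decode, frame, max_tokens, threshold):
    for k in range(1, max_tokens + 1):
        recon = encode_decode(frame, num_tokens=k)
        if torch.mean((recon - frame) ** 2) < threshold:
            return k
    return max_tokens
```

The linear scan above is only the simplest budget-selection rule; a binary search over prefix lengths would find the same cutoff in fewer decoder calls.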
Related papers
- Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction [93.69757398746017]
CoordTok is a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos.
CoordTok encodes a video into factorized triplane representations and reconstructs patches that correspond to randomly sampled $(x,y,t)$ coordinates.
arXiv Detail & Related papers (2024-11-22T06:50:44Z)
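A rough sketch of the coordinate-based lookup this summary describes, assuming factorized (xy, xt, yt) feature planes queried with bilinear sampling; all shapes and names are illustrative, not CoordTok's actual interface.

```python
# Rough sketch of a factorized-triplane lookup at sampled (x, y, t)
# coordinates (an illustrative reading of the CoordTok summary).
import torch
import torch.nn.functional as F

def query_triplane(planes: dict, coords: torch.Tensor) -> torch.Tensor:
    """planes: {'xy','xt','yt'} -> (1, C, H, W) feature maps.
    coords: (N, 3) with (x, y, t) each in [-1, 1]. Returns (N, 3C) features."""
    x, y, t = coords[:, 0], coords[:, 1], coords[:, 2]
    feats = []
    for name, (u, v) in {"xy": (x, y), "xt": (x, t), "yt": (y, t)}.items():
        grid = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)       # (1, N, 1, 2)
        sampled = F.grid_sample(planes[name], grid, align_corners=True)
        feats.append(sampled.view(planes[name].shape[1], -1).t())  # (N, C)
    return torch.cat(feats, dim=-1)

# A small decoder (not shown) would map these features to the pixel patch at
# each sampled coordinate; training reconstructs randomly sampled patches.
```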
- Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model [45.01871133425388]
We propose Multi-stage Token Dropping (MustDrop) to measure the importance of each token across its whole lifecycle.
MustDrop reduces about 88.5% FLOPs on LLaVA with a compression ratio of 92.2% while maintaining comparable accuracy.
arXiv Detail & Related papers (2024-11-16T13:45:33Z)
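The summary above does not specify MustDrop's scoring rule, so the sketch below only illustrates generic importance-based token dropping with an assumed per-token score; it is not the paper's multi-stage pipeline.

```python
# Generic sketch of importance-based visual-token dropping. The attention a
# vision token receives from the text query is used purely as an assumed
# importance proxy; MustDrop's actual multi-stage criteria differ.
import torch

def drop_tokens(vision_tokens: torch.Tensor, importance: torch.Tensor,
                keep_ratio: float = 0.1) -> torch.Tensor:
    """vision_tokens: (B, N, D); importance: (B, N) per-token scores.
    Keeps the top `keep_ratio` fraction of tokens per sample."""
    B, N, D = vision_tokens.shape
    k = max(1, int(N * keep_ratio))
    idx = importance.topk(k, dim=1).indices                    # (B, k)
    return vision_tokens.gather(1, idx.unsqueeze(-1).expand(B, k, D))
```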
- Video Token Merging for Long-form Video Understanding [17.59960070514554]
We propose a learnable video token merging algorithm that dynamically merges tokens based on their saliency.
Our approach significantly reduces memory costs by 84% and boosts throughput by approximately 6.89 times compared to baseline algorithms.
arXiv Detail & Related papers (2024-10-31T09:55:32Z)
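A minimal illustration of saliency-driven merging, assuming the saliency scores are given: low-saliency tokens are averaged into their most similar kept token. The paper's algorithm is learnable; this only shows the merging mechanics.

```python
# Illustrative (non-learned) token-merging step: least-salient tokens are
# averaged into their nearest kept token by cosine similarity.
import torch
import torch.nn.functional as F

def merge_tokens(tokens: torch.Tensor, saliency: torch.Tensor, num_keep: int):
    """tokens: (N, D); saliency: (N,). Returns (num_keep, D) merged tokens."""
    order = saliency.argsort(descending=True)
    keep_idx, drop_idx = order[:num_keep], order[num_keep:]
    kept, dropped = tokens[keep_idx], tokens[drop_idx]
    # Assign each dropped token to its most similar kept token.
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).t()
    assign = sim.argmax(dim=-1)
    merged = kept.clone()
    counts = torch.ones(num_keep, device=tokens.device)
    for j, i in enumerate(assign):
        merged[i] += dropped[j]
        counts[i] += 1
    return merged / counts.unsqueeze(-1)
```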
- Object Recognition as Next Token Prediction [99.40793702627396]
We present an approach to pose object recognition as next token prediction.
The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels.
arXiv Detail & Related papers (2023-12-04T18:58:40Z)
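The decoding loop this framing implies can be sketched as greedy next-token prediction over label text; `decoder` and `tokenizer` below are hypothetical stand-ins for the paper's components.

```python
# Schematic greedy decoding of a label string from image embeddings, in the
# spirit of "recognition as next-token prediction". `decoder` and `tokenizer`
# are hypothetical stand-ins, not the paper's API.
import torch

@torch.no_grad()
def predict_label(decoder, tokenizer, image_embeds: torch.Tensor,
                  max_len: int = 8) -> str:
    """image_embeds: (1, N, D) visual prefix fed to a language decoder."""
    token_ids = [tokenizer.bos_id]
    for _ in range(max_len):
        logits = decoder(image_embeds, torch.tensor([token_ids]))  # (1, L, V)
        next_id = int(logits[0, -1].argmax())
        if next_id == tokenizer.eos_id:
            break
        token_ids.append(next_id)
    return tokenizer.decode(token_ids[1:])
```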
- How can objects help action recognition? [74.29564964727813]
We investigate how we can use knowledge of objects to design better video models.
First, we propose an object-guided token sampling strategy that enables us to retain a small fraction of the input tokens.
Second, we propose an object-aware attention module that enriches our feature representation with object information.
arXiv Detail & Related papers (2023-06-20T17:56:16Z)
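A simple sketch of object-guided token sampling, assuming bounding boxes from an off-the-shelf detector: only patch tokens whose centers fall inside a box are kept. The paper's exact strategy may differ.

```python
# Illustrative object-guided token sampling: keep patch tokens whose grid
# cell center lies inside a detected object box.
import torch

def object_guided_sample(tokens: torch.Tensor, boxes: torch.Tensor,
                         grid: int) -> torch.Tensor:
    """tokens: (grid*grid, D) patch tokens of one frame.
    boxes: (M, 4) as (x0, y0, x1, y1) in [0, 1]. Returns the kept tokens."""
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    # Patch centers in normalized image coordinates.
    cx, cy = (xs.flatten() + 0.5) / grid, (ys.flatten() + 0.5) / grid
    keep = torch.zeros(grid * grid, dtype=torch.bool)
    for x0, y0, x1, y1 in boxes:
        keep |= (cx >= x0) & (cx <= x1) & (cy >= y0) & (cy <= y1)
    return tokens[keep]
```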
- Compound Tokens: Channel Fusion for Vision-Language Representation Learning [36.19486792701684]
We present an effective method for fusing visual-and-language representations for question answering tasks.
By fusing along the channel dimension, the model aligns the tokens more effectively than standard methods.
We demonstrate the effectiveness of compound tokens using an encoder-decoder vision-language model trained end-to-end in the open-vocabulary setting.
arXiv Detail & Related papers (2022-12-02T21:09:52Z)
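Channel fusion amounts to concatenating paired vision and text tokens along the feature dimension rather than the sequence dimension; the sketch below assumes the two token sequences have already been aligned to the same length.

```python
# Minimal view of channel-wise fusion (an illustrative reading of the
# "compound tokens" summary): concatenate aligned vision and text tokens
# along channels instead of appending them along the sequence.
import torch

def channel_fuse(vision: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
    """vision: (B, N, Dv); text: (B, N, Dt), aligned to the same length N
    (e.g. via cross-attention). Returns (B, N, Dv + Dt) compound tokens."""
    return torch.cat([vision, text], dim=-1)

fused = channel_fuse(torch.randn(2, 16, 256), torch.randn(2, 16, 256))
print(fused.shape)  # torch.Size([2, 16, 512])
```

Sequence-concatenation doubles the token count attention must process; channel fusion keeps the sequence length fixed while giving every position both modalities.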
- VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning [82.09856883441044]
Video understanding relies on perceiving the global content and modeling its internal connections.
We propose a block-wise strategy where we mask neighboring video tokens in both spatial and temporal domains.
We also add an augmentation-free contrastive learning method to further capture global content.
arXiv Detail & Related papers (2021-06-21T16:48:19Z)
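Block-wise masking can be sketched as choosing a contiguous spatiotemporal block of tokens to mask jointly, so the model cannot copy from immediate neighbors; the block sizes below are illustrative.

```python
# Sketch of block-wise spatiotemporal masking: one contiguous (t, h, w)
# block of neighboring video tokens is masked jointly, preventing trivial
# fill-in from adjacent patches. Sizes are illustrative.
import torch

def block_mask(T: int, H: int, W: int, t: int = 2, h: int = 4, w: int = 4):
    """Returns a (T, H, W) boolean mask with one random block set to True."""
    mask = torch.zeros(T, H, W, dtype=torch.bool)
    t0 = torch.randint(0, T - t + 1, (1,)).item()
    y0 = torch.randint(0, H - h + 1, (1,)).item()
    x0 = torch.randint(0, W - w + 1, (1,)).item()
    mask[t0:t0 + t, y0:y0 + h, x0:x0 + w] = True
    return mask

print(block_mask(4, 8, 8).float().mean())  # fraction of tokens masked
```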
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
arXiv Detail & Related papers (2021-02-15T15:18:59Z)
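One plausible reading of the PDS is a set of learned position queries that cross-attend to encoder states and predict all output tokens in parallel ("spell once"); the sketch below assumes that reading, and its sizes and single attention layer are illustrative, not LASO's exact architecture.

```python
# Rough sketch of non-autoregressive decoding with learned position queries
# standing in for the position-dependent summarizer (PDS). Layer sizes and
# the single-attention-layer structure are illustrative assumptions.
import torch
import torch.nn as nn

class PositionQueryDecoder(nn.Module):
    def __init__(self, dim=256, max_out=32, vocab=5000):
        super().__init__()
        self.pos_queries = nn.Parameter(torch.randn(max_out, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, enc: torch.Tensor) -> torch.Tensor:
        """enc: (B, L, dim) encoder states. Predicts every output position
        at once instead of token-by-token."""
        q = self.pos_queries.unsqueeze(0).expand(enc.shape[0], -1, -1)
        summary, _ = self.attn(q, enc, enc)   # position queries attend to enc
        return self.out(summary)              # (B, max_out, vocab)

logits = PositionQueryDecoder()(torch.randn(2, 100, 256))
print(logits.shape)  # torch.Size([2, 32, 5000])
```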