Vision-centric Token Compression in Large Language Model
- URL: http://arxiv.org/abs/2502.00791v2
- Date: Tue, 04 Feb 2025 11:45:52 GMT
- Title: Vision-centric Token Compression in Large Language Model
- Authors: Ling Xing, Alex Jinpeng Wang, Rui Yan, Jinhui Tang
- Abstract summary: We show that a smaller vision encoder, applied directly to sequences of text tokens, can rival text encoders on text tasks.
The proposed VIST achieves comparable results with 16% fewer FLOPs and 50% less memory usage.
This approach delivers remarkable results, outperforming traditional text encoder-based methods by 5.7% on average over benchmarks like TriviaQA, NQ, PopQA, TREF, SST2, and SST5.
- Score: 43.36321098385599
- License:
- Abstract: Large Language Models (LLMs) have revolutionized natural language processing, excelling in handling longer sequences. However, the inefficiency and redundancy in processing extended in-context tokens remain a challenge. Many attempts to address this rely on compressing tokens with smaller text encoders, yet we question whether text encoders are truly indispensable. Our journey leads to an unexpected discovery: a much smaller vision encoder, applied directly to sequences of text tokens, can rival text encoders on text tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small text understanding benchmarks, VIST leads to comparable results with 16% fewer FLOPs and 50% less memory usage. We further uncover significant token redundancy and devise a frequency-based masking strategy to guide the focus of the visual encoder toward the most critical tokens. Interestingly, we observe the trained visual encoder performs like a summarizer, selectively ignoring less important words such as prepositions and conjunctions. This approach delivers remarkable results, outperforming traditional text encoder-based methods by 5.7% on average over benchmarks like TriviaQA, NQ, PopQA, TREF, SST2, and SST5, setting a new standard for token efficiency in LLMs.
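The frequency-based masking strategy described in the abstract can be illustrated with a small sketch: token frequencies are counted over a corpus, and the most frequent tokens (typically function words such as prepositions and conjunctions) are masked so the encoder attends to the rarer, more informative ones. The whitespace tokenizer, keep ratio, and mask symbol below are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of frequency-based token masking (illustrative, not the paper's code).
from collections import Counter

def build_frequency_table(corpus):
    """Count whitespace-tokenized word frequencies over a toy corpus."""
    counts = Counter()
    for doc in corpus:
        counts.update(doc.lower().split())
    return counts

def frequency_mask(tokens, counts, keep_ratio=0.7, mask_token="[MASK]"):
    """Keep the `keep_ratio` least frequent tokens; mask the rest."""
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Rank token positions by corpus frequency, rarest first.
    ranked = sorted(range(len(tokens)), key=lambda i: counts[tokens[i].lower()])
    keep = set(ranked[:n_keep])
    return [tok if i in keep else mask_token for i, tok in enumerate(tokens)]

corpus = ["the cat sat on the mat", "a dog and a cat ran to the park"]
counts = build_frequency_table(corpus)
print(frequency_mask("the dog ran to the park".split(), counts))
# -> ['[MASK]', 'dog', 'ran', 'to', '[MASK]', 'park']
```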
Related papers
- Batching BPE Tokenization Merges [55.2480439325792]
BatchBPE is an open-source, pure-Python implementation of the Byte Pair Encoding (BPE) algorithm.
It can be used to train a high-quality tokenizer on a basic laptop.
arXiv Detail & Related papers (2024-08-05T09:37:21Z)
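The BatchBPE entry above refers to a pure-Python Byte Pair Encoding trainer. A minimal sketch of the classic BPE merge loop is shown below, operating on a dictionary of unique words and their counts, which is one common way to batch the work; this illustrates the general algorithm, not the BatchBPE codebase.

```python
# Toy byte-pair-encoding merge loop over a word-frequency dictionary (illustrative sketch).
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Each word is a tuple of single-character symbols with its corpus count.
vocab = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for _ in range(4):                      # learn four merges
    best = pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(vocab, best)
print(vocab)
```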
- Training LLMs over Neurally Compressed Text [55.11828645767342]
This paper explores the idea of training large language models (LLMs) over highly compressed text.
We propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length.
We demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks.
arXiv Detail & Related papers (2024-04-04T17:48:28Z)
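The Equal-Info Windows entry above segments text into blocks that each compress to the same bit length. A rough sketch of that segmentation logic appears below: grow a window until its compressed size reaches a fixed bit budget, then start a new window. The paper compresses with an arithmetic coder driven by a small language model; zlib is used here purely as a stand-in compressor, and the 256-bit budget is an arbitrary illustrative choice.

```python
# Sketch of equal-bit-budget windowing; zlib stands in for the paper's arithmetic coder.
import zlib

def equal_info_windows(text, bit_budget=256):
    windows, start = [], 0
    for end in range(1, len(text) + 1):
        compressed_bits = 8 * len(zlib.compress(text[start:end].encode()))
        if compressed_bits >= bit_budget:
            windows.append(text[start:end])
            start = end
    if start < len(text):
        windows.append(text[start:])          # final, possibly short window
    return windows

sample = "Large language models process long contexts, " * 4
for w in equal_info_windows(sample):
    print(len(w), "chars ->", 8 * len(zlib.compress(w.encode())), "compressed bits")
```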
- Learning with Unmasked Tokens Drives Stronger Vision Learners [39.752789949834536]
Masked image modeling (MIM) has become a leading self-supervised learning strategy.
We improve MIM by explicitly incorporating unmasked tokens into the training process.
We achieve 84.2% top-1 accuracy with ViT-B on ImageNet-1K, a 0.6%p gain.
arXiv Detail & Related papers (2023-10-20T15:42:47Z)
- Semantic Compression With Large Language Models [1.0874100424278175]
Large language models (LLMs) are revolutionizing information retrieval, question answering, summarization, and code generation tasks.
LLMs are inherently limited by the number of input and output tokens that can be processed at once.
This paper presents three contributions to research on LLMs.
arXiv Detail & Related papers (2023-04-25T01:47:05Z)
- DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training [73.74291217502928]
We propose a simple framework, named DeCap, for zero-shot captioning.
We introduce a lightweight visual-aware language decoder.
We project the visual embedding into the CLIP text embedding space, while the projected embedding retains the information of the visual input.
arXiv Detail & Related papers (2023-03-06T11:02:47Z)
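The DeCap entry above states that the visual embedding is projected into the CLIP text embedding space before decoding. One simple way to realize such a projection, used here purely as an illustrative assumption, is a similarity-weighted average over a memory bank of text embeddings; the memory size, temperature, and use of cosine similarity below are assumptions for this sketch, not DeCap's documented configuration.

```python
# Illustrative projection of an image embedding onto the span of stored text embeddings.
import numpy as np

def project_to_text_space(image_emb, text_memory, temperature=0.05):
    """Return a convex combination of memory text embeddings weighted by similarity."""
    img = image_emb / np.linalg.norm(image_emb)
    mem = text_memory / np.linalg.norm(text_memory, axis=1, keepdims=True)
    sims = mem @ img                              # cosine similarity to each memory entry
    weights = np.exp(sims / temperature)
    weights /= weights.sum()
    projected = weights @ mem                     # lies in the text embedding space
    return projected / np.linalg.norm(projected)

rng = np.random.default_rng(0)
text_memory = rng.normal(size=(1000, 512))        # e.g. CLIP text embeddings of captions
image_emb = rng.normal(size=512)                  # e.g. a CLIP image embedding
print(project_to_text_space(image_emb, text_memory).shape)   # (512,)
```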
- Memory Augmented Lookup Dictionary based Language Modeling for Automatic Speech Recognition [20.926163659469587]
We propose a new memory augmented lookup dictionary based Transformer architecture for LM.
The newly introduced lookup dictionary incorporates rich contextual information from the training set, which is vital for correctly predicting long-tail tokens.
The proposed method is shown to outperform the baseline Transformer LM by a large margin on both word/character error rate and tail-token error rate.
arXiv Detail & Related papers (2022-12-30T22:26:57Z)
- ConTextual Mask Auto-Encoder for Dense Passage Retrieval [49.49460769701308]
CoT-MAE is a simple yet effective generative pre-training method for dense passage retrieval.
It learns to compress the sentence semantics into a dense vector through self-supervised and context-supervised masked auto-encoding.
We conduct experiments on large-scale passage retrieval benchmarks and show considerable improvements over strong baselines.
arXiv Detail & Related papers (2022-08-16T11:17:22Z)
- Efficient Long-Text Understanding with Short-Text Models [38.8375175429553]
SLED is a simple approach for processing long sequences that reuses battle-tested short-text pretrained LMs.
We partition the input into overlapping chunks, encode each with a short-text LM encoder and use the pretrained decoder to fuse information across chunks.
We find that SLED is competitive with specialized models that are up to 50x larger and require a dedicated and expensive pretraining step.
arXiv Detail & Related papers (2022-08-01T11:14:39Z)
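The SLED entry above describes partitioning a long input into overlapping chunks, encoding each chunk with a short-text encoder, and letting the pretrained decoder fuse information across chunks. The sketch below shows the chunking and the shape of that fusion step with a stubbed-out encoder; the chunk length and overlap are arbitrary illustrative choices, and this is not the SLED implementation.

```python
# Sketch of overlapping-chunk encoding for long inputs (encoder is stubbed out).
def chunk_with_overlap(tokens, chunk_len=256, overlap=32):
    """Split `tokens` into chunks of `chunk_len` that share `overlap` tokens."""
    stride = chunk_len - overlap
    return [list(tokens[i:i + chunk_len])
            for i in range(0, max(len(tokens) - overlap, 1), stride)]

def encode_long_input(tokens, encode_chunk):
    """Encode each chunk independently and concatenate the representations,
    which a pretrained decoder could then cross-attend over (fusion-in-decoder)."""
    fused = []
    for chunk in chunk_with_overlap(tokens):
        fused.extend(encode_chunk(chunk))
    return fused

def dummy_encoder(chunk):
    # Stub: one 4-dim vector per token; a real system would call a
    # pretrained short-text LM encoder here.
    return [[float(t)] * 4 for t in chunk]

tokens = list(range(1000))
reps = encode_long_input(tokens, dummy_encoder)
print(len(reps), "token representations from", len(chunk_with_overlap(tokens)), "chunks")
```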
- Cross-Thought for Sentence Encoder Pre-training [89.32270059777025]
Cross-Thought is a novel approach to pre-training a sequence encoder.
We train a Transformer-based sequence encoder over a large set of short sequences.
Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders.
arXiv Detail & Related papers (2020-10-07T21:02:41Z)