Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding
- URL: http://arxiv.org/abs/2407.14439v1
- Date: Fri, 19 Jul 2024 16:11:15 GMT
- Title: Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding
- Authors: Renshan Zhang, Yibo Lyu, Rui Shao, Gongwei Chen, Weili Guan, Liqiang Nie
- Abstract summary: Most document understanding methods preserve all tokens within sub-images and treat them equally.
This neglects their different informativeness and leads to a significant increase in the number of image tokens.
We propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing.
- Score: 54.532578213126065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cropping high-resolution document images into multiple sub-images is the most widely used approach for current Multimodal Large Language Models (MLLMs) to perform document understanding. Most current document understanding methods preserve all tokens within sub-images and treat them equally. This neglects their differing informativeness and leads to a significant increase in the number of image tokens. To perform more adaptive and efficient document understanding, we propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing. First, we propose an innovative approach for assessing pattern repetitiveness based on the correlation among patch tokens. This method identifies redundant tokens, allowing the information density of each sub-image to be determined. Second, we present a token-level sampling method that efficiently captures the most informative tokens by exploiting the correlation between the [CLS] token and the patch tokens. By integrating these strategies, we develop a plug-and-play adaptive compressor module that can be seamlessly incorporated into MLLMs that use cropping techniques. This module not only improves processing speed during training and inference but also maintains comparable performance. We conduct experiments with the SOTA document understanding model mPLUG-DocOwl1.5, and the effectiveness of our method is demonstrated through extensive comparisons with other compression methods.
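The two steps described in the abstract, redundancy estimation from patch-token correlation and [CLS]-guided sampling of informative tokens, can be sketched in a minimal NumPy form. This is an illustrative reconstruction, not the authors' implementation: the function names, the cosine-similarity threshold, and the keep ratio are assumptions.

```python
import numpy as np

def information_density(patch_tokens, sim_threshold=0.95):
    """Estimate a sub-image's information density from pairwise patch-token
    correlation: a token whose maximum cosine similarity to any other token
    exceeds the threshold is counted as redundant (a repeated pattern)."""
    # Normalize rows to unit length so dot products are cosine similarities.
    normed = patch_tokens / np.linalg.norm(patch_tokens, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -1.0)            # ignore self-similarity
    redundant = sim.max(axis=1) > sim_threshold
    return 1.0 - redundant.mean()          # higher = more informative sub-image

def sample_informative_tokens(patch_tokens, cls_token, keep_ratio):
    """Keep the patch tokens most correlated with the [CLS] token,
    preserving their original spatial order."""
    normed = patch_tokens / np.linalg.norm(patch_tokens, axis=1, keepdims=True)
    cls_normed = cls_token / np.linalg.norm(cls_token)
    scores = normed @ cls_normed           # correlation with [CLS]
    k = max(1, int(round(keep_ratio * len(patch_tokens))))
    keep = np.sort(np.argsort(-scores)[:k])
    return patch_tokens[keep]
```

In the paper's setting, the density estimate would drive how aggressive the keep ratio is for each sub-image, so that text-dense crops retain more tokens than blank or repetitive ones.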
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
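One way to read "compresses the number of visual tokens at different granularity" is Matryoshka-style pooling of a square token map into nested, coarser grids. The sketch below is a hedged guess at that idea using plain average pooling; the actual MME embedder is learned, and its mechanism may differ substantially.

```python
import numpy as np

def matryoshka_pool(tokens, grid, out_grids):
    """Average-pool a (grid*grid, d) visual token map down to several
    coarser granularities, returning one token set per requested grid."""
    d = tokens.shape[-1]
    fmap = tokens.reshape(grid, grid, d)
    pooled = {}
    for g in out_grids:
        s = grid // g  # pooling window size; assumes grid is divisible by g
        pooled[g] = fmap.reshape(g, s, g, s, d).mean(axis=(1, 3)).reshape(g * g, d)
    return pooled
```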
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- Compression with Global Guidance: Towards Training-free High-Resolution MLLMs Acceleration [28.311125014789905]
Multimodal large language models (MLLMs) have attracted considerable attention due to their exceptional performance in visual content understanding and reasoning.
Token compression techniques, which reduce the number of visual tokens, have demonstrated their effectiveness in reducing computational costs.
We propose a novel token compression method, GlobalCom$^2$, tailored for high-resolution MLLMs.
arXiv Detail & Related papers (2025-01-09T11:57:58Z)
- Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models [81.74999702045339]
Multi-Level Optimal Transport (MultiLevelOT) is a novel approach that advances the optimal transport for universal cross-tokenizer knowledge distillation.
Our method aligns the logit distributions of the teacher and the student at both token and sequence levels.
At the token level, MultiLevelOT integrates both global and local information by jointly optimizing all tokens within a sequence to enhance robustness.
arXiv Detail & Related papers (2024-12-19T04:51:06Z)
- iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models [24.0346607116299]
We introduce iLLaVA, a simple method that can be seamlessly deployed upon current Large Vision-Language Models (LVLMs).
iLLaVA achieves this by finding and gradually merging the redundant tokens with an accurate and fast algorithm.
On tasks across different domains including single-image, multi-images and videos, iLLaVA demonstrates strong generalizability with consistently promising efficiency.
arXiv Detail & Related papers (2024-12-09T07:22:19Z) - mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding [103.05835688963947]
We propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens.
DocOwl2 sets a new state-of-the-art across multi-page document understanding benchmarks and reduces first token latency by more than 50%.
Compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens.
arXiv Detail & Related papers (2024-09-05T11:09:00Z) - Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment [40.63340635482609]
Existing image-text modality alignment in Vision Language Models (VLMs) treats each text token equally in an autoregressive manner.
We advocate for assigning distinct contributions for each text token based on its visual correlation.
We introduce Contrastive ALignment (CAL), a simple yet effective re-weighting strategy that prioritizes training visually correlated tokens.
arXiv Detail & Related papers (2024-05-28T06:44:13Z)
- A Layer-Wise Tokens-to-Token Transformer Network for Improved Historical Document Image Enhancement [13.27528507177775]
We propose T2T-BinFormer, a novel document binarization encoder-decoder architecture based on a Tokens-to-Token vision transformer.
Experiments on various DIBCO and H-DIBCO benchmarks demonstrate that the proposed model outperforms the existing CNN and ViT-based state-of-the-art methods.
arXiv Detail & Related papers (2023-12-06T23:01:11Z)
- Leveraging per Image-Token Consistency for Vision-Language Pre-training [52.825150269820696]
Cross-modal masked language modeling (CMLM) is insufficient for vision-language pre-training.
We propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training).
The proposed EPIC method is easily combined with pre-training methods.
arXiv Detail & Related papers (2022-11-20T12:10:53Z)
- TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers [8.099977107670917]
TokenMixup is an efficient attention-guided token-level data augmentation method.
A variant of TokenMixup mixes tokens within a single instance, thereby enabling multi-scale feature augmentation.
Experiments show that our methods significantly improve the baseline models' performance on CIFAR and ImageNet-1K.
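As a rough illustration of attention-guided token-level mixing, the snippet below combines two token sequences by keeping, at each position, the token from whichever instance carries the higher attention score, and sets the label coefficient to the fraction kept from the first instance. The selection rule and the way `lam` is derived are assumptions for the sketch, not the paper's exact formulation.

```python
import numpy as np

def attention_guided_token_mixup(tokens_a, tokens_b, attn_a, attn_b):
    """Hard token-level mixup: at each position keep the token whose
    instance has the higher attention score there. Returns the mixed
    sequence and lam, the fraction of positions taken from instance a,
    which can be used to interpolate the two labels."""
    mask = attn_a >= attn_b                      # True -> take token from a
    mixed = np.where(mask[:, None], tokens_a, tokens_b)
    lam = float(mask.mean())                     # label interpolation weight
    return mixed, lam
```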
arXiv Detail & Related papers (2022-10-14T06:36:31Z)
- SWAT: Spatial Structure Within and Among Tokens [53.525469741515884]
We argue that models can have significant gains when spatial structure is preserved during tokenization.
We propose two key contributions: (1) Structure-aware Tokenization and, (2) Structure-aware Mixing.
arXiv Detail & Related papers (2021-11-26T18:59:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.