Related papers: Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding

Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding

URL: http://arxiv.org/abs/2407.14439v1
Date: Fri, 19 Jul 2024 16:11:15 GMT
Title: Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding
Authors: Renshan Zhang, Yibo Lyu, Rui Shao, Gongwei Chen, Weili Guan, Liqiang Nie,
Abstract summary: Most document understanding methods preserve all tokens within sub-images and treat them equally. This neglects their different informativeness and leads to a significant increase in the number of image tokens. We propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing.
Score: 54.532578213126065
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Cropping high-resolution document images into multiple sub-images is the most widely used approach for current Multimodal Large Language Models (MLLMs) to do document understanding. Most of current document understanding methods preserve all tokens within sub-images and treat them equally. This neglects their different informativeness and leads to a significant increase in the number of image tokens. To perform a more adaptive and efficient document understanding, we propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing. Firstly, we propose an innovative approach for assessing the pattern repetitiveness based on the correlation between each patch tokens. This method identifies redundant tokens, allowing for the determination of the sub-image's information density. Secondly, we present a token-level sampling method that efficiently captures the most informative tokens by delving into the correlation between the [CLS] token and patch tokens. By integrating these strategies, we develop a plug-and-play adaptive compressor module that can be seamlessly incorporated into MLLMs utilizing cropping techniques. This module not only enhances the processing speed during training and inference but also maintains comparable performance. We conduct experiments with the SOTA document understanding model mPLUG-DocOwl1.5 and the effectiveness is demonstrated through extensive comparisons with other compression methods.

Related papers

Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
Token compression is essential for reducing the computational and memory requirements of transformer models. We propose an efficient and hardware-compatible token compression method called Prune and Merge.
arXiv Detail & Related papers (2025-03-30T14:23:18Z)
Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models [81.74999702045339]
Multi-Level Optimal Transport (MultiLevelOT) is a novel approach that advances the optimal transport for universal cross-tokenizer knowledge distillation. Our method aligns the logit distributions of the teacher and the student at both token and sequence levels. At the token level, MultiLevelOT integrates both global and local information by jointly optimizing all tokens within a sequence to enhance robustness.
arXiv Detail & Related papers (2024-12-19T04:51:06Z)
Importance-Based Token Merging for Efficient Image and Video Generation [41.94334394794811]
We show that preserving high-information tokens during merging significantly improves sample quality. We propose an importance-based token merging method that prioritizes the most critical tokens in computational resource allocation.
arXiv Detail & Related papers (2024-11-23T02:01:49Z)
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding [103.05835688963947]
We propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens. DocOwl2 sets a new state-of-the-art across multi-page document understanding benchmarks and reduces first token latency by more than 50%. Compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens.
arXiv Detail & Related papers (2024-09-05T11:09:00Z)
Balancing Performance and Efficiency: A Multimodal Large Language Model Pruning Method based Image Text Interaction [6.467840081978855]
multimodal large language models (MM-LLMs) have achieved great success in many multimodal tasks, but their high computational costs limit their further promotion and application. We studied the visual tokens of MM-LLMs and designed a dynamic pruning algorithm to address this issue. Our proposed method can achieve performance that competes with the original performance when using an average of 22% of the original token quantity.
arXiv Detail & Related papers (2024-09-02T10:49:10Z)
Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment [40.63340635482609]
Existing image-text modality alignment in Vision Language Models (VLMs) treats each text token equally in an autoregressive manner. We advocate for assigning distinct contributions for each text token based on its visual correlation. We introduce Contrastive ALignment (CAL), a simple yet effective re-weighting strategy that prioritizes training visually correlated tokens.
arXiv Detail & Related papers (2024-05-28T06:44:13Z)
Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian recognition (PAR) algorithms are mainly developed based on a static image. We propose to understand human attributes using video frames that can fully use temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z)
Subobject-level Image Tokenization [60.80949852899857]
Patch-based image tokenization ignores the morphology of the visual world. Inspired by subword tokenization, we introduce subobject-level adaptive token segmentation. We show that subobject tokenization enables faster convergence and better generalization while using fewer visual tokens.
arXiv Detail & Related papers (2024-02-22T06:47:44Z)
Improving fine-grained understanding in image-text pre-training [37.163228122323865]
We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs. We show improved performance over competing approaches over both image-level tasks relying on coarse-grained information.
arXiv Detail & Related papers (2024-01-18T10:28:45Z)
A Layer-Wise Tokens-to-Token Transformer Network for Improved Historical Document Image Enhancement [13.27528507177775]
We propose textbfT2T-BinFormer which is a novel document binarization encoder-decoder architecture based on a Tokens-to-token vision transformer. Experiments on various DIBCO and H-DIBCO benchmarks demonstrate that the proposed model outperforms the existing CNN and ViT-based state-of-the-art methods.
arXiv Detail & Related papers (2023-12-06T23:01:11Z)
MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks. We propose a single-stage and standalone method, MOCA, which unifies both desired properties. We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
Leveraging per Image-Token Consistency for Vision-Language Pre-training [52.825150269820696]
Cross-modal masked language modeling (CMLM) is insufficient for vision-language pre-training. We propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training) The proposed EPIC method is easily combined with pre-training methods.
arXiv Detail & Related papers (2022-11-20T12:10:53Z)
TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers [8.099977107670917]
TokenMixup is an efficient attention-guided token-level data augmentation method. A variant of TokenMixup mixes tokens within a single instance, thereby enabling multi-scale feature augmentation. Experiments show that our methods significantly improve the baseline models' performance on CIFAR and ImageNet-1K.
arXiv Detail & Related papers (2022-10-14T06:36:31Z)
SWAT: Spatial Structure Within and Among Tokens [53.525469741515884]
We argue that models can have significant gains when spatial structure is preserved during tokenization. We propose two key contributions: (1) Structure-aware Tokenization and, (2) Structure-aware Mixing.
arXiv Detail & Related papers (2021-11-26T18:59:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.