LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression
- URL: http://arxiv.org/abs/2406.20092v1
- Date: Fri, 28 Jun 2024 17:57:14 GMT
- Title: LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression
- Authors: Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, Alan Yuille,
- Abstract summary: We present the study on the analysis of redundancy concerning visual tokens and efficient training within large multi-language models.
Our initial experiments show that eliminating up to 70% of visual tokens at the testing stage by simply average pooling only leads to a minimal 3% reduction in visual question answering accuracy.
We introduce Visual Context, which reduces the number of visual tokens during training to enhance training efficiency without sacrificing performance.
- Score: 23.966237939194514
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While significant advancements have been made in compressed representations for text embeddings in large language models (LLMs), the compression of visual tokens in large multi-modal models (LMMs) has remained a largely overlooked area. In this work, we present the study on the analysis of redundancy concerning visual tokens and efficient training within these models. Our initial experiments show that eliminating up to 70% of visual tokens at the testing stage by simply average pooling only leads to a minimal 3% reduction in visual question answering accuracy on the GQA benchmark, indicating significant redundancy in visual context. Addressing this, we introduce Visual Context Compressor, which reduces the number of visual tokens during training to enhance training efficiency without sacrificing performance. To minimize information loss caused by the compression on visual tokens while maintaining training efficiency, we develop LLaVolta as a lite training scheme. LLaVolta incorporates stage-wise visual context compression to progressively compress the visual tokens from heavily to lightly, and finally no compression at the end of training, yielding no loss of information when testing. Extensive experiments demonstrate that our approach enhances the performance of MLLMs in both image-language and video-language understanding, while also significantly cutting training costs. Code is available at https://github.com/Beckschen/LLaVolta
Related papers
- VoCo-LLaMA: Towards Vision Compression with Large Language Models [56.20788367278211]
Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window.
We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs.
Our method achieves minimal performance loss with a compression ratio of 576$times$, resulting in up to 94.8$%$ fewer FLOPs and 69.6$%$ acceleration in inference time.
arXiv Detail & Related papers (2024-06-18T05:05:12Z) - Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning [78.19528555505961]
We propose a novel vision model pre-training method called Latent Compression Learning (LCL) for interleaved image-text data.
The training objective can be decomposed into two basic tasks: 1) contrastive learning between visual representation and preceding context, and 2) generating subsequent text based on visual representation.
Our experiments demonstrate that our method not only matches the performance of CLIP on paired pre-training datasets, but can also leverage interleaved pre-training data.
arXiv Detail & Related papers (2024-06-11T17:59:35Z) - Incorporating Visual Experts to Resolve the Information Loss in
Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism.
We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLMs training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z) - VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z) - PerceptionGPT: Effectively Fusing Visual Perception into LLM [31.34127196055722]
The integration of visual inputs with large language models (LLMs) has led to remarkable advancements in multi-modal capabilities, giving rise to visual large language models (VLLMs)
We present a novel end-to-end framework named PerceptionGPT, which efficiently equips the VLLMs with visual perception abilities.
Our approach significantly alleviates the training difficulty suffered by previous approaches that formulate the visual outputs as discrete tokens.
arXiv Detail & Related papers (2023-11-11T16:59:20Z) - Expedited Training of Visual Conditioned Language Generation via
Redundancy Reduction [61.16125290912494]
$textEVL_textGen$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z) - BUS:Efficient and Effective Vision-language Pre-training with Bottom-Up
Patch Summarization [89.52943129132217]
We propose a Bottom-Up Patch Summarization approach named BUS to learn a concise summary of lengthy visual token sequences efficiently.
We incorporate a Text-Semantics-Aware Patch Selector (TSPS) into the ViT backbone to perform a coarse-grained visual token extraction.
This bottom-up collaboration enables our BUS to yield high training efficiency while maintaining or even improving effectiveness.
arXiv Detail & Related papers (2023-07-17T14:08:17Z) - Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacking on top of frame-wise visual features.
arXiv Detail & Related papers (2021-12-08T18:58:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.