Activating Distributed Visual Region within LLMs for Efficient and Effective Vision-Language Training and Inference
- URL: http://arxiv.org/abs/2412.12785v1
- Date: Tue, 17 Dec 2024 10:44:47 GMT
- Title: Activating Distributed Visual Region within LLMs for Efficient and Effective Vision-Language Training and Inference
- Authors: Siyuan Wang, Dianyi Wang, Chengxing Zhou, Zejun Li, Zhihao Fan, Xuanjing Huang, Zhongyu Wei,
- Abstract summary: Large Vision-Language Models (LVLMs) typically learn visual capacity through visual instruction tuning.
We investigate the existence of an analogous textitvisual region within LLMs that functions as a cognitive core.
We propose a novel visual region-based pruning paradigm, removing non-critical layers outside the visual region, which can achieve minimal performance loss.
- Score: 46.00657360369715
- License:
- Abstract: Large Vision-Language Models (LVLMs) typically learn visual capacity through visual instruction tuning, involving updates to both a projector and their LLM backbones. Drawing inspiration from the concept of visual region in the human brain, we investigate the existence of an analogous \textit{visual region} within LLMs that functions as a cognitive core, and explore the possibility of efficient training of LVLMs via selective layers tuning. We use Bunny-Llama-3-8B-V for detailed experiments and LLaVA-1.5-7B and LLaVA-1.5-13B for validation across a range of visual and textual tasks. Our findings reveal that selectively updating 25\% of LLMs layers, when sparsely and uniformly distributed, can preserve nearly 99\% of visual performance while maintaining or enhancing textual task results, and also effectively reducing training time. Based on this targeted training approach, we further propose a novel visual region-based pruning paradigm, removing non-critical layers outside the visual region, which can achieve minimal performance loss. This study offers an effective and efficient strategy for LVLM training and inference by activating a layer-wise visual region within LLMs, which is consistently effective across different models and parameter scales.
Related papers
- Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction [52.09472099976885]
IAR is an Improved AutoRegressive Visual Generation Method.
We propose a Codebook Rearrangement strategy that uses balanced k-means clustering algorithm.
We also propose a Cluster-oriented Cross-entropy Loss that guides the model to correctly predict the cluster where the token is located.
arXiv Detail & Related papers (2025-01-01T15:58:51Z) - OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation [95.78870389271832]
The standard practice for developing contemporary MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision.
We propose OLA-VLM, the first approach distilling knowledge into the LLM's hidden representations from a set of target visual representations.
We show that OLA-VLM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench.
arXiv Detail & Related papers (2024-12-12T18:55:18Z) - AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning [19.68349294206012]
We propose a training-free adaptive inference method for multi-modal LLMs.
With a minimalist design, our method can be applied to both video and image LLMs.
Under a similar computational cost, our method outperforms the state-of-the-art methods in long video understanding.
arXiv Detail & Related papers (2024-12-04T11:47:57Z) - ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models [73.34709921061928]
We propose a training-free method to inject visual prompts into Multimodal Large Language Models (MLLMs)
We optimize a learnable latent variable based on an energy function, enhancing the strength of referring regions in the attention map.
Our method offers a promising direction for integrating referring abilities into MLLMs, and supports referring with box, mask, scribble and point.
arXiv Detail & Related papers (2024-07-31T11:40:29Z) - From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks [33.476693301050275]
We conduct experiments with truncation strategies across various LVLMs for visual question answering and image captioning tasks.
By exploring the information flow from the perspective of visual representation contribution, we observe that it tends to converge in shallow layers but diversify in deeper layers.
arXiv Detail & Related papers (2024-06-04T13:52:54Z) - ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains to be solved.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z) - MoE-LLaVA: Mixture of Experts for Large Vision-Language Models [49.32669226551026]
We propose a simple yet effective training strategy MoE-Tuning for LVLMs.
MoE-LLaVA, a MoE-based sparse LVLM architecture, uniquely activates only the top-k experts through routers.
Experiments show the significant performance of MoE-LLaVA in a variety of visual understanding and object hallucination benchmarks.
arXiv Detail & Related papers (2024-01-29T08:13:40Z) - InfMLLM: A Unified Framework for Visual-Language Tasks [44.29407348046122]
multimodal large language models (MLLMs) have attracted growing interest.
This work delves into enabling LLMs to tackle more vision-language-related tasks.
InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs.
arXiv Detail & Related papers (2023-11-12T09:58:16Z) - PerceptionGPT: Effectively Fusing Visual Perception into LLM [31.34127196055722]
The integration of visual inputs with large language models (LLMs) has led to remarkable advancements in multi-modal capabilities, giving rise to visual large language models (VLLMs)
We present a novel end-to-end framework named PerceptionGPT, which efficiently equips the VLLMs with visual perception abilities.
Our approach significantly alleviates the training difficulty suffered by previous approaches that formulate the visual outputs as discrete tokens.
arXiv Detail & Related papers (2023-11-11T16:59:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.