Related papers: Visual Perception by Large Language Model's Weights

Visual Perception by Large Language Model's Weights

URL: http://arxiv.org/abs/2405.20339v1
Date: Thu, 30 May 2024 17:59:47 GMT
Title: Visual Perception by Large Language Model's Weights
Authors: Feipeng Ma, Hongwei Xue, Guangting Wang, Yizhou Zhou, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun,
Abstract summary: We propose a novel parameter space alignment paradigm that represents visual information as model weights. For each input image, we use a vision encoder to extract visual features, convert features into perceptual weights, and merge the perceptual weights with LLM's weights. In this way, the input of LLM does not require visual tokens, which reduces the length of the input sequence and greatly improves efficiency.
Score: 34.34876575183736
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing Multimodal Large Language Models (MLLMs) follow the paradigm that perceives visual information by aligning visual features with the input space of Large Language Models (LLMs), and concatenating visual tokens with text tokens to form a unified sequence input for LLMs. These methods demonstrate promising results on various vision-language tasks but are limited by the high computational effort due to the extended input sequence resulting from the involvement of visual tokens. In this paper, instead of input space alignment, we propose a novel parameter space alignment paradigm that represents visual information as model weights. For each input image, we use a vision encoder to extract visual features, convert features into perceptual weights, and merge the perceptual weights with LLM's weights. In this way, the input of LLM does not require visual tokens, which reduces the length of the input sequence and greatly improves efficiency. Following this paradigm, we propose VLoRA with the perceptual weights generator. The perceptual weights generator is designed to convert visual features to perceptual weights with low-rank property, exhibiting a form similar to LoRA. The experimental results show that our VLoRA achieves comparable performance on various benchmarks for MLLMs, while significantly reducing the computational costs for both training and inference. The code and models will be made open-source.

Related papers

Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance [67.26434607115392]
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks. LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension. We propose LACING to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG)
arXiv Detail & Related papers (2024-11-21T16:33:30Z)
Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. To reduce inference costs, one can either downsize the Large Language Models (LLMs) or reduce the number of input tokens needed to represent the image. We take the first steps toward designing token compression algorithms tailored for high-compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z)
PIP-MM: Pre-Integrating Prompt Information into Visual Encoding via Existing MLLM Structures [5.513631883813244]
We propose a framework that textbfPre-textbfIntegratestextbfPrompt information into the visual encoding process using existingmodules of MLLMs. Our model maintains excellent generation even when half of the visual tokens are reduced.
arXiv Detail & Related papers (2024-10-30T15:05:17Z)
Spatial-Aware Efficient Projector for MLLMs via Multi-Layer Feature Aggregation [10.468784974994465]
The projector plays a crucial role in multi-modal language models (MLLMs) Current explorations on the projector focus on reducing the number of visual tokens to improve efficiency. A Spatial-Aware Efficient Projector (SAEP) is proposed to address this issue.
arXiv Detail & Related papers (2024-10-14T09:25:09Z)
Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See [37.7015406019386]
Multimodal Large Language Models (MLLMs) treat visual tokens from visual encoders as text tokens. As token counts grow, the quadratic scaling of computation in LLMs introduces an efficiency bottleneck. In this study, we investigate the redundancy in visual computation at both the parameter and computational pattern levels within LLaVA.
arXiv Detail & Related papers (2024-10-08T16:13:24Z)
Balancing Performance and Efficiency: A Multimodal Large Language Model Pruning Method based Image Text Interaction [6.467840081978855]
multimodal large language models (MM-LLMs) have achieved great success in many multimodal tasks, but their high computational costs limit their further promotion and application. We studied the visual tokens of MM-LLMs and designed a dynamic pruning algorithm to address this issue. Our proposed method can achieve performance that competes with the original performance when using an average of 22% of the original token quantity.
arXiv Detail & Related papers (2024-09-02T10:49:10Z)
AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding [96.01726275876548]
We present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions. We devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images. Our model is capable of processing images with resolutions up to $1008times 1008$.
arXiv Detail & Related papers (2024-08-30T03:16:49Z)
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation [66.00245701441547]
We introduce a novel approach to reduce vision compute by leveraging redundant vision tokens "skipping layers" rather than decreasing the number of vision tokens. Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video.
arXiv Detail & Related papers (2024-08-29T17:21:58Z)
ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models [73.34709921061928]
We propose a training-free method to inject visual referring into Multimodal Large Language Models (MLLMs) We observe the relationship between text prompt tokens and visual tokens in MLLMs, where attention layers model the connection between them. We optimize a learnable visual token based on an energy function, enhancing the strength of referential regions in the attention map.
arXiv Detail & Related papers (2024-07-31T11:40:29Z)
Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets. However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs. This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
TokenPacker: Efficient Visual Projector for Multimodal LLM [37.1071749188282]
The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM) We propose a novel visual projector, which adopts a coarse-to-fine scheme to inject the enriched characteristics to generate the condensed visual tokens. Our approach compresses the visual tokens by 75%89%, while achieves comparable or even better performance across diverse benchmarks.
arXiv Detail & Related papers (2024-07-02T16:10:55Z)
Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning [59.13366859237086]
Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm. We consider visual prompts as additional knowledge that facilitates language models in addressing tasks associated with visual information. We introduce a novel approach, wherein visual prompts are memoryd with the weights of FFN for visual knowledge injection.
arXiv Detail & Related papers (2024-05-09T08:23:20Z)
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models [35.88374542519597]
Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model. Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which further increases the number of visual tokens significantly. We propose PruMerge, a novel adaptive visual token reduction strategy that significantly reduces the number of visual tokens without compromising the performance of LMMs.
arXiv Detail & Related papers (2024-03-22T17:59:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.