Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large
Language Models
- URL: http://arxiv.org/abs/2403.03003v1
- Date: Tue, 5 Mar 2024 14:31:24 GMT
- Title: Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large
Language Models
- Authors: Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, Rongrong
Ji
- Abstract summary: We propose a novel and efficient method for MLLMs, termed Mixture-of-Resolution Adaptation (MRA).
MRA adopts two visual pathways for images with different resolutions, where high-resolution visual information is embedded into the low-resolution pathway.
To validate MRA, we apply it to a recent MLLM called LLaVA, and term the new model LLaVA-HR.
- Score: 84.78513908768011
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite remarkable progress, existing multimodal large language models
(MLLMs) are still inferior in granular visual recognition. Contrary to previous
works, we study this problem from the perspective of image resolution, and
reveal that a combination of low- and high-resolution visual features can
effectively mitigate this shortcoming. Based on this observation, we propose a
novel and efficient method for MLLMs, termed Mixture-of-Resolution Adaptation
(MRA). In particular, MRA adopts two visual pathways for images with different
resolutions, where high-resolution visual information is embedded into the
low-resolution pathway via the novel mixture-of-resolution adapters
(MR-Adapters). This design also greatly reduces the input sequence length of
MLLMs. To validate MRA, we apply it to a recent MLLM called LLaVA, and term the
new model LLaVA-HR. We conduct extensive experiments on 11 vision-language (VL)
tasks, which show that LLaVA-HR outperforms existing MLLMs on 8 VL tasks, e.g.,
+9.4% on TextVQA. More importantly, both training and inference of LLaVA-HR
remain efficient with MRA, e.g., 20 training hours and 3$\times$ faster
inference than LLaVA-1.5. Source code is released at:
https://github.com/luogen1996/LLaVA-HR.
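The dual-pathway idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`avg_pool_to_grid`, `mr_adapter_sketch`), the average pooling, the linear projection, and the tanh gate are all assumptions chosen for illustration. It shows only the general pattern of folding high-resolution features into the low-resolution token grid, so the number of visual tokens passed to the LLM stays at the low-resolution count.

```python
import numpy as np

def avg_pool_to_grid(feat, out_hw):
    """Average-pool an (H, W, C) feature map down to (out_h, out_w, C)."""
    H, W, C = feat.shape
    oh, ow = out_hw
    assert H % oh == 0 and W % ow == 0, "grid sizes must divide evenly"
    return feat.reshape(oh, H // oh, ow, W // ow, C).mean(axis=(1, 3))

def mr_adapter_sketch(f_low, f_high, w_high, w_gate):
    """Hypothetical gated fusion of high-res features into the low-res grid.

    f_low:  (h, w, d) features from the low-resolution pathway
    f_high: (H, W, c) features from the high-resolution pathway
    w_high: (c, d)    projection of high-res channels to width d (assumed)
    w_gate: (2d, d)   weights of a channel-wise tanh gate (assumed)
    """
    pooled = avg_pool_to_grid(f_high, f_low.shape[:2])   # (h, w, c)
    projected = pooled @ w_high                           # (h, w, d)
    # Global descriptor built from both pathways drives the gate.
    desc = np.concatenate([f_low.mean(axis=(0, 1)),
                           projected.mean(axis=(0, 1))])  # (2d,)
    gate = np.tanh(desc @ w_gate)                         # (d,)
    # Residual fusion: the output keeps the low-res grid shape.
    return f_low + gate * projected

rng = np.random.default_rng(0)
f_low = rng.normal(size=(24, 24, 64))    # 24 x 24 = 576 low-res tokens
f_high = rng.normal(size=(48, 48, 32))   # finer grid from the high-res pathway
w_high = rng.normal(size=(32, 64)) * 0.1
w_gate = rng.normal(size=(128, 64)) * 0.1

fused = mr_adapter_sketch(f_low, f_high, w_high, w_gate)
print(fused.shape)
```

In this sketch the fused output has the same 24 x 24 grid as the low-resolution input, so the high-resolution pathway adds detail without adding tokens to the sequence.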
Related papers
- OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation [95.78870389271832]
The standard practice for developing contemporary MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision.
We propose OLA-VLM, the first approach distilling knowledge into the LLM's hidden representations from a set of target visual representations.
We show that OLA-VLM boosts performance by an average margin of up to 2.5% on various benchmarks, with a notable improvement of 8.7% on the Depth task in CV-Bench.
arXiv Detail & Related papers (2024-12-12T18:55:18Z)
- AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning [19.68349294206012]
We propose a training-free adaptive inference method for multi-modal LLMs.
With a minimalist design, our method can be applied to both video and image LLMs.
Under a similar computational cost, our method outperforms the state-of-the-art methods in long video understanding.
arXiv Detail & Related papers (2024-12-04T11:47:57Z)
- LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval [14.136397687227111]
We propose the Large Language-and-Vision Assistant for Moment Retrieval (LLaVA-MR).
LLaVA-MR enables accurate moment retrieval and contextual grounding in videos using Multimodal Large Language Models (MLLMs).
Evaluations on benchmarks like Charades-STA and QVHighlights demonstrate that LLaVA-MR outperforms 11 state-of-the-art methods.
arXiv Detail & Related papers (2024-11-21T09:34:23Z)
- LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [70.19607283302712]
We propose a novel framework to transfer knowledge from a large-scale MLLM (l-MLLM) to a small-scale MLLM (s-MLLM).
Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM.
We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
arXiv Detail & Related papers (2024-10-21T17:41:28Z)
- Rethinking VLMs and LLMs for Image Classification [6.550471260627169]
Large Language Models (LLMs) are increasingly being merged with Visual Language Models (VLMs) to enable new capabilities.
We show that, for object and scene recognition, VLMs that do not leverage LLMs can achieve better performance than VLMs that do.
We propose a pragmatic solution: a lightweight fix involving a relatively small LLM that efficiently routes visual tasks to the most suitable model for the task.
arXiv Detail & Related papers (2024-10-03T23:40:21Z)
- LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation [41.05687297326706]
LLaVA-MoD is a framework designed to enable the efficient training of small-scale Multimodal Language Models.
We optimize the network structure of s-MLLM by integrating a sparse Mixture of Experts architecture into the language model.
We also propose a progressive knowledge transfer strategy to ensure comprehensive knowledge migration.
arXiv Detail & Related papers (2024-08-28T15:52:23Z)
- Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z)
- Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models [77.2078051555533]
We propose a novel and affordable solution for the effective VL adaption of large language models (LLMs), called Mixture-of-Modality Adaptation (MMA).
Instead of using large neural networks to connect the image encoder and LLM, MMA adopts lightweight modules, i.e., adapters.
MMA is also equipped with a routing algorithm to help LLMs achieve an automatic shift between single- and multi-modal instructions.
arXiv Detail & Related papers (2023-05-24T11:06:15Z)
- LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model [60.22693761583569]
We present LLaMA-Adapter V2, a parameter-efficient visual instruction model.
Specifically, we first augment LLaMA-Adapter by unlocking more learnable parameters.
In addition, a joint training paradigm of image-text pairs and instruction-following data is introduced.
arXiv Detail & Related papers (2023-04-28T17:59:25Z)