Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large
Language Models
- URL: http://arxiv.org/abs/2403.03003v1
- Date: Tue, 5 Mar 2024 14:31:24 GMT
- Title: Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large
Language Models
- Authors: Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, Rongrong
Ji
- Abstract summary: We propose a novel and efficient method for MLLMs, termed Mixture-of-Resolution Adaptation (MRA).
MRA adopts two visual pathways for images with different resolutions, where high-resolution visual information is embedded into the low-resolution pathway.
To validate MRA, we apply it to a recent MLLM called LLaVA, and term the new model LLaVA-HR.
- Score: 84.78513908768011
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite remarkable progress, existing multimodal large language models
(MLLMs) are still inferior in granular visual recognition. Contrary to previous
works, we study this problem from the perspective of image resolution, and
reveal that a combination of low- and high-resolution visual features can
effectively mitigate this shortcoming. Based on this observation, we propose a
novel and efficient method for MLLMs, termed Mixture-of-Resolution Adaptation
(MRA). In particular, MRA adopts two visual pathways for images with different
resolutions, where high-resolution visual information is embedded into the
low-resolution pathway via the novel mixture-of-resolution adapters
(MR-Adapters). This design also greatly reduces the input sequence length of
MLLMs. To validate MRA, we apply it to a recent MLLM called LLaVA, and term the
new model LLaVA-HR. We conduct extensive experiments on 11 vision-language (VL)
tasks, which show that LLaVA-HR outperforms existing MLLMs on 8 VL tasks, e.g.,
+9.4% on TextVQA. More importantly, both training and inference of LLaVA-HR
remain efficient with MRA, e.g., 20 training hours and 3$\times$ faster inference
than LLaVA-1.5. Source code is released at:
https://github.com/luogen1996/LLaVA-HR.
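To make the dual-pathway idea in the abstract concrete, the sketch below is a minimal, hypothetical take on a mixture-of-resolution adapter: a high-resolution feature map is projected and pooled onto the low-resolution token grid and added back through a learnable gate, so the LLM still receives only the short low-resolution token sequence. The module names, shapes, and fusion rule are illustrative assumptions, not the paper's exact design.

```python
# Hypothetical sketch of the mixture-of-resolution idea: high-res features are
# injected into the low-res pathway by a lightweight adapter, keeping the
# visual token sequence seen by the LLM short. Sizes and the fusion rule are
# assumptions for illustration only.
import torch
import torch.nn as nn


class MRAdapterSketch(nn.Module):
    """Fuses a high-res feature map into low-res visual tokens (assumed form)."""

    def __init__(self, hi_dim: int, lo_dim: int):
        super().__init__()
        self.proj = nn.Conv2d(hi_dim, lo_dim, kernel_size=1)  # align channel dims
        self.gate = nn.Parameter(torch.zeros(1))               # learnable mixing gate

    def forward(self, lo_tokens: torch.Tensor, hi_feat: torch.Tensor) -> torch.Tensor:
        # lo_tokens: (B, N_lo, C_lo) tokens from the low-resolution pathway
        # hi_feat:   (B, C_hi, H, W) feature map from the high-resolution pathway
        B, N_lo, C_lo = lo_tokens.shape
        side = int(N_lo ** 0.5)                                # assume a square token grid
        hi = self.proj(hi_feat)                                # (B, C_lo, H, W)
        hi = nn.functional.adaptive_avg_pool2d(hi, (side, side))
        hi = hi.flatten(2).transpose(1, 2)                     # (B, N_lo, C_lo)
        # Inject high-res detail without lengthening the sequence the LLM sees.
        return lo_tokens + torch.tanh(self.gate) * hi


# Toy usage: 576 low-res tokens remain 576 tokens after fusion.
lo = torch.randn(1, 576, 1024)      # low-res pathway output (e.g. ViT-style tokens)
hi = torch.randn(1, 256, 48, 48)    # high-res pathway feature map (e.g. a conv stage)
print(MRAdapterSketch(256, 1024)(lo, hi).shape)  # torch.Size([1, 576, 1024])
```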
Related papers
- Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models [79.46938238953916]
Fine-tuning large language models (LLMs) to diverse applications is crucial to meet complex demands.
Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs.
In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs.
arXiv Detail & Related papers (2024-06-13T07:57:27Z)
- Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z)
- An Empirical Study of LLaMA3 Quantization: From LLMs to MLLMs [54.91212829143966]
This study explores LLaMA3's capabilities when quantized to low bit-width.
We evaluate 10 existing post-training quantization and LoRA-finetuning methods of LLaMA3 on 1-8 bits and diverse datasets.
Our experimental results indicate that LLaMA3 still suffers non-negligible degradation in linguistic and visual contexts.
arXiv Detail & Related papers (2024-04-22T10:03:03Z)
- Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study [32.57246173437492]
This paper presents an empirical study on enhancing MLLMs with state-of-the-art (SOTA) object detection and Optical Character Recognition (OCR) models to improve fine-grained understanding and reduce hallucination in responses.
We conduct systematic and extensive experiments with representative models such as LLaVA-1.5, DINO, PaddleOCRv2, and Grounding DINO.
Notably, the enhanced LLaVA-1.5 outperforms its original 7B/13B models on all 10 benchmarks, achieving an improvement of up to 12.5% on the normalized average score.
arXiv Detail & Related papers (2024-01-31T16:38:32Z)
- HiLM-D: Towards High-Resolution Understanding in Multimodal Large Language Models for Autonomous Driving [47.274696401306514]
HiLM-D is an efficient method to incorporate HR information into MLLMs for the ROLISP task.
Our experiments reveal HiLM-D's notable advantage over leading MLLMs, with improvements of 4.8% in BLEU-4 for captioning and 17.2% in mIoU for detection.
arXiv Detail & Related papers (2023-09-11T01:24:13Z)
- Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models [77.2078051555533]
We propose a novel and affordable solution for the effective VL adaptation of large language models (LLMs).
Instead of using large neural networks to connect the image encoder and LLM, MMA adopts lightweight modules, i.e., adapters.
MMA is also equipped with a routing algorithm to help LLMs achieve an automatic shift between single- and multi-modal instructions.
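As a rough illustration of the adapter-plus-routing idea in this summary, the sketch below pairs two lightweight bottleneck adapters with a small router that softly shifts between a single-modal and a multi-modal path. The module names and the routing rule are assumptions for illustration, not MMA's actual implementation.

```python
# Hedged sketch: lightweight adapters instead of a large connector network,
# plus a router that mixes single-modal and multi-modal adapter paths.
# Names and routing rule are illustrative assumptions.
import torch
import torch.nn as nn


class RoutedAdapterSketch(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        # Two lightweight bottleneck adapters (assumed structure).
        self.text_adapter = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))
        self.mm_adapter = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))
        # Router producing a soft weight from the mean-pooled hidden state.
        self.router = nn.Linear(dim, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, D) hidden states inside an LLM block.
        w = torch.sigmoid(self.router(hidden.mean(dim=1, keepdim=True)))  # (B, 1, 1)
        # Soft shift between the single-modal and multi-modal adapter paths.
        return hidden + (1 - w) * self.text_adapter(hidden) + w * self.mm_adapter(hidden)


# Toy usage: shapes are preserved, only a small residual update is added.
x = torch.randn(2, 128, 4096)
print(RoutedAdapterSketch(4096)(x).shape)  # torch.Size([2, 128, 4096])
```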
arXiv Detail & Related papers (2023-05-24T11:06:15Z)
- LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model [60.22693761583569]
We present LLaMA-Adapter V2, a parameter-efficient visual instruction model.
Specifically, we first augment LLaMA-Adapter by unlocking more learnable parameters.
A joint training paradigm of image-text pairs and instruction-following data is also introduced.
arXiv Detail & Related papers (2023-04-28T17:59:25Z)