Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large
Language Models
- URL: http://arxiv.org/abs/2403.03003v1
- Date: Tue, 5 Mar 2024 14:31:24 GMT
- Title: Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large
Language Models
- Authors: Gen Luo, Yiyi Zhou, Yuxin Zhang, Xiawu Zheng, Xiaoshuai Sun, Rongrong
Ji
- Abstract summary: We propose a novel and efficient method for MLLMs, termed Mixture-of-Resolution Adaptation (MRA).
MRA adopts two visual pathways for images with different resolutions, where high-resolution visual information is embedded into the low-resolution pathway.
To validate MRA, we apply it to a recent MLLM called LLaVA, and term the new model LLaVA-HR.
- Score: 84.78513908768011
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite remarkable progress, existing multimodal large language models
(MLLMs) are still inferior in granular visual recognition. Contrary to previous
works, we study this problem from the perspective of image resolution, and
reveal that a combination of low- and high-resolution visual features can
effectively mitigate this shortcoming. Based on this observation, we propose a
novel and efficient method for MLLMs, termed Mixture-of-Resolution Adaptation
(MRA). In particular, MRA adopts two visual pathways for images with different
resolutions, where high-resolution visual information is embedded into the
low-resolution pathway via the novel mixture-of-resolution adapters
(MR-Adapters). This design also greatly reduces the input sequence length of
MLLMs. To validate MRA, we apply it to a recent MLLM called LLaVA, and term the
new model LLaVA-HR. We conduct extensive experiments on 11 vision-language (VL)
tasks, which show that LLaVA-HR outperforms existing MLLMs on 8 VL tasks, e.g.,
+9.4% on TextVQA. More importantly, both training and inference of LLaVA-HR
remain efficient with MRA, e.g., 20 training hours and 3$\times$ faster
inference than LLaVA-1.5. Source code is released at:
https://github.com/luogen1996/LLaVA-HR.
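The abstract above describes two visual pathways whose features are fused by MR-Adapters while keeping the short token sequence of the low-resolution pathway. Below is a minimal, hypothetical PyTorch sketch of that idea; the module names, tensor shapes, and gated fusion rule are assumptions for illustration, not the authors' released LLaVA-HR code (see the repository linked above).

```python
# Hypothetical sketch of a mixture-of-resolution adapter (MR-Adapter).
# Shapes, module names, and the gated fusion rule are illustrative assumptions,
# not the released LLaVA-HR implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MRAdapter(nn.Module):
    """Embeds high-resolution features into the low-resolution visual pathway."""

    def __init__(self, low_dim: int, high_dim: int):
        super().__init__()
        # Project high-res features to the low-res channel width.
        self.proj = nn.Conv2d(high_dim, low_dim, kernel_size=1)
        # Learnable gate deciding how much high-res detail to mix in.
        self.gate = nn.Sequential(nn.Linear(2 * low_dim, low_dim), nn.Sigmoid())

    def forward(self, low_feat: torch.Tensor, high_feat: torch.Tensor) -> torch.Tensor:
        # low_feat:  (B, C_low,  h, w)  from the low-resolution pathway
        # high_feat: (B, C_high, H, W)  from the high-resolution pathway (H > h, W > w)
        high = self.proj(high_feat)
        # Pool high-res features onto the low-res grid so the fused tokens
        # keep the (short) sequence length of the low-resolution pathway.
        high = F.adaptive_avg_pool2d(high, low_feat.shape[-2:])
        pooled = torch.cat([low_feat.mean(dim=(2, 3)), high.mean(dim=(2, 3))], dim=-1)
        gate = self.gate(pooled)[:, :, None, None]  # (B, C_low, 1, 1)
        return low_feat + gate * high


# Example: fuse a 24x24 low-resolution grid with a 48x48 high-resolution grid.
adapter = MRAdapter(low_dim=1024, high_dim=768)
low = torch.randn(1, 1024, 24, 24)
high = torch.randn(1, 768, 48, 48)
fused = adapter(low, high)  # (1, 1024, 24, 24): sequence length is unchanged
```

In a full MLLM the fused map would then be flattened into visual tokens for the language model, which is why keeping the low-resolution grid also shortens the input sequence.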
Related papers
- LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval [14.136397687227111]
We propose the Large Language-and-Vision Assistant for Moment Retrieval (LLaVA-MR).
LLaVA-MR enables accurate moment retrieval and contextual grounding in videos using Multimodal Large Language Models (MLLMs).
Evaluations on benchmarks like Charades-STA and QVHighlights demonstrate that LLaVA-MR outperforms 11 state-of-the-art methods.
arXiv Detail & Related papers (2024-11-21T09:34:23Z)
- LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [70.19607283302712]
We propose a novel framework to transfer knowledge from l-MLLM to s-MLLM.
Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM (a minimal loss sketch of this idea follows the related-papers list below).
We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
arXiv Detail & Related papers (2024-10-21T17:41:28Z)
- Rethinking VLMs and LLMs for Image Classification [6.550471260627169]
Large Language Models (LLMs) are increasingly being merged with Visual Language Models (VLMs) to enable new capabilities.
We show that, for object and scene recognition, VLMs that do not leverage LLMs can achieve better performance than VLMs that do.
We propose a pragmatic solution: a lightweight fix involving a relatively small LLM that efficiently routes visual tasks to the most suitable model for the task.
arXiv Detail & Related papers (2024-10-03T23:40:21Z)
- LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation [41.05687297326706]
LLaVA-MoD is a framework designed to enable the efficient training of small-scale Multimodal Language Models.
We optimize the network structure of s-MLLM by integrating a sparse Mixture of Experts architecture into the language model.
We also propose a progressive knowledge transfer strategy to ensure comprehensive knowledge migration.
arXiv Detail & Related papers (2024-08-28T15:52:23Z)
- Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models [57.280853324896306]
Multimodal large language models (MLLMs) struggle to recognize and interpret intricate details in high-resolution (HR) images effectively.
We introduce HR-Bench, the first deliberately designed benchmark to rigorously evaluate MLLM performance on 4K&8K images.
We propose Divide, Conquer and Combine (DC$^2$), a novel training-free framework for enhancing MLLM perception of HR images.
arXiv Detail & Related papers (2024-08-28T06:09:02Z)
- Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z)
- Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study [32.57246173437492]
This paper presents an empirical study on enhancing MLLMs with state-of-the-art (SOTA) object detection and Optical Character Recognition (OCR) models to improve fine-grained understanding and reduce hallucination in responses.
We conduct systematic and extensive experiments with representative models such as LLaVA-1.5, DINO, PaddleOCRv2, and Grounding DINO.
Notably, the enhanced LLaVA-1.5 outperforms its original 7B/13B models on all 10 benchmarks, achieving an improvement of up to 12.5% on the normalized average score.
arXiv Detail & Related papers (2024-01-31T16:38:32Z)
- Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models [77.2078051555533]
We propose a novel and affordable solution for the effective VL adaptation of large language models (LLMs).
Instead of using large neural networks to connect the image encoder and LLM, MMA adopts lightweight modules, i.e., adapters.
MMA is also equipped with a routing algorithm to help LLMs achieve an automatic shift between single- and multi-modal instructions.
arXiv Detail & Related papers (2023-05-24T11:06:15Z)
- LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model [60.22693761583569]
We present LLaMA-Adapter V2, a parameter-efficient visual instruction model.
Specifically, we first augment LLaMA-Adapter by unlocking more learnable parameters.
In addition, a joint training paradigm of image-text pairs and instruction-following data is introduced.
arXiv Detail & Related papers (2023-04-28T17:59:25Z)
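As referenced in the LLaVA-KD entry above, distilling a large teacher MLLM (l-MLLM) into a small student (s-MLLM) by matching output distributions can be sketched as a temperature-scaled KL term. The sketch below is a hedged illustration: the temperature, KL direction, and reduction are assumptions, not that paper's exact recipe.

```python
# Hypothetical sketch of output-distribution distillation in the spirit of
# MDist (LLaVA-KD): pull the student's next-token distribution toward the
# teacher's. Temperature, KL direction, and reduction are assumptions.
import torch
import torch.nn.functional as F


def mdist_loss(student_logits: torch.Tensor,
               teacher_logits: torch.Tensor,
               temperature: float = 2.0) -> torch.Tensor:
    # Both logits: (batch, seq_len, vocab) over the visual-textual output tokens.
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_p = F.softmax(teacher_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 as is conventional in distillation.
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * (t * t)


# Example usage with dummy logits (the teacher would normally be frozen).
student = torch.randn(2, 16, 32000)
teacher = torch.randn(2, 16, 32000)
loss = mdist_loss(student, teacher.detach())
```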