Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models
- URL: http://arxiv.org/abs/2408.15556v1
- Date: Wed, 28 Aug 2024 06:09:02 GMT
- Title: Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models
- Authors: Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Dacheng Tao
- Abstract summary: Multimodal large language models (MLLMs) struggle to recognize and interpret intricate details in high-resolution (HR) images effectively.
We introduce HR-Bench, the first deliberately designed benchmark to rigorously evaluate MLLM performance on 4K&8K images.
We propose Divide, Conquer and Combine (DC$^2$), a novel training-free framework for enhancing MLLM perception of HR images.
- Score: 57.280853324896306
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large language models (MLLMs) have experienced significant advancements recently, but still struggle to recognize and interpret intricate details in high-resolution (HR) images effectively. While state-of-the-art (SOTA) MLLMs claim to process images at 4K resolution, existing MLLM benchmarks only support up to 2K, leaving the capabilities of SOTA models on true HR images largely untested. Furthermore, existing methods for enhancing HR image perception in MLLMs rely on computationally expensive visual instruction tuning. To address these limitations, we introduce HR-Bench, the first deliberately designed benchmark to rigorously evaluate MLLM performance on 4K&8K images. Through extensive experiments, we demonstrate that while downsampling HR images leads to vision information loss, leveraging complementary modalities, e.g., text, can effectively compensate for this loss. Building upon this insight, we propose Divide, Conquer and Combine (DC$^2$), a novel training-free framework for enhancing MLLM perception of HR images. DC$^2$ follows a three-staged approach: 1) Divide: recursively partitioning the HR image into patches and merging similar patches to minimize computational overhead, 2) Conquer: leveraging the MLLM to generate accurate textual descriptions for each image patch, and 3) Combine: utilizing the generated text descriptions to enhance the MLLM's understanding of the overall HR image. Extensive experiments show that: 1) the SOTA MLLM achieves 63% accuracy, which is markedly lower than the 87% accuracy achieved by humans on HR-Bench; 2) our DC$^2$ brings consistent and significant improvements (a relative increase of +6% on HR-Bench and +8% on general multimodal benchmarks). The benchmark and code will be released to facilitate the multimodal R&D community.
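A minimal, training-free sketch of the three-stage pipeline described in the abstract is given below. The `mllm_describe`/`mllm_answer` callables, the 448-pixel patch size, and the thumbnail-similarity test for merging patches are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def divide(image, min_size=448):
    """Recursively split an HR image (a PIL.Image) into quadrants until
    each patch fits an assumed 448-px native MLLM input size."""
    w, h = image.size
    if max(w, h) <= min_size:
        return [image]
    patches = []
    for top in (0, h // 2):
        for left in (0, w // 2):
            patch = image.crop((left, top, left + w // 2, top + h // 2))
            patches.extend(divide(patch, min_size))
    return patches

def merge_similar(patches, threshold=0.95):
    """Drop near-duplicate patches (e.g., uniform sky) to reduce the number
    of MLLM calls; cosine similarity of 8x8 thumbnails is a crude stand-in
    for the paper's patch-merging criterion."""
    kept, signatures = [], []
    for patch in patches:
        sig = np.asarray(patch.resize((8, 8))).astype(float).ravel()
        sig /= np.linalg.norm(sig) + 1e-8
        if all(float(sig @ s) < threshold for s in signatures):
            kept.append(patch)
            signatures.append(sig)
    return kept

def dc2(image, question, mllm_describe, mllm_answer):
    """Divide -> Conquer -> Combine for one HR image and one question."""
    patches = merge_similar(divide(image))                      # Divide
    captions = [mllm_describe(p) for p in patches]              # Conquer
    return mllm_answer(image, question, "\n".join(captions))    # Combine
```

Plugging in a real MLLM only requires two wrappers: one that captions a patch and one that answers the question given the downsampled image plus the aggregated patch captions as extra textual context.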
Related papers
- MoTaDual: Modality-Task Dual Alignment for Enhanced Zero-shot Composed Image Retrieval [20.612534837883892]
Composed Image Retrieval (CIR) is a challenging vision-language task, utilizing bi-modal (image+text) queries to retrieve target images.
In this paper, we propose a two-stage framework to tackle both the modality and task discrepancies.
MoTaDual achieves the state-of-the-art performance across four widely used ZS-CIR benchmarks, while maintaining low training time and computational cost.
arXiv Detail & Related papers (2024-10-31T08:49:05Z)
- Large Language Models for Multimodal Deformable Image Registration [50.91473745610945]
We propose a novel coarse-to-fine MDIR framework, LLM-Morph, for aligning the deep features from different modal medical images.
Specifically, we first utilize a CNN encoder to extract deep visual features from cross-modal image pairs, then use the first adapter to adjust these tokens and LoRA in pre-trained LLMs to fine-tune their weights.
Third, for the alignment of tokens, we utilize four other adapters to transform the LLM-encoded tokens into multi-scale visual features, generating multi-scale deformation fields and facilitating the coarse-to-fine MDIR task.
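For the pipeline summarized above, a heavily simplified PyTorch module may help fix ideas; the layer sizes, the frozen `TransformerEncoderLayer` standing in for the pre-trained LLM, and the omission of LoRA are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLLMMorph(nn.Module):
    """Schematic coarse-to-fine registration head: CNN features ->
    input adapter -> frozen transformer ("LLM") -> per-scale output
    adapters producing 2-D deformation fields."""
    def __init__(self, dim=256, scales=2):
        super().__init__()
        # CNN encoder over the concatenated (single-channel) moving/fixed pair
        self.encoder = nn.Sequential(
            nn.Conv2d(2, dim, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.GELU(),
        )
        # "first adapter": projects visual tokens into the LLM token space
        self.in_adapter = nn.Linear(dim, dim)
        # frozen transformer block standing in for the pre-trained LLM
        # (the paper fine-tunes a real LLM with LoRA instead)
        self.llm = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        for p in self.llm.parameters():
            p.requires_grad = False
        # per-scale adapters mapping LLM tokens back to (dx, dy) fields
        self.out_adapters = nn.ModuleList(nn.Linear(dim, 2) for _ in range(scales))

    def forward(self, moving, fixed):
        feat = self.encoder(torch.cat([moving, fixed], dim=1))     # B, C, h, w
        b, c, h, w = feat.shape
        tokens = self.in_adapter(feat.flatten(2).transpose(1, 2))  # B, h*w, C
        tokens = self.llm(tokens)
        # one deformation field per scale, upsampled coarse-to-fine
        return [
            F.interpolate(adapter(tokens).transpose(1, 2).reshape(b, 2, h, w),
                          scale_factor=2 ** i, mode="bilinear", align_corners=False)
            for i, adapter in enumerate(self.out_adapters)
        ]
```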
arXiv Detail & Related papers (2024-08-20T09:58:30Z)
- II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models [49.070801221350486]
Rapid advances in multimodal large language models (MLLMs) have consistently led to new breakthroughs on various benchmarks.
We propose the Image Implication understanding Benchmark, II-Bench, which aims to evaluate the model's higher-order perception of images.
arXiv Detail & Related papers (2024-06-09T17:25:47Z)
- Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models [84.78513908768011]
We propose a novel and efficient method for MLLMs, termed Mixture-of-Resolution Adaptation (MRA).
MRA adopts two visual pathways for images with different resolutions, where high-resolution visual information is embedded into the low-resolution pathway.
To validate MRA, we apply it to a recent MLLM called LLaVA, and term the new model LLaVA-HR.
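As a rough illustration of the two-pathway idea described above (not LLaVA-HR's actual modules), the stub below pools high-resolution features onto the low-resolution grid and adds them into the low-resolution pathway; the patch-embedding convolutions, input sizes, and additive fusion are assumptions for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfResolutionStub(nn.Module):
    """Two visual pathways; HR features are pooled to the LR grid and
    injected (added) into the LR pathway."""
    def __init__(self, dim=64):
        super().__init__()
        self.low_path = nn.Conv2d(3, dim, kernel_size=14, stride=14)   # e.g. 336-px input
        self.high_path = nn.Conv2d(3, dim, kernel_size=28, stride=28)  # e.g. 1344-px input
        self.inject = nn.Conv2d(dim, dim, kernel_size=1)               # fusion adapter

    def forward(self, img_low, img_high):
        low = self.low_path(img_low)                        # B, dim, 24, 24
        high = self.high_path(img_high)                     # B, dim, 48, 48
        high = F.adaptive_avg_pool2d(high, low.shape[-2:])  # match the LR grid
        return low + self.inject(high)                      # fused LR visual tokens
```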
arXiv Detail & Related papers (2024-03-05T14:31:24Z)
- Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short in comprehending context that involves multiple images.
We propose a two-phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
arXiv Detail & Related papers (2024-02-19T14:59:07Z) - The Instinctive Bias: Spurious Images lead to Illusion in MLLMs [34.91795817316696]
We identify a typical class of inputs that baffles MLLMs, which consists of images that are highly relevant to but inconsistent with the answers.
We propose CorrelationQA, the first benchmark that assesses the visual illusion level given spurious images.
We conduct a thorough analysis on 9 mainstream MLLMs, illustrating that they universally suffer from this instinctive bias to varying degrees.
arXiv Detail & Related papers (2024-02-06T06:48:46Z) - Pink: Unveiling the Power of Referential Comprehension for Multi-modal
LLMs [49.88461345825586]
This paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs.
We present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets.
We show that our model exhibits a 5.2% accuracy improvement over Qwen-VL and surpasses the accuracy of Kosmos-2 by 24.7%.
arXiv Detail & Related papers (2023-10-01T05:53:15Z) - HiLM-D: Towards High-Resolution Understanding in Multimodal Large
Language Models for Autonomous Driving [47.274696401306514]
HiLM-D is an efficient method to incorporate HR information into MLLMs for the ROLISP (risk object localization and intention and suggestion prediction) task.
Our experiments reveal HiLM-D's notable advantage over leading MLLMs, with improvements of 4.8% in BLEU-4 for captioning and 17.2% in mIoU for detection.
arXiv Detail & Related papers (2023-09-11T01:24:13Z)