Related papers: ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration

URL: http://arxiv.org/abs/2411.16044v1
Date: Mon, 25 Nov 2024 02:15:30 GMT
Title: ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
Authors: Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, Jianwei Yin,
Abstract summary: Zoom Eye conceptualizes an image as a tree, with each children node representing a zoomed sub-patch of the parent node and the root represents the overall image. We show that Zoom Eye consistently improves the performance of a series base MLLMs with large margin(e.g., LLaVA-v1.5-7B increases by 34.57% on $V*$ Bench and 17.88% on HR-Bench), but also enables small 7B MLLMs to outperform strong large models such as GPT-4o.
Score: 33.675976247869016
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: An image, especially with high-resolution, typically consists of numerous visual elements, ranging from dominant large objects to fine-grained detailed objects. When perceiving such images, multimodal large language models~(MLLMs) face limitations due to the restricted input resolution of the pretrained vision encoder and the cluttered, dense context of the image, resulting in a focus on primary objects while easily overlooking detailed ones. In this paper, we propose Zoom Eye, a tree search algorithm designed to navigate the hierarchical and visual nature of images to capture relevant information. Zoom Eye conceptualizes an image as a tree, with each children node representing a zoomed sub-patch of the parent node and the root represents the overall image. Moreover, Zoom Eye is model-agnostic and training-free, so it enables any MLLMs to simulate human zooming actions by searching along the image tree from root to leaf nodes, seeking out pertinent information, and accurately responding to related queries. We experiment on a series of elaborate high-resolution benchmarks and the results demonstrate that Zoom Eye not only consistently improves the performance of a series base MLLMs with large margin~(e.g., LLaVA-v1.5-7B increases by 34.57\% on $V^*$ Bench and 17.88\% on HR-Bench), but also enables small 7B MLLMs to outperform strong large models such as GPT-4o. Our code is available at \href{https://github.com/om-ai-lab/ZoomEye}{https://github.com/om-ai-lab/ZoomEye}.

Related papers

UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning [83.68366772745689]
We propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses.<n>Specifically, UniPixel processes visual prompts and generates relevant masks on demand, and performs subsequent reasoning conditioning on these intermediate pointers during inference.<n>The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos.
arXiv Detail & Related papers (2025-09-22T17:59:40Z)
Zoom-Refine: Boosting High-Resolution Multimodal Understanding via Localized Zoom and Self-Refinement [24.261439217182613]
Multimodal Large Language Models (MLLM) often struggle to interpret high-resolution images accurately.<n>We introduce Zoom-Refine, a novel training-free method that enhances MLLM capabilities to address this issue.<n>Our method harnesses the MLLM's inherent capabilities for spatial localization, contextual reasoning and comparative analysis without requiring additional training or external experts.
arXiv Detail & Related papers (2025-06-02T13:32:35Z)
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning [31.696397337675847]
Large Vision-Language Models (LVLMs) typically employ limited pre-defined grids to process images. We propose a text-guided token pruning method with Dynamic Image Pyramid (DIP) integration. Our method outperforms existing high-resolution strategies on four datasets using the same data.
arXiv Detail & Related papers (2025-03-10T17:51:16Z)
VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information.<n>We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning.<n>We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts [17.76606110070648]
We propose RSUniVLM, a unified, end-to-end RS VLM for comprehensive vision understanding across multiple granularity.<n> RSUniVLM performs effectively in multi-image analysis, with instances of change detection and change captioning.<n>We also construct a large-scale RS instruction-following dataset based on a variety of existing datasets in both RS and general domain.
arXiv Detail & Related papers (2024-12-07T15:11:21Z)
Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis [44.008094698200026]
This paper investigates whether Multimodal LLMs can supplant traditional image captioning networks by evaluating their performance on various image description benchmarks.<n>We explore both the zero-shot capabilities of these models and their adaptability to different semantic domains through fine-tuning methods.<n>Our results demonstrate that while Multimodal LLMs achieve impressive zero-shot performance, fine-tuning for specific domains while maintaining their generalization capabilities intact remains challenging.
arXiv Detail & Related papers (2024-12-04T19:01:06Z)
ZoomLDM: Latent Diffusion Model for multi-scale image generation [57.639937071834986]
We present ZoomLDM, a diffusion model tailored for generating images across multiple scales. Central to our approach is a novel magnification-aware conditioning mechanism that utilizes self-supervised learning (SSL) embeddings. ZoomLDM achieves state-of-the-art image generation quality across all scales, excelling in the data-scarce setting of generating thumbnails of entire large images.
arXiv Detail & Related papers (2024-11-25T22:39:22Z)
LFSamba: Marry SAM with Mamba for Light Field Salient Object Detection [9.787855464038673]
A light field camera can reconstruct 3D scenes using captured multi-focus images that contain rich spatial geometric information. In this work, a state-of-the-art salient object detection model for multi-focus light field images, called LFSamba, is introduced.
arXiv Detail & Related papers (2024-11-11T01:37:32Z)
FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs [58.95386070800286]
FullAnno is a data engine that generates large-scale, high-quality, and fine-grained image annotations. We re-annotated the COCO and Visual Genome datasets using our FullAnno system. Experiments show that the regenerated annotation can significantly enhance the capabilities of LLaVA-v1.5 on several benchmarks.
arXiv Detail & Related papers (2024-09-20T14:33:17Z)
AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity [85.44800864697464]
We introduce AVG-LLaVA, an LMM that can adaptively select the appropriate visual granularity based on the input image and instruction. We show that AVG-LLaVA achieves superior performance across 11 benchmarks, as well as significantly reduces the number of visual tokens and speeds up inference.
arXiv Detail & Related papers (2024-09-20T10:50:21Z)
AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding [96.01726275876548]
We present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions. We devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images. Our model is capable of processing images with resolutions up to $1008times 1008$.
arXiv Detail & Related papers (2024-08-30T03:16:49Z)
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest. This technique allows LVLMs to access more detailed visual information without altering the original image resolution. Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named as CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension. Our findings indicate that MLLMs consistently fall short of human performance on this benchmark. This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs [71.07108539262721]
We design benchmark settings to emulate human language responses related to low-level vision. We extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs. We demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than humans.
arXiv Detail & Related papers (2024-02-11T06:44:11Z)
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks [92.03764152132315]
We design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters. This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks. It has powerful visual capabilities and can be a good alternative to the ViT-22B.
arXiv Detail & Related papers (2023-12-21T18:59:31Z)
CtxMIM: Context-Enhanced Masked Image Modeling for Remote Sensing Image Understanding [38.53988682814626]
We propose a context-enhanced masked image modeling method (CtxMIM) for remote sensing image understanding. CtxMIM formulates original image patches as a reconstructive template and employs a Siamese framework to operate on two sets of image patches. With the simple and elegant design, CtxMIM encourages the pre-training model to learn object-level or pixel-level features on a large-scale dataset.
arXiv Detail & Related papers (2023-09-28T18:04:43Z)
Efficient Classification of Very Large Images with Tiny Objects [15.822654320750054]
We present an end-to-end CNN model termed Zoom-In network for classification of large images with tiny objects. We evaluate our method on two large-image datasets and one gigapixel dataset.
arXiv Detail & Related papers (2021-06-04T20:13:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.