Task-Aware Resolution Optimization for Visual Large Language Models
- URL: http://arxiv.org/abs/2510.09822v1
- Date: Fri, 10 Oct 2025 19:53:30 GMT
- Title: Task-Aware Resolution Optimization for Visual Large Language Models
- Authors: Weiqing Luo, Zhen Tan, Yifan Li, Xinyu Zhao, Kwonjoon Lee, Behzad Dariush, Tianlong Chen
- Abstract summary: Most visual large language models (VLLMs) pre-assume a fixed resolution for downstream tasks, which leads to subpar performance. We propose an empirical formula that determines the optimal resolution for a given vision-language task by combining two factors: image complexity and the model's uncertainty variance across input resolutions. We then propose a novel parameter-efficient fine-tuning technique, grounded in rigorous experiments, to extend the visual input resolution of pre-trained VLLMs to the identified optimal resolution.
- Score: 57.85959322093884
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world vision-language applications demand varying levels of perceptual granularity. However, most existing visual large language models (VLLMs), such as LLaVA, pre-assume a fixed resolution for downstream tasks, which leads to subpar performance. To address this problem, we first conduct a comprehensive and pioneering investigation into the resolution preferences of different vision-language tasks, revealing a correlation between resolution preference and two factors: image complexity, and the uncertainty variance of the VLLM across image input resolutions. Building on this insight, we propose an empirical formula that combines these two factors to determine the optimal resolution for a given vision-language task. Second, based on rigorous experiments, we propose a novel parameter-efficient fine-tuning technique to extend the visual input resolution of pre-trained VLLMs to the identified optimal resolution. Extensive experiments on various vision-language tasks validate the effectiveness of our method.
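The abstract names the two ingredients of the empirical formula (image complexity and the VLLM's uncertainty variance across resolutions) but not its form. The following is a minimal sketch of what such a task-level resolution selector could look like; the gradient-based complexity proxy, the linear scoring rule, and the weight `alpha` are all illustrative assumptions, not the paper's actual formula.

```python
import numpy as np

def complexity_proxy(gray_image: np.ndarray) -> float:
    """Crude image-complexity proxy: mean gradient magnitude of a
    grayscale image (a stand-in for the paper's complexity measure)."""
    gy, gx = np.gradient(gray_image.astype(np.float64))
    return float(np.mean(np.hypot(gx, gy)))

def pick_resolution(gray_images, entropies_by_res, alpha=0.5):
    """Score each candidate resolution by combining dataset-level image
    complexity with the variance of the model's per-token predictive
    entropy at that resolution, then return the best-scoring one.
    `entropies_by_res` maps resolution -> array of token entropies."""
    complexity = float(np.mean([complexity_proxy(im) for im in gray_images]))
    def score(res):
        u_var = float(np.var(entropies_by_res[res]))
        # Assumed heuristic: complex images reward higher resolution,
        # while high uncertainty variance at a resolution penalizes it.
        return alpha * complexity * np.log(res) - (1.0 - alpha) * u_var
    return max(entropies_by_res, key=score)

# Example: choose among three candidate input resolutions.
rng = np.random.default_rng(0)
images = [rng.random((64, 64)) for _ in range(4)]
entropies = {336: rng.random(100), 448: rng.random(100), 672: rng.random(100)}
print(pick_resolution(images, entropies))
```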
Related papers
- ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models [67.75439511654078]
Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses. They face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications. We propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment.
arXiv Detail & Related papers (2025-07-01T16:01:08Z)
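The entry above gives only the shape of the method (training-free, one query, one edited layer). A hypothetical PyTorch sketch of a single-layer intervention at decoding time; the rescaling operation and the module path are assumptions, not the exact ONLY operation.

```python
def one_layer_intervention(scale: float = 1.1):
    """Forward hook that rescales one decoder layer's hidden states at
    inference time (illustrative; the abstract does not specify ONLY's
    exact intervention)."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):                 # HF layers return tuples
            return (output[0] * scale,) + output[1:]
        return output * scale
    return hook

# Usage sketch (the module path is an assumption about the model class):
# layer = model.language_model.model.layers[k]
# handle = layer.register_forward_hook(one_layer_intervention(1.1))
# output_ids = model.generate(**inputs)  # decoding passes through the edit
# handle.remove()
```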
- Native Visual Understanding: Resolving Resolution Dilemmas in Vision-Language Models [21.577488819845982]
Vision-Language Models (VLMs) face significant challenges when dealing with the diverse resolutions and aspect ratios of real-world images. We introduce RC-Bench, a novel benchmark designed to evaluate VLM capabilities under extreme visual conditions. We also propose NativeRes-LLaVA, an open-source training framework that empowers VLMs to effectively process images at their native resolutions and aspect ratios.
arXiv Detail & Related papers (2025-06-15T08:58:09Z)
- SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding [5.976839106353883]
SECOND (Selective and Contrastive Decoding) is a novel approach that enables Vision-Language Models to leverage multi-scale visual information in an object-centric manner. SECOND significantly reduces perceptual hallucinations and outperforms prior methods across a wide range of benchmarks.
arXiv Detail & Related papers (2025-06-10T02:55:38Z)
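The entry above describes selective, object-centric use of multi-scale visual information during decoding. A generic contrastive-decoding sketch over per-scale logits; SECOND's selective, object-centric weighting is not reproduced here.

```python
import torch

def multiscale_contrastive_logits(logits_by_scale, beta=1.0):
    """Contrast the finest-scale prediction against the mean of coarser
    scales, boosting tokens grounded in fine visual detail."""
    fine = logits_by_scale[-1]                       # highest-resolution view
    coarse = torch.stack(logits_by_scale[:-1]).mean(dim=0)
    return (1.0 + beta) * fine - beta * coarse

# Example with dummy vocabulary logits at three image scales:
logits = [torch.randn(32000) for _ in range(3)]
next_token = multiscale_contrastive_logits(logits).argmax(-1)
print(int(next_token))
```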
- Task-driven real-world super-resolution of document scans [41.61731067095584]
Single-image super-resolution refers to the reconstruction of a high-resolution image from a single low-resolution observation. We introduce a task-driven, multi-task learning framework for training a super-resolution network optimized for optical character recognition. We validate our approach on the SRResNet architecture, a well-established model for single-image super-resolution.
arXiv Detail & Related papers (2025-06-08T00:16:29Z)
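A minimal sketch of the task-driven idea in the entry above: train the super-resolution network with a reconstruction term plus a downstream OCR loss computed on the super-resolved output. The CTC head and the weight `lam` are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def task_driven_sr_loss(sr_img, hr_img, ocr_log_probs, targets,
                        input_lengths, target_lengths, lam=0.1):
    """Multi-task objective: L1 pixel reconstruction plus a CTC loss from
    an OCR head applied to the super-resolved image."""
    recon = F.l1_loss(sr_img, hr_img)
    # ocr_log_probs: [T, N, C] log-probabilities over characters
    ocr = F.ctc_loss(ocr_log_probs, targets, input_lengths, target_lengths)
    return recon + lam * ocr

# Dummy shapes: batch of 2 images, 50 time steps, 30-character alphabet.
sr = torch.rand(2, 3, 128, 128, requires_grad=True)
hr = torch.rand(2, 3, 128, 128)
log_probs = torch.randn(50, 2, 30).log_softmax(-1)
targets = torch.randint(1, 30, (2, 10))
loss = task_driven_sr_loss(sr, hr, log_probs, targets,
                           torch.full((2,), 50), torch.full((2,), 10))
print(float(loss))
```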
- Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation [35.50570174431677]
We propose a novel multi-resolution paradigm that leverages Whole Slide Images (WSIs) to extract histology patches at multiple resolutions. We introduce visual-textual alignment at multiple resolutions, as well as cross-resolution alignment, to establish more effective text-guided visual representations. Supported by novel loss functions, our model captures a broader range of information, enriches feature representations, improves discriminative ability, and enhances generalization across different resolutions.
arXiv Detail & Related papers (2025-04-26T08:44:04Z)
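The abstract above mentions visual-textual alignment at multiple resolutions plus cross-resolution alignment without giving the losses. A generic CLIP-style sketch with all three pairings expressed as InfoNCE terms; the loss family is an assumption.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def multi_resolution_alignment(z_low, z_high, z_text):
    """Align each resolution's patch embedding with the text embedding,
    and the two resolutions with each other (cross-resolution term)."""
    return (info_nce(z_low, z_text) + info_nce(z_high, z_text)
            + info_nce(z_low, z_high))

# Dummy batch of 8 paired embeddings:
z_lo, z_hi, z_tx = (torch.randn(8, 256) for _ in range(3))
print(float(multi_resolution_alignment(z_lo, z_hi, z_tx)))
```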
- High Efficiency Image Compression for Large Visual-Language Models [14.484831372497437]
Large visual language models (LVLMs) have shown impressive performance and promising generalization capability in multi-modal tasks.
We propose a variable image compression framework consisting of a pre-editing module and an end-to-end codec to achieve promising rate-accuracy performance.
arXiv Detail & Related papers (2024-07-24T07:37:12Z)
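A sketch of the rate-accuracy idea behind the entry above: optimize estimated code length against the downstream LVLM task loss rather than pixel distortion. The entropy-model interface and the weight are assumptions.

```python
import torch

def rate_accuracy_loss(likelihoods, task_loss, lam=10.0):
    """Estimated bits from the codec's entropy model plus a weighted
    downstream task loss, replacing the usual distortion term."""
    bits = (-torch.log2(likelihoods)).sum()   # estimated code length
    return bits + lam * task_loss

# Dummy latent likelihoods and a stand-in LVLM loss value:
p = torch.rand(1, 192, 16, 16).clamp_min(1e-9)
print(float(rate_accuracy_loss(p, torch.tensor(2.3))))
```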
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
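The interactive, region-focused reasoning described above can be pictured as a two-step query. The `model.ask` interface, the prompt wording, and the ROI reply format are all hypothetical, not the paper's implementation.

```python
from PIL import Image

def chain_of_spot(model, image: Image.Image, question: str) -> str:
    """Two-step inference sketch: (1) ask where to look, (2) crop that
    region and answer using both the full image and the zoomed crop,
    leaving the original image resolution untouched."""
    reply = model.ask([image],
                      f"Which region of the image is key to answering "
                      f"'{question}'? Reply as x0,y0,x1,y1 in [0,1].")
    x0, y0, x1, y1 = (float(v) for v in reply.split(","))
    w, h = image.size
    crop = image.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))
    return model.ask([image, crop], question)
```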
- Debiasing Multimodal Large Language Models via Penalization of Language Priors [38.97645845493758]
Multimodal Large Language Models (MLLMs) have become indispensable tools in computer vision and natural language processing. Despite their advancements, our investigation reveals a noteworthy bias: the generated content is often driven more by the inherent priors of the underlying Large Language Models (LLMs) than by the input image. We propose two simple, training-free strategies to rectify these biases and redirect the model's focus toward visual information.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
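A generic, training-free sketch of prior penalization as described above: contrast image-conditioned logits against text-only logits so tokens favored purely by the LLM prior are down-weighted. The paper's two specific strategies are not detailed in the abstract.

```python
import torch

def penalize_language_prior(logits_with_image, logits_text_only, gamma=1.0):
    """Subtract a scaled copy of the text-only logits, which capture the
    language prior, from the image-conditioned logits before sampling."""
    return logits_with_image - gamma * logits_text_only

# Dummy decoding step over a 32k vocabulary:
l_img, l_txt = torch.randn(32000), torch.randn(32000)
print(int(penalize_language_prior(l_img, l_txt).argmax(-1)))
```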
- Visualizing the Relationship Between Encoded Linguistic Information and Task Performance [53.223789395577796]
We study the dynamic relationship between the encoded linguistic information and task performance from the viewpoint of Pareto Optimality.
We conduct experiments on two popular NLP tasks, i.e., machine translation and language modeling, and investigate the relationship between several kinds of linguistic information and task performance.
Our empirical findings suggest that some syntactic information is helpful for NLP tasks whereas encoding more syntactic information does not necessarily lead to better performance.
arXiv Detail & Related papers (2022-03-29T19:03:10Z)
- Multi-Scale Aligned Distillation for Low-Resolution Detection [68.96325141432078]
This paper focuses on boosting the performance of low-resolution models by distilling knowledge from a high- or multi-resolution model.
On several instance-level detection tasks and datasets, the low-resolution models trained via our approach perform competitively with high-resolution models trained via conventional multi-scale training.
arXiv Detail & Related papers (2021-09-14T12:53:35Z)
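A plain feature-alignment sketch of the distillation setting above: match each stage of the low-resolution student to the spatially resized high-resolution teacher features. The paper's multi-scale alignment is richer than this L2 form.

```python
import torch
import torch.nn.functional as F

def aligned_distillation_loss(student_feats, teacher_feats):
    """Resize each teacher feature map to the student's spatial size and
    penalize the L2 gap, stage by stage."""
    loss = torch.zeros(())
    for fs, ft in zip(student_feats, teacher_feats):
        ft = F.interpolate(ft, size=fs.shape[-2:], mode="bilinear",
                           align_corners=False)
        loss = loss + F.mse_loss(fs, ft.detach())
    return loss

# Dummy pyramid: student at half the teacher's input resolution.
student = [torch.rand(2, 64, 32, 32), torch.rand(2, 128, 16, 16)]
teacher = [torch.rand(2, 64, 64, 64), torch.rand(2, 128, 32, 32)]
print(float(aligned_distillation_loss(student, teacher)))
```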
- Cross-Resolution Adversarial Dual Network for Person Re-Identification and Beyond [59.149653740463435]
Person re-identification (re-ID) aims at matching images of the same person across camera views.
Due to varying distances between cameras and persons of interest, resolution mismatch can be expected.
We propose a novel generative adversarial network to address cross-resolution person re-ID.
arXiv Detail & Related papers (2020-02-19T07:21:38Z)