CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception
- URL: http://arxiv.org/abs/2511.19820v1
- Date: Tue, 25 Nov 2025 01:21:26 GMT
- Title: CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception
- Authors: Miguel Carvalho, Helder Dias, Bruno Martins
- Abstract summary: Vision-Language Models (VLMs) often struggle with tasks that require fine-grained image understanding. We introduce CropVLM as an external low-cost method for boosting performance. CropVLM is trained using reinforcement learning, without using human-labeled bounding boxes as a supervision signal.
- Score: 4.254546679250887
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) often struggle with tasks that require fine-grained image understanding, such as scene-text recognition or document analysis, due to perception limitations and visual fragmentation. To address these challenges, we introduce CropVLM as an external, low-cost method for boosting performance, enabling VLMs to dynamically "zoom in" on relevant image regions and enhancing their ability to capture fine details. CropVLM is trained using reinforcement learning, without using human-labeled bounding boxes as a supervision signal, and without expensive synthetic evaluations. The model is trained once and can be paired with both open-source and proprietary VLMs to improve their performance. Our approach delivers significant improvements on tasks that require high-resolution image understanding, notably for benchmarks that are out-of-domain for the target VLM, without modifying or fine-tuning the VLM, thus avoiding catastrophic forgetting.
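The abstract describes an external crop model that proposes a question-relevant region, which is then "zoomed" and passed to an unmodified VLM. Below is a minimal sketch of that inference-time pairing, based only on the abstract; names such as `crop_model`, `predict_bbox`, and `vlm.generate` are hypothetical placeholders, not the paper's actual API.

```python
# Hedged sketch: pairing an external "zoom" model with a frozen VLM,
# following the pipeline suggested by the abstract.
from PIL import Image

def answer_with_zoom(image_path: str, question: str, crop_model, vlm) -> str:
    image = Image.open(image_path).convert("RGB")

    # 1. The external crop model proposes a region relevant to the question.
    #    (Per the abstract, this model is trained with reinforcement learning,
    #    without human-labeled bounding boxes.)
    x0, y0, x1, y1 = crop_model.predict_bbox(image, question)

    # 2. "Zoom in": crop the region and upsample it so fine details
    #    (e.g. scene text) occupy more of the VLM's input resolution.
    crop = image.crop((x0, y0, x1, y1)).resize(image.size, Image.BICUBIC)

    # 3. Query the unmodified target VLM with both the full image and the
    #    zoomed crop, keeping global context alongside fine-grained detail.
    return vlm.generate(images=[image, crop], prompt=question)
```

Because the target VLM is never fine-tuned in this setup, the same crop model could in principle be reused across open-source and proprietary VLMs, which is the catastrophic-forgetting-free property the abstract claims.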
Related papers
- Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness [36.09698262750699]
We introduce an efficient plug-and-play module that substantially improves vision language models' reasoning over rare objects. We learn multi-modal class embeddings for rare objects by leveraging prior knowledge from vision foundation models and synonym-augmented text descriptions. Experiments on two benchmarks show consistent and substantial gains for pretrained VLMs in rare object recognition and reasoning.
arXiv Detail & Related papers (2026-02-23T09:02:40Z) - VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models [43.09726338623949]
Vision-Language-Action (VLA) models integrate pretrained large Vision-Language Models (VLM) into their policy backbone. This paper revisits a fundamental yet seldom systematically studied question: how VLM choice and competence translate to downstream VLA policy performance. We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters.
arXiv Detail & Related papers (2026-01-06T09:58:24Z) - On the Reliability of Vision-Language Models Under Adversarial Frequency-Domain Perturbations [53.611451075703314]
Vision-Language Models (VLMs) are increasingly used as perceptual modules for visual content reasoning. We show how these feature transformations undermine authenticity/DeepFake detection and automated image captioning tasks.
arXiv Detail & Related papers (2025-07-30T05:41:29Z) - Event-Priori-Based Vision-Language Model for Efficient Visual Understanding [13.540340702321911]
The Event-Priori-Based Vision-Language Model (EP-VLM) improves VLM inference efficiency by using motion priors derived from dynamic event vision.
arXiv Detail & Related papers (2025-06-09T10:45:35Z) - Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection [53.558449071113245]
Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM). Recent advancements in vision-language modeling introduce image cropping techniques that feed all encoded sub-images into the model. We propose a lightweight, universal framework that seamlessly integrates with existing VLMs to enhance their ability to process fine-grained details.
arXiv Detail & Related papers (2025-03-14T18:33:31Z) - Elevating Visual Perception in Multimodal LLMs with Visual Embedding Distillation [109.5893580175657]
The standard practice for developing MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. This approach often causes models to lean towards language comprehension and undermine the rich visual perception signals present in the data. We propose VisPer-LM, the first approach that infuses visual perception knowledge from expert vision encoders into the LLM's hidden representations.
arXiv Detail & Related papers (2024-12-12T18:55:18Z) - Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot- or task-specific training, GVL can predict effective values in-context, zero-shot and few-shot, for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z) - Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance [51.30560006045442]
Image-gRounded guIdaNcE (MARINE) is a framework that is both training-free and API-free. MARINE effectively and efficiently reduces object hallucinations during inference by introducing image-grounded guidance to LVLMs. Our framework's flexibility further allows for the integration of multiple vision models, enabling more reliable and robust object-level guidance.
arXiv Detail & Related papers (2024-02-13T18:59:05Z) - Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding [6.798129852396113]
We introduce a simple and effective method to improve compositional reasoning in Vision-Language Models (VLMs).
Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework.
When integrated with CLIP, our technique yields notable improvement over state-of-the-art baselines.
arXiv Detail & Related papers (2023-06-15T03:26:28Z)