VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs
- URL: http://arxiv.org/abs/2509.25916v1
- Date: Tue, 30 Sep 2025 08:10:56 GMT
- Title: VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs
- Authors: Peng Liu, Haozhan Shen, Chunxin Fang, Zhicheng Sun, Jiajia Liao, Tiancheng Zhao,
- Abstract summary: Vision-Language Models (VLMs) excel at high-level scene understanding but falter on fine-grained perception tasks requiring precise localization. We introduce VLM-FO1, a novel framework that overcomes this limitation by reframing object-centric perception into a robust feature retrieval task. Our method operates as a plug-and-play module that integrates with any pre-trained VLM.
- Score: 13.486495756813078
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) excel at high-level scene understanding but falter on fine-grained perception tasks requiring precise localization. This failure stems from a fundamental mismatch, as generating exact numerical coordinates is a challenging task for language-centric architectures. In this paper, we introduce VLM-FO1, a novel framework that overcomes this limitation by reframing object-centric perception from a brittle coordinate generation problem into a robust feature retrieval task. Our method operates as a plug-and-play module that integrates with any pre-trained VLM. It leverages a Hybrid Fine-grained Region Encoder (HFRE), featuring a dual vision encoder, to generate powerful region tokens rich in both semantic and spatial detail. A token-based referencing system then enables the LLM to seamlessly reason about and ground language in these specific visual regions. Experiments show that VLM-FO1 achieves state-of-the-art performance across a diverse suite of benchmarks, demonstrating exceptional capabilities in object grounding, region generational understanding, and visual region reasoning. Crucially, our two-stage training strategy ensures that these perception gains are achieved without compromising the base model's general visual understanding capabilities. VLM-FO1 establishes an effective and flexible paradigm for building perception-aware VLMs, bridging the gap between high-level reasoning and fine-grained visual grounding.
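
A minimal sketch of the region-token idea described in the abstract, assuming a PyTorch-style implementation: per-region features are pooled from two feature maps (one semantically rich, one spatially detailed), fused, and projected into the LLM embedding space so the model can refer to regions through placeholder tokens rather than emitting numeric coordinates. The module name, 1x1 RoIAlign pooling, concatenation-based fusion, and shapes below are illustrative assumptions, not the released VLM-FO1 code.

```python
# Illustrative sketch only: names, shapes, and the fusion strategy are assumptions.
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class HybridRegionEncoder(nn.Module):
    """Pools per-region features from two vision feature maps and projects them
    into the LLM embedding space as 'region tokens'."""

    def __init__(self, sem_dim: int, spa_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(sem_dim + spa_dim, llm_dim)

    def forward(self, sem_map, spa_map, boxes, image_size: int):
        # sem_map: (1, C_sem, H, W)   semantically rich features (e.g. a ViT grid)
        # spa_map: (1, C_spa, H', W') higher-resolution, spatially detailed features
        # boxes:   (N, 4) region boxes in pixel coordinates of an image_size x image_size input
        batch_idx = torch.zeros(len(boxes), 1, device=boxes.device, dtype=boxes.dtype)
        rois = torch.cat([batch_idx, boxes], dim=1)                    # (N, 5)
        sem = roi_align(sem_map, rois, output_size=1,
                        spatial_scale=sem_map.shape[-1] / image_size)  # (N, C_sem, 1, 1)
        spa = roi_align(spa_map, rois, output_size=1,
                        spatial_scale=spa_map.shape[-1] / image_size)  # (N, C_spa, 1, 1)
        fused = torch.cat([sem.flatten(1), spa.flatten(1)], dim=1)     # (N, C_sem + C_spa)
        return self.proj(fused)                                        # (N, llm_dim) region tokens


# In a token-based referencing scheme, the N region tokens would be spliced into the
# prompt embedding sequence at the positions of placeholders such as "<region_0>",
# so the LLM can answer "What is <region_2>?" by attending to that embedding
# instead of generating coordinates.
```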
Related papers
- TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation [70.23578202012048]
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch. We propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. With the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction.
arXiv Detail & Related papers (2026-03-03T13:28:07Z) - BREATH-VL: Vision-Language-Guided 6-DoF Bronchoscopy Localization via Semantic-Geometric Fusion [7.382475458362566]
We present BREATH-VL, a hybrid framework that integrates semantic cues from vision-language models with geometric information from registration methods for accurate 6-DoF pose estimation. Building on this, our BREATH-VL outperforms state-of-the-art vision-only localization methods in both accuracy and generalization, reducing translational error by 25.5% compared with the best-performing baseline.
arXiv Detail & Related papers (2026-01-07T09:00:52Z) - Attention Guided Alignment in Efficient Vision-Language Models [56.20286899428444]
Large Vision-Language Models (VLMs) rely on effective multimodal alignment between pre-trained vision encoders and Large Language Models (LLMs). This paper presents a comprehensive analysis of attention patterns in efficient VLMs. We introduce Attention-Guided Efficient Vision-Language Models (AGE-VLM), a novel framework that enhances visual grounding through interleaved cross-attention layers.
arXiv Detail & Related papers (2025-11-21T21:36:48Z) - Hierarchical Contextual Grounding LVLM: Enhancing Fine-Grained Visual-Language Understanding with Robust Grounding [0.3262230127283452]
Hierarchical Contextual Grounding LVLM (HCG-LVLM) is a novel architecture that mimics human coarse-to-fine cognitive processing. Our model achieves superior accuracy and significantly reduces hallucination, validating the effectiveness of its hierarchical design.
arXiv Detail & Related papers (2025-08-23T09:57:52Z) - Event-Priori-Based Vision-Language Model for Efficient Visual Understanding [13.540340702321911]
Event-Priori-Based Vision-Language Model (EP-VLM) uses motion priors derived from dynamic event vision to improve VLM inference efficiency.
arXiv Detail & Related papers (2025-06-09T10:45:35Z) - Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos [53.723410664944566]
We present Perceive Anything Model (PAM), a framework for comprehensive region-level visual understanding in images and videos. Our approach extends the powerful segmentation model SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation. A key component, Semantic Perceiver, is introduced to efficiently transform SAM 2's rich visual features into multi-modal tokens.
arXiv Detail & Related papers (2025-06-05T17:51:39Z) - CoF: Coarse to Fine-Grained Image Understanding for Multi-modal Large Language Models [16.91226496250909]
We break down multi-modal understanding into two stages, from Coarse to Fine. In the first stage, we prompt the MLLM to locate the approximate area of the answer. In the second stage, we further enhance the model's focus on relevant areas through visual prompt engineering.
arXiv Detail & Related papers (2024-12-22T05:42:40Z) - Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation [109.5893580175657]
In recent times, the standard practice for developing MLLMs is to feed features from vision encoder(s) into the LLM and train with natural language supervision. This approach often causes models to lean towards language comprehension and undermine the rich visual perception signals present in the data. We propose VisPer-LM, the first approach that infuses visual perception knowledge from expert vision encoders into the LLM's hidden representations.
arXiv Detail & Related papers (2024-12-12T18:55:18Z) - TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning [54.033346088090674]
We introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability. To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT. This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process.
arXiv Detail & Related papers (2024-10-14T13:35:47Z) - LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [58.82222646803248]
Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals.
Most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge.
We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge in two levels.
arXiv Detail & Related papers (2023-11-20T15:56:44Z) - GLIPv2: Unifying Localization and Vision-Language Understanding [161.1770269829139]
We present GLIPv2, a grounded VL understanding model, that serves both localization tasks and Vision-Language (VL) understanding tasks.
GLIPv2 unifies localization pre-training and Vision-Language Pre-training with three pre-training tasks.
We show that a single GLIPv2 model achieves near SoTA performance on various localization and understanding tasks.
arXiv Detail & Related papers (2022-06-12T20:31:28Z)