Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning
- URL: http://arxiv.org/abs/2602.22703v1
- Date: Thu, 26 Feb 2026 07:28:04 GMT
- Title: Enhancing Geometric Perception in VLMs via Translator-Guided Reinforcement Learning
- Authors: Hao Yu, Shuning Jia, Guanghao Li, Wenhao Jiang, Chun Yuan
- Abstract summary: Vision-language models (VLMs) often struggle with geometric reasoning due to their limited perception of fundamental diagram elements. We introduce GeoPerceive, a benchmark comprising diagram instances paired with domain-specific language representations. We propose GeoDPO, a translator-guided reinforcement learning framework.
- Score: 52.075928878249066
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) often struggle with geometric reasoning due to their limited perception of fundamental diagram elements. To tackle this challenge, we introduce GeoPerceive, a benchmark comprising diagram instances paired with domain-specific language (DSL) representations, along with an efficient automatic data generation pipeline. This design enables the isolated evaluation of geometric perception independently from reasoning. To exploit the data provided by GeoPerceive for enhancing the geometric perception capabilities of VLMs, we propose GeoDPO, a translator-guided reinforcement learning (RL) framework. GeoDPO employs an NL-to-DSL translator, which is trained on synthetic pairs generated by the data engine of GeoPerceive, to bridge natural language and DSL. This translator facilitates the computation of fine-grained, DSL-level scores, which serve as reward signals in reinforcement learning. We assess GeoDPO on both in-domain and out-of-domain datasets, spanning tasks in geometric perception as well as downstream reasoning. Experimental results demonstrate that, while supervised fine-tuning (SFT) offers only marginal improvements and may even impair performance in out-of-domain scenarios, GeoDPO achieves substantial gains: $+26.5\%$ on in-domain data, $+8.0\%$ on out-of-domain data, and $+39.0\%$ on downstream reasoning tasks. These findings underscore the superior performance and generalization ability of GeoDPO over SFT. All code is released at https://github.com/Longin-Yu/GeoPerceive to ensure reproducibility.
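As a rough illustration of the "DSL-level score as reward" idea from the abstract: once the translator has converted a model's natural-language diagram description into DSL statements, the output can be scored against the ground-truth DSL. The sketch below assumes a simple set-overlap (F1) scoring over DSL statements; the function name, statement format, and scoring rule are all hypothetical and may differ from the paper's actual reward design.

```python
def dsl_reward(predicted_dsl: str, reference_dsl: str) -> float:
    """Hypothetical DSL-level reward: F1 overlap between the sets of
    DSL statements (one statement per line) in the predicted and
    reference representations. Returns a value in [0, 1]."""
    pred = {line.strip() for line in predicted_dsl.splitlines() if line.strip()}
    ref = {line.strip() for line in reference_dsl.splitlines() if line.strip()}
    if not pred or not ref:
        return 0.0
    overlap = len(pred & ref)          # statements matched exactly
    precision = overlap / len(pred)    # fraction of predictions that are correct
    recall = overlap / len(ref)        # fraction of ground truth recovered
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: two predicted statements, one of which matches the reference.
score = dsl_reward("circle(O, r)\nline(A, B)", "circle(O, r)\nline(A, C)")
```

A fine-grained score like this (rather than a binary correct/incorrect signal) is what allows the RL objective to reward partial perceptual accuracy.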
Related papers
- GeoGR: A Generative Retrieval Framework for Spatio-Temporal Aware POI Recommendation [15.009742536403763]
GeoGR is a geographic generative recommendation framework tailored for navigation-based LBS like AMAP.
It perceives users' contextual state changes and enables intent-aware POI recommendation.
Extensive experiments on multiple real-world datasets demonstrate GeoGR's superiority over state-of-the-art baselines.
arXiv Detail & Related papers (2026-02-11T01:48:27Z) - GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization [53.080882980294795]
Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools.
In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses.
Since existing geolocalization benchmarks fail to meet the need for high-resolution imagery and the localization challenge for deep agentic reasoning, we curate GeoBench.
We propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related
arXiv Detail & Related papers (2025-11-19T18:59:22Z) - GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation [57.8059956428009]
Recent attempts to transfer features from 2D Vision-Language Models to 3D semantic segmentation expose a persistent trade-off.
We propose GeoPurify, which applies a small Student Affinity Network to 2D VLM-generated 3D point features using geometric priors distilled from a 3D self-supervised teacher model.
Benefiting from latent geometric information and the learned affinity network, GeoPurify effectively mitigates the trade-off and achieves superior data efficiency.
arXiv Detail & Related papers (2025-10-02T16:37:56Z) - GRASP: Geospatial pixel Reasoning viA Structured Policy learning [16.023628299873494]
GRASP is a structured policy-learning framework that integrates a multimodal large language model with a pretrained segmentation model in a cascaded manner.
PRIME is a training paradigm that replaces supervised fine-tuning with reinforcement learning to better align reasoning and grounding behaviors with task objectives.
We release GRASP-1k, a fully out-of-domain benchmark with reasoning-intensive queries, reasoning traces, and fine-grained masks.
arXiv Detail & Related papers (2025-08-23T18:05:06Z) - ReGUIDE: Data Efficient GUI Grounding via Spatial Reasoning and Search [53.40810298627443]
ReGUIDE is a framework for web grounding that enables MLLMs to learn data efficiently through self-generated reasoning and spatial-aware criticism.
Our experiments demonstrate that ReGUIDE significantly advances web grounding performance across multiple benchmarks.
arXiv Detail & Related papers (2025-05-21T08:36:18Z) - Geo-FuB: A Method for Constructing an Operator-Function Knowledge Base for Geospatial Code Generation Tasks Using Large Language Models [0.5242869847419834]
This study introduces a framework to construct such a knowledge base, leveraging geospatial script semantics.
An example knowledge base, Geo-FuB, built from 154,075 Google Earth Engine scripts, is available on GitHub.
arXiv Detail & Related papers (2024-10-28T12:50:27Z) - Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework [51.26566634946208]
We introduce smileGeo, a novel visual geo-localization framework.
By inter-agent communication, smileGeo integrates the inherent knowledge of these agents with additional retrieved information.
Results show that our approach significantly outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2024-08-21T03:31:30Z) - GeoDTR+: Toward generic cross-view geolocalization via geometric disentanglement [20.346145927174373]
Cross-View Geo-Localization (CVGL) estimates the location of a ground image by matching it to a geo-tagged aerial image in a database.
Existing methods still suffer from poor performance in cross-area evaluation, in which the training and testing data are captured from completely distinct areas.
We attribute this deficiency to the lack of ability to extract the geometric layout of visual features and models' overfitting to low-level details.
In this work, we propose GeoDTR+ with an enhanced GLE module that better models the correlations among visual features.
arXiv Detail & Related papers (2023-08-18T15:32:01Z) - GeoGPT: Understanding and Processing Geospatial Tasks through An Autonomous GPT [6.618846295332767]
Decision-makers in GIS need to combine a series of spatial algorithms and operations to solve geospatial tasks.
We develop a new framework called GeoGPT that can conduct geospatial data collection, processing, and analysis in an autonomous manner.
arXiv Detail & Related papers (2023-07-16T03:03:59Z) - GNN-Geo: A Graph Neural Network-based Fine-grained IP geolocation Framework [26.918369615549803]
Rule-based fine-grained IP geolocation methods are hard to generalize in computer networks.
We propose a Graph Neural Network (GNN)-based IP geolocation framework named GNN-Geo.
The proposed GNN-Geo clearly outperforms the state-of-the-art rule-based and learning-based baselines.
arXiv Detail & Related papers (2021-12-18T10:54:31Z) - Local Augmentation for Graph Neural Networks [78.48812244668017]
We introduce local augmentation, which enhances a node's features using its local subgraph structure.
Based on the local augmentation, we further design a novel framework: LA-GNN, which can apply to any GNN models in a plug-and-play manner.
arXiv Detail & Related papers (2021-09-08T18:10:08Z)