Improving GUI Grounding with Explicit Position-to-Coordinate Mapping
- URL: http://arxiv.org/abs/2510.03230v1
- Date: Fri, 03 Oct 2025 17:59:34 GMT
- Title: Improving GUI Grounding with Explicit Position-to-Coordinate Mapping
- Authors: Suyuchen Wang, Tianyu Zhang, Ahmed Masry, Christopher Pal, Spandana Gella, Bang Liu, Perouz Taslakian,
- Abstract summary: Current approaches generate coordinates as text tokens directly from visual features, forcing the model to infer complex position-to-pixel mappings implicitly.<n>We propose RULER tokens as explicit coordinate markers, letting the model reference positions similar to gridlines on a map and adjust rather than generate coordinates from scratch.<n>Experiments on ScreenSpot, ScreenSpot-V2, and ScreenSpot-Pro show consistent gains in grounding accuracy, with the largest improvements on high-resolution interfaces.
- Score: 40.201918480639954
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: GUI grounding, the task of mapping natural-language instructions to pixel coordinates, is crucial for autonomous agents, yet remains difficult for current VLMs. The core bottleneck is reliable patch-to-pixel mapping, which breaks when extrapolating to high-resolution displays unseen during training. Current approaches generate coordinates as text tokens directly from visual features, forcing the model to infer complex position-to-pixel mappings implicitly; as a result, accuracy degrades and failures proliferate on new resolutions. We address this with two complementary innovations. First, RULER tokens serve as explicit coordinate markers, letting the model reference positions similar to gridlines on a map and adjust rather than generate coordinates from scratch. Second, Interleaved MRoPE (I-MRoPE) improves spatial encoding by ensuring that width and height dimensions are represented equally, addressing the asymmetry of standard positional schemes. Experiments on ScreenSpot, ScreenSpot-V2, and ScreenSpot-Pro show consistent gains in grounding accuracy, with the largest improvements on high-resolution interfaces. By providing explicit spatial guidance rather than relying on implicit learning, our approach enables more reliable GUI automation across diverse resolutions and platforms.
Related papers
- Cross-Layer Attentive Feature Upsampling for Low-latency Semantic Segmentation [52.01210390327581]
We propose Guided Attentive Interpolation (GAI) to adaptively interpolate fine-grained high-resolution features with semantic features.<n>GAI determines both spatial and semantic relations of pixels from features of different resolutions and then leverages these relations to interpolate high-resolution features with rich semantics.<n>In experiments, the GAI-based semantic segmentation networks, i.e., GAIN, can achieve78.8 mIoU with 22.3 FPS on Cityscapes and 80.6 mIoU with 64.5 on CamVid using an NVIDIA 1080Ti GPU.
arXiv Detail & Related papers (2026-01-03T12:09:49Z) - Seeing the Unseen: Mask-Driven Positional Encoding and Strip-Convolution Context Modeling for Cross-View Object Geo-Localization [8.559240391514063]
Cross-view object geo-localization enables high-precision object localization through cross-view matching.<n>Existing methods rely on keypoint-based positional encoding, which captures only 2D coordinates while neglecting object shape information.<n>We propose a mask-based positional encoding scheme that leverages segmentation masks to capture both spatial coordinates and object silhouettes.<n>We present EDGeo, an end-to-end framework for robust cross-view object geo-localization.
arXiv Detail & Related papers (2025-10-23T06:07:07Z) - Learning GUI Grounding with Spatial Reasoning from Visual Feedback [46.66862168972301]
We train our GUI grounding model, GUI-Cursor, using multi-step online reinforcement learning with a dense trajectory-based reward function.<n>Our experimental results show that GUI-Cursor, based on Qwen2.5-VL-7B, improves the GUI grounding accuracy and achieves state-of-the-art results.
arXiv Detail & Related papers (2025-09-25T20:38:01Z) - Semantic-Enhanced Cross-Modal Place Recognition for Robust Robot Localization [1.2031796234206136]
We introduce a framework we call Semantic-Enhanced Cross-Modal Place Recognition (SCM-PR)<n>SCM-PR combines high-level semantics utilizing RGB images for robust localization in LiDAR maps.<n>Our experimental work on the KITTI and KITTI-360 datasets show that SCM-PR achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-09-16T19:17:54Z) - GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding [51.497245303008015]
Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction.<n>Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards.<n>We show that GUI-G$2$, substantially outperforms state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro.
arXiv Detail & Related papers (2025-07-21T17:53:42Z) - R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding [18.100091500983044]
A critical challenge in GUI automation is the precise grounding of interface elements across diverse platforms.<n>Existing vision-only GUI agents directly ground elements from large and cluttered screenshots.<n>We introduce R-VLM, a novel GUI grounding approach that leverages zoomed-in region proposals for precise element localization.
arXiv Detail & Related papers (2025-07-08T04:56:57Z) - SeqPE: Transformer with Sequential Position Encoding [76.22159277300891]
SeqPE represents each $n$-dimensional position index as a symbolic sequence and employs a lightweight sequential position encoder to learn their embeddings.<n> Experiments across language modeling, long-context question answering, and 2D image classification demonstrate that SeqPE not only surpasses strong baselines in perplexity, exact match (EM) and accuracy--but also enables seamless generalization to multi-dimensional inputs without requiring manual architectural redesign.
arXiv Detail & Related papers (2025-06-16T09:16:40Z) - Scale-wise Bidirectional Alignment Network for Referring Remote Sensing Image Segmentation [12.893224628061516]
The goal of referring remote sensing image segmentation (RRSIS) is to extract specific pixel-level regions within an aerial image via a natural language expression.<n>We propose an innovative framework called Scale-wise Bidirectional Alignment Network (SBANet) to address these challenges.<n>Our proposed method achieves superior performance in comparison to previous state-of-the-art methods on the RRSIS-D and RefSegRS datasets.
arXiv Detail & Related papers (2025-01-01T14:24:04Z) - PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer [51.260384040953326]
Handwritten Mathematical Expression Recognition (HMER) has wide applications in human-machine interaction scenarios.
We propose a position forest transformer (PosFormer) for HMER, which jointly optimize two tasks: expression recognition and position recognition.
PosFormer consistently outperforms the state-of-the-art methods 2.03%/1.22%/2, 1.83%, and 4.62% gains on datasets.
arXiv Detail & Related papers (2024-07-10T15:42:58Z) - Differentiable Registration of Images and LiDAR Point Clouds with
VoxelPoint-to-Pixel Matching [58.10418136917358]
Cross-modality registration between 2D images from cameras and 3D point clouds from LiDARs is a crucial task in computer vision and robotic training.
Previous methods estimate 2D-3D correspondences by matching point and pixel patterns learned by neural networks.
We learn a structured cross-modality matching solver to represent 3D features via a different latent pixel space.
arXiv Detail & Related papers (2023-12-07T05:46:10Z) - IDLS: Inverse Depth Line based Visual-Inertial SLAM [9.38589798999922]
Inverse Depth Line SLAM (IDLS) is proposed to track the line features in SLAM in an accurate and efficient way.
IDLS is extensively evaluated in multiple perceptually-challenging datasets.
arXiv Detail & Related papers (2023-04-23T20:53:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.