DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning
- URL: http://arxiv.org/abs/2507.00008v1
- Date: Thu, 12 Jun 2025 03:13:21 GMT
- Title: DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning
- Authors: Hang Wu, Hongkai Chen, Yujun Cai, Chang Liu, Qingwen Ye, Ming-Hsuan Yang, Yiwei Wang
- Abstract summary: We introduce DiMo-GUI, a training-free framework for GUI grounding. Instead of treating the GUI as a monolithic image, our method splits the input into textual elements and iconic elements. When predictions are ambiguous or incorrect, DiMo-GUI dynamically focuses attention by generating candidate focal regions.
- Score: 52.37530640460363
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Grounding natural language queries in graphical user interfaces (GUIs) poses unique challenges due to the diversity of visual elements, spatial clutter, and the ambiguity of language. In this paper, we introduce DiMo-GUI, a training-free framework for GUI grounding that leverages two core strategies: dynamic visual grounding and modality-aware optimization. Instead of treating the GUI as a monolithic image, our method splits the input into textual elements and iconic elements, allowing the model to reason over each modality independently using general-purpose vision-language models. When predictions are ambiguous or incorrect, DiMo-GUI dynamically focuses attention by generating candidate focal regions centered on the model's initial predictions and incrementally zooms into subregions to refine the grounding result. This hierarchical refinement process helps disambiguate visually crowded layouts without the need for additional training or annotations. We evaluate our approach on standard GUI grounding benchmarks and demonstrate consistent improvements over baseline inference pipelines, highlighting the effectiveness of combining modality separation with region-focused reasoning.
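The hierarchical zoom-in refinement described in the abstract can be illustrated with a minimal, hedged sketch. Everything below is an assumption for illustration: the `query_vlm` helper, the crop ratio, and the step count are placeholders, not DiMo-GUI's released implementation, and the abstract's modality separation (reasoning over textual and iconic elements independently) is omitted.

```python
# Minimal sketch of the zoom-in refinement described in the abstract.
# Assumptions: `query_vlm` is a hypothetical callable that asks a
# general-purpose VLM for an (x, y) click point on an image; the crop
# ratio and number of steps are illustrative, not DiMo-GUI's settings.
from typing import Callable, Tuple
from PIL import Image

Point = Tuple[int, int]

def refine_grounding(image: Image.Image, query: str,
                     query_vlm: Callable[[Image.Image, str], Point],
                     steps: int = 2, zoom: float = 0.5) -> Point:
    """Iteratively crop a focal region around the model's prediction."""
    offset_x, offset_y = 0, 0
    region = image
    for _ in range(steps):
        # Initial (possibly ambiguous) prediction on the current region.
        x, y = query_vlm(region, query)
        # Candidate focal region centered on that prediction.
        w, h = region.size
        cw, ch = max(int(w * zoom), 1), max(int(h * zoom), 1)
        left = min(max(x - cw // 2, 0), w - cw)
        top = min(max(y - ch // 2, 0), h - ch)
        # Accumulate offsets so the final point maps back to the full image.
        offset_x, offset_y = offset_x + left, offset_y + top
        region = region.crop((left, top, left + cw, top + ch))
    # Final prediction on the most zoomed-in region.
    x, y = query_vlm(region, query)
    return offset_x + x, offset_y + y
```

Tracking the crop offsets is what lets the final point be reported in full-image coordinates, which is what a grounding benchmark scores against.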
Related papers
- Test-Time Reinforcement Learning for GUI Grounding via Region Consistency [17.954613936413942]
We propose a test-time scaling method that constructs spatial voting grids from multiple sampled predictions to identify consensus regions. We also introduce GUI-RCPO, which transforms these consistency patterns into rewards for test-time reinforcement learning. Our approach reveals the untapped potential of test-time scaling and test-time reinforcement learning for GUI grounding, offering a promising path toward more robust and data-efficient GUI agents. (A minimal voting-grid sketch appears after this list.)
arXiv Detail & Related papers (2025-08-07T17:54:27Z)
- R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding [18.100091500983044]
A critical challenge in GUI automation is the precise grounding of interface elements across diverse platforms. Existing vision-only GUI agents directly ground elements from large and cluttered screenshots. We introduce R-VLM, a novel GUI grounding approach that leverages zoomed-in region proposals for precise element localization.
arXiv Detail & Related papers (2025-07-08T04:56:57Z)
- GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents [93.49577107524176]
We propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated <ACTOR> token with all relevant visual patch tokens. Experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks.
arXiv Detail & Related papers (2025-06-03T17:59:08Z)
- Visual Test-time Scaling for GUI Agent Grounding [61.609126885427386]
We introduce RegionFocus, a visual test-time scaling approach for Vision Language Model agents. Our approach dynamically zooms in on relevant regions, reducing background clutter and improving grounding accuracy. We observe significant performance gains of over 28% on the ScreenSpot-Pro benchmark and over 24% on WebVoyager.
arXiv Detail & Related papers (2025-05-01T17:45:59Z)
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions. On the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
arXiv Detail & Related papers (2025-01-21T17:48:10Z)
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
Aguvis is a vision-based framework for autonomous GUI agents. It standardizes cross-platform interactions and incorporates structured reasoning via inner monologue. It achieves state-of-the-art performance across offline and real-world online benchmarks.
arXiv Detail & Related papers (2024-12-05T18:58:26Z)
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop a vision-language-action model for the digital world, namely ShowUI.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z)
- Improved GUI Grounding via Iterative Narrowing [0.03922370499388702]
We introduce a visual prompting framework that employs an iterative narrowing mechanism to improve the performance of both general and fine-tuned models in GUI grounding. For evaluation, we tested our method on a comprehensive benchmark comprising various UI platforms and provided the code to reproduce our results.
arXiv Detail & Related papers (2024-11-18T05:47:12Z)
- Multi-Granularity Language-Guided Training for Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity. At inference, LG-MOT uses the standard visual features without relying on annotated language descriptions. LG-MOT achieves an absolute gain of 2.2% in target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z)
- Graph4GUI: Graph Neural Networks for Representing Graphical User Interfaces [27.84098739594353]
Graph4GUI exploits graph neural networks to capture individual elements' properties and semantic-visuo-spatial constraints in a layout.
The learned representation demonstrated its effectiveness in multiple tasks, especially generating designs in a challenging GUI autocompletion task.
arXiv Detail & Related papers (2024-04-21T04:06:09Z)
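As referenced in the first related-paper entry above, the region-consistency idea (spatial voting over multiple sampled predictions) can be sketched roughly as follows. The grid resolution and the vote-per-covered-cell scheme are assumptions for illustration, not the GUI-RCPO authors' exact formulation.

```python
# Hedged sketch of a spatial voting grid over sampled box predictions,
# as summarized in the region-consistency entry above. The grid size and
# the one-vote-per-covered-cell scheme are illustrative assumptions.
import numpy as np

def consensus_cell(boxes: list[tuple[float, float, float, float]],
                   width: int, height: int, grid: int = 32) -> tuple[int, int]:
    """Return the (row, col) of the grid cell receiving the most votes."""
    votes = np.zeros((grid, grid), dtype=np.int32)
    for x0, y0, x1, y1 in boxes:
        # Map box corners to grid indices; every covered cell gets one vote.
        c0 = int(np.clip(x0 / width * grid, 0, grid - 1))
        c1 = int(np.clip(x1 / width * grid, 0, grid - 1))
        r0 = int(np.clip(y0 / height * grid, 0, grid - 1))
        r1 = int(np.clip(y1 / height * grid, 0, grid - 1))
        votes[r0:r1 + 1, c0:c1 + 1] += 1
    row, col = np.unravel_index(int(np.argmax(votes)), votes.shape)
    return int(row), int(col)
```

The highest-vote cell plays the role of the consensus region; in the paper this consistency signal is further turned into a reward for test-time reinforcement learning, which is beyond the scope of this sketch.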