Trifuse: Enhancing Attention-Based GUI Grounding via Multimodal Fusion
- URL: http://arxiv.org/abs/2602.06351v1
- Date: Fri, 06 Feb 2026 03:27:55 GMT
- Title: Trifuse: Enhancing Attention-Based GUI Grounding via Multimodal Fusion
- Authors: Longhui Ma, Di Zhao, Siwei Wang, Zhao Lv, Miao Wang,
- Abstract summary: Existing approaches rely on fine-tuning multimodal large language models to predict target element coordinates.<n>Recent attention-based alternatives exploit localization signals in MLLMs attention mechanisms without task-specific fine-tuning.<n>We propose Trifuse, an attention-based grounding framework that explicitly integrates complementary spatial anchors.
- Score: 20.165689356521295
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: GUI grounding maps natural language instructions to the correct interface elements, serving as the perception foundation for GUI agents. Existing approaches predominantly rely on fine-tuning multimodal large language models (MLLMs) using large-scale GUI datasets to predict target element coordinates, which is data-intensive and generalizes poorly to unseen interfaces. Recent attention-based alternatives exploit localization signals in MLLMs attention mechanisms without task-specific fine-tuning, but suffer from low reliability due to the lack of explicit and complementary spatial anchors in GUI images. To address this limitation, we propose Trifuse, an attention-based grounding framework that explicitly integrates complementary spatial anchors. Trifuse integrates attention, OCR-derived textual cues, and icon-level caption semantics via a Consensus-SinglePeak (CS) fusion strategy that enforces cross-modal agreement while retaining sharp localization peaks. Extensive evaluations on four grounding benchmarks demonstrate that Trifuse achieves strong performance without task-specific fine-tuning, substantially reducing the reliance on expensive annotated data. Moreover, ablation studies reveal that incorporating OCR and caption cues consistently improves attention-based grounding performance across different backbones, highlighting its effectiveness as a general framework for GUI grounding.
Related papers
- LoGoSeg: Integrating Local and Global Features for Open-Vocabulary Semantic Segmentation [12.192429756057132]
Open-vocabulary semantic segmentation (OVSS) extends traditional closed-set segmentation by enabling pixel-wise annotation for both seen and unseen categories.<n>LoGoSeg integrates three key innovations: (i) an object existence prior that dynamically weights relevant categories through global image-text similarity, effectively reducing hallucinations; (ii) a region-aware alignment module that establishes precise region-level visual-textual correspondences; and (iii) a dual-stream fusion mechanism that optimally combines local structural information with global semantic context.
arXiv Detail & Related papers (2026-02-05T12:03:11Z) - RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning [61.84363374647606]
Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions.<n>These descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning.<n>We propose a reasoning-guided, position-aware post-training framework, dubbed textbfRSGround-R1, to progressively enhance spatial understanding.
arXiv Detail & Related papers (2026-01-29T12:35:57Z) - LOC: A General Language-Guided Framework for Open-Set 3D Occupancy Prediction [9.311605679381529]
We propose LOC, a general language-guided framework adaptable to various occupancy networks.<n>For self-supervised tasks, we employ a strategy that fuses multi-frame LiDAR points for dynamic/static scenes, using Poisson reconstruction to fill voids, and assigning semantics to voxels via K-Nearest Neighbor (KNN)<n>Our model predicts dense voxel features embedded in the CLIP feature space, integrating textual and image pixel information, and classifies based on text and semantic similarity.
arXiv Detail & Related papers (2025-10-25T03:27:19Z) - Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception [71.26728044621458]
DeCLIP is a novel framework that enhances CLIP by decoupling the self-attention module to obtain content'' and context'' features respectively.<n>It consistently achieves state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
arXiv Detail & Related papers (2025-08-15T06:43:51Z) - DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning [53.42606072841585]
We introduce DiMo-GUI, a training-free framework for GUI grounding.<n>Instead of treating the GUI as a monolithic image, our method splits the input into textual elements and iconic elements.<n>When predictions are ambiguous or incorrect, DiMo-GUI dynamically focuses attention by generating candidate focal regions.
arXiv Detail & Related papers (2025-06-12T03:13:21Z) - EgoSplat: Open-Vocabulary Egocentric Scene Understanding with Language Embedded 3D Gaussian Splatting [108.15136508964011]
EgoSplat is a language-embedded 3D Gaussian Splatting framework for open-vocabulary egocentric scene understanding.<n>EgoSplat achieves state-of-the-art performance in both localization and segmentation tasks on two datasets.
arXiv Detail & Related papers (2025-03-14T12:21:26Z) - Point Cloud Understanding via Attention-Driven Contrastive Learning [64.65145700121442]
Transformer-based models have advanced point cloud understanding by leveraging self-attention mechanisms.
PointACL is an attention-driven contrastive learning framework designed to address these limitations.
Our method employs an attention-driven dynamic masking strategy that guides the model to focus on under-attended regions.
arXiv Detail & Related papers (2024-11-22T05:41:00Z) - Localization, balance and affinity: a stronger multifaceted collaborative salient object detector in remote sensing images [24.06927394483275]
We propose a stronger multifaceted collaborative salient object detector in ORSIs, termed LBA-MCNet.
The network focuses on accurately locating targets, balancing detailed features, and modeling image-level global context information.
arXiv Detail & Related papers (2024-10-31T14:50:48Z) - Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [50.433911327489554]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression.<n>To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM)<n>To further forster the research of RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z) - SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph
Attention [19.23636231942245]
We propose a semantic-enhanced relational learning model based on a graph network with our designed memory graph attention layer.
Our method replaces original language-independent encoding with cross-modal encoding in visual analysis.
Experimental results on ReferIt3D and ScanRefer benchmarks show that the proposed method outperforms the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-13T02:11:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.