Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning
- URL: http://arxiv.org/abs/2412.10840v1
- Date: Sat, 14 Dec 2024 14:30:05 GMT
- Title: Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning
- Authors: Hai-Ming Xu, Qi Chen, Lei Wang, Lingqiao Liu
- Abstract summary: We propose a tuning-free Attention-driven Grounding (TAG) method that leverages inherent attention patterns in pretrained MLLMs to accomplish this task without the need for additional fine-tuning. Our method achieves performance comparable to tuning-based methods, with notable success in text localization. We demonstrate that our attention map-based grounding technique significantly outperforms direct localization predictions from MiniCPM-Llama3-V 2.5.
- Score: 29.47233232259932
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have generated significant interest in their ability to autonomously interact with and interpret Graphical User Interfaces (GUIs). A major challenge in these systems is grounding: accurately identifying critical GUI components such as text or icons based on a GUI image and a corresponding text query. Traditionally, this task has relied on fine-tuning MLLMs with specialized training data to predict component locations directly. However, in this paper, we propose a novel Tuning-free Attention-driven Grounding (TAG) method that leverages the inherent attention patterns in pretrained MLLMs to accomplish this task without the need for additional fine-tuning. Our method involves identifying and aggregating attention maps from specific tokens within a carefully constructed query prompt. Applied to MiniCPM-Llama3-V 2.5, a state-of-the-art MLLM, our tuning-free approach achieves performance comparable to tuning-based methods, with notable success in text localization. Additionally, we demonstrate that our attention map-based grounding technique significantly outperforms direct localization predictions from MiniCPM-Llama3-V 2.5, highlighting the potential of using attention maps from pretrained MLLMs and paving the way for future innovations in this domain.
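The core idea of TAG is to read the grounding result directly out of the model's attention weights rather than out of generated coordinates. As a rough illustration only, the sketch below shows one way such attention-map aggregation could look in PyTorch: attention from a few selected prompt tokens is averaged over layers and heads, reshaped onto the vision encoder's patch grid, upsampled to image resolution, and its peak taken as the predicted location. The function name, tensor shapes, and aggregation scheme are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def attention_grounding(attn_maps, query_token_ids, grid_hw, image_hw):
    """Aggregate attention from selected query-prompt tokens onto image patches
    and return a predicted (x, y) point in image coordinates.

    attn_maps: (layers, heads, num_text_tokens, num_patches) attention weights
               from text tokens to image-patch tokens (hypothetical layout).
    query_token_ids: indices of the prompt tokens whose attention is aggregated.
    grid_hw: (grid_h, grid_w) of the vision encoder's patch grid.
    image_hw: (img_h, img_w) of the original screenshot.
    """
    grid_h, grid_w = grid_hw
    img_h, img_w = image_hw

    # Select the attention rows of the chosen prompt tokens and average over
    # layers, heads, and tokens -> one score per image patch.
    selected = attn_maps[:, :, query_token_ids, :]        # (L, H, T, P)
    patch_scores = selected.mean(dim=(0, 1, 2))           # (P,)

    # Reshape to the patch grid and upsample to image resolution.
    heat = patch_scores.reshape(1, 1, grid_h, grid_w)
    heat = F.interpolate(heat, size=(img_h, img_w),
                         mode="bilinear", align_corners=False)[0, 0]

    # The predicted location is the peak of the aggregated attention map.
    flat_idx = heat.argmax().item()
    y, x = divmod(flat_idx, img_w)
    return x, y

if __name__ == "__main__":
    # Synthetic example: 32 layers, 8 heads, 12 prompt tokens, 24x24 patch grid.
    layers, heads, tokens, grid = 32, 8, 12, (24, 24)
    attn = torch.rand(layers, heads, tokens, grid[0] * grid[1]).softmax(dim=-1)
    print(attention_grounding(attn, query_token_ids=[5, 6, 7],
                              grid_hw=grid, image_hw=(1080, 1920)))
```

In practice, the attention tensors would come from a forward pass of the pretrained MLLM (e.g., MiniCPM-Llama3-V 2.5 with attention outputs enabled), and the choice of which prompt tokens and layers to aggregate is where methods of this kind differ.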
Related papers
- Trifuse: Enhancing Attention-Based GUI Grounding via Multimodal Fusion [20.165689356521295]
Existing approaches rely on fine-tuning multimodal large language models to predict target element coordinates. Recent attention-based alternatives exploit localization signals in MLLMs' attention mechanisms without task-specific fine-tuning. We propose Trifuse, an attention-based grounding framework that explicitly integrates complementary spatial anchors.
arXiv Detail & Related papers (2026-02-06T03:27:55Z) - GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding [44.598660921968595]
We propose an attention-based and coordinate-free supervised fine-tuning framework for efficient GUI grounding. GUI-AIMA aligns the intrinsic multimodal attention of MLLMs with patch-wise grounding signals. It achieves state-of-the-art performance among 3B models, attaining an average accuracy of 59.6% on ScreenSpot-Pro, 63.8% on OSWorld-G and 91.5% on ScreenSpot-v2.
arXiv Detail & Related papers (2025-11-02T05:34:21Z) - How Auxiliary Reasoning Unleashes GUI Grounding in VLMs [16.798199078199154]
General vision-language models (VLMs) struggle with this task due to a lack of specific optimization. We propose three zero-shot auxiliary reasoning methods to address this discrepancy. We evaluate these methods on four GUI grounding benchmarks across seven open-source and proprietary VLMs.
arXiv Detail & Related papers (2025-09-15T03:28:29Z) - Structuring GUI Elements through Vision Language Models: Towards Action Space Generation [43.932266242034025]
Multimodal large language models (MLLMs) have emerged as pivotal tools in enhancing human-computer interaction. This paper focuses on applying MLLMs to the structuring of graphical user interface (GUI) elements. We introduce an IoU-Augmented Maximum Likelihood (IAML) training paradigm to bolster visual module capabilities.
arXiv Detail & Related papers (2025-08-22T10:14:15Z) - Zoomer: Adaptive Image Focus Optimization for Black-box MLLM [45.40963536739482]
Zoomer is a novel visual prompting mechanism designed to enhance MLLM performance while preserving essential visual details within token limits. It consistently outperforms baseline methods, achieving up to a 26.9% improvement in accuracy while significantly reducing token consumption.
arXiv Detail & Related papers (2025-04-30T02:51:10Z) - Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model [63.14883657299359]
Multi-modal Large Language Models (MLLMs) integrate visual and linguistic reasoning to address complex tasks such as image captioning and visual question answering.
Tuning MLLMs for downstream tasks encounters two key challenges: Task-Expert, where distribution shifts between pre-training and target datasets constrain target performance, and OpenWorld Stabilization, where catastrophic forgetting erases the model's general knowledge.
arXiv Detail & Related papers (2025-03-06T15:29:13Z) - Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs [7.03771340666549]
Vision-language misalignment in Multimodal Large Language Models (MLLMs) is a critical challenge.
We propose MapleLeaf AKI, a novel MLLM that unlocks causal attention into modality-mutual attention (MMA) to enable image tokens to attend to text tokens.
Our MMA design is intended to be generic, allowing for application across various modalities, and scalable to accommodate diverse multimodal scenarios.
arXiv Detail & Related papers (2025-03-04T13:18:33Z) - Enhance Graph Alignment for Large Language Models [33.96082485852042]
Graph-to-token approaches are popular in enabling Large Language Models to process graph information.
Existing methods have a misalignment between self-supervised tasks and supervised downstream tasks.
We propose Graph Alignment Large Language Models (GALLM) to benefit from aligned task templates.
arXiv Detail & Related papers (2024-10-15T07:50:34Z) - TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning [54.033346088090674]
We introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability.
To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT.
This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process.
arXiv Detail & Related papers (2024-10-14T13:35:47Z) - EAGLE: Towards Efficient Arbitrary Referring Visual Prompts Comprehension for Multimodal Large Language Models [80.00303150568696]
We propose a novel Multimodal Large Language Model (MLLM) that empowers comprehension of arbitrary referring visual prompts with less training effort than existing approaches.
Our approach embeds referring visual prompts as spatial concepts conveying specific spatial areas comprehensible to the MLLM.
We also propose a Geometry-Agnostic Learning paradigm (GAL) to further disentangle the MLLM's region-level comprehension with the specific formats of referring visual prompts.
arXiv Detail & Related papers (2024-09-25T08:22:00Z) - Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model [51.83436609094658]
We introduce Coarse Correspondences, a simple lightweight method that enhances MLLMs' spatial-temporal reasoning with 2D images as input.
Our method uses a lightweight tracking model to identify primary object correspondences between frames in a video or across different image viewpoints.
We demonstrate that this simple training-free approach brings substantial gains to GPT4-V/O consistently on four benchmarks.
arXiv Detail & Related papers (2024-08-01T17:57:12Z) - ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models [73.34709921061928]
We propose a training-free method to inject visual referring into Multimodal Large Language Models (MLLMs).
We observe the relationship between text prompt tokens and visual tokens in MLLMs, where attention layers model the connection between them.
We optimize a learnable visual token based on an energy function, enhancing the strength of referential regions in the attention map.
arXiv Detail & Related papers (2024-07-31T11:40:29Z) - On Pre-training of Multimodal Language Models Customized for Chart Understanding [83.99377088129282]
This paper explores the training processes necessary to improve MLLMs' comprehension of charts.
We introduce CHOPINLLM, an MLLM tailored for in-depth chart comprehension.
arXiv Detail & Related papers (2024-07-19T17:58:36Z) - LAMM: Label Alignment for Multi-Modal Prompt Learning [17.478967970736115]
We introduce an innovative label alignment method named LAMM, which can adjust the category embeddings of downstream datasets through end-to-end training.
Our method significantly improves the performance of existing multi-modal prompt learning models in few-shot scenarios.
Our methodology outperforms other prompt tuning methods in continual learning.
arXiv Detail & Related papers (2023-12-13T15:29:52Z) - u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model [17.3535277338312]
u-LLaVA is an innovative unifying multi-task framework that integrates pixel, regional, and global features to refine the perceptual faculties of MLLMs.
This work contributes a novel mask-based multi-task dataset comprising 277K samples, crafted to challenge and assess the fine-grained perception capabilities of MLLMs.
arXiv Detail & Related papers (2023-11-09T13:18:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.