OmniParser for Pure Vision Based GUI Agent
- URL: http://arxiv.org/abs/2408.00203v1
- Date: Thu, 1 Aug 2024 00:00:43 GMT
- Title: OmniParser for Pure Vision Based GUI Agent
- Authors: Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah
- Abstract summary: The power of multimodal models like GPT-4V as a general agent on multiple operating systems is largely underestimated due to the lack of a robust screen parsing technique.
OmniParser significantly improves GPT-4V's performance on the ScreenSpot benchmark.
OmniParser with screenshot-only input outperforms GPT-4V baselines that require additional information beyond the screenshot.
- Score: 37.911094082816504
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent success of large vision-language models shows great potential in driving agent systems that operate on user interfaces. However, we argue that the power of multimodal models like GPT-4V as a general agent across multiple operating systems and applications is largely underestimated due to the lack of a robust screen parsing technique capable of: 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of the various elements in a screenshot and accurately associating the intended action with the corresponding region on the screen. To fill these gaps, we introduce OmniParser, a comprehensive method for parsing user interface screenshots into structured elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface. We first curated an interactable icon detection dataset from popular webpages, along with an icon description dataset. These datasets were used to fine-tune two specialized models: a detection model that parses interactable regions on the screen and a caption model that extracts the functional semantics of the detected elements. OmniParser significantly improves GPT-4V's performance on the ScreenSpot benchmark. On the Mind2Web and AITW benchmarks, OmniParser with screenshot-only input outperforms GPT-4V baselines that require additional information beyond the screenshot.
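To make the two-stage parsing described in the abstract concrete, the sketch below chains a fine-tuned interactable-region detector and an icon caption model into a structured element list that is then serialized for a multimodal model such as GPT-4V. The `detector.detect` and `captioner.describe` interfaces, the `UIElement` container, and the prompt format are illustrative assumptions, not the released OmniParser implementation.

```python
# Minimal sketch of an OmniParser-style screen parsing pipeline.
# All model interfaces here are assumed for illustration.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class UIElement:
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels
    caption: str                    # functional description of the element


def parse_screenshot(image, detector, captioner) -> List[UIElement]:
    """Detect interactable regions, then caption each cropped region.

    `detector` and `captioner` stand in for the fine-tuned detection and
    icon-description models mentioned in the abstract (assumed interfaces).
    """
    elements = []
    for box in detector.detect(image):    # assumed: returns (x1, y1, x2, y2) boxes
        crop = image.crop(box)            # PIL-style crop of the detected region
        elements.append(UIElement(box=box, caption=captioner.describe(crop)))
    return elements


def to_prompt(elements: List[UIElement]) -> str:
    """Serialize parsed elements so a multimodal model can ground its actions."""
    lines = [f"[{i}] box={e.box} desc={e.caption}" for i, e in enumerate(elements)]
    return "Interactable elements:\n" + "\n".join(lines)
```

A downstream agent would pass the serialized element list, together with the screenshot, to GPT-4V and ask it to choose an element index and an action, which is what lets the predicted action be grounded to a concrete screen region.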
Related papers
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that perceives only screenshots as input and performs human-like interactions.
In the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively)
arXiv Detail & Related papers (2025-01-21T17:48:10Z)
- Ponder & Press: Advancing Visual GUI Agent towards General Computer Control [13.39115823642937]
Ponder & Press is a divide-and-conquer framework for general computer control using only visual input.
Our agent offers a versatile, human-like interaction paradigm applicable to a wide range of applications.
arXiv Detail & Related papers (2024-12-02T08:35:31Z)
- ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop ShowUI, a vision-language-action model for the digital world, which features several innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z)
- Grounded GUI Understanding for Vision Based Spatial Intelligent Agent: Exemplified by Virtual Reality Apps [41.601579396549404]
We propose the first zero-shot cOntext-sensitive inteRactable GUI ElemeNT dEtection framework for virtual Reality apps, named Orienter.
By imitating human behaviors, Orienter observes and understands the semantic contexts of VR app scenes first, before performing the detection.
arXiv Detail & Related papers (2024-09-17T00:58:00Z)
- Multi-Granularity Language-Guided Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity.
At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions.
Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
arXiv Detail & Related papers (2024-06-07T11:18:40Z)
- LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Task Automation [8.998467488526327]
This paper presents LlamaTouch, a testbed for on-device mobile UI task execution and faithful, scalable task evaluation.
LlamaTouch employs a novel evaluation approach that only assesses whether an agent traverses all manually annotated, essential application/system states.
LlamaTouch also enables easy task annotation and integration of new mobile agents.
arXiv Detail & Related papers (2024-04-12T15:39:09Z)
- ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models [92.60282074937305]
We introduce ConTextual, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images.
We conduct experiments to assess the performance of 14 foundation models and establish a human performance baseline.
We observe a significant performance gap of 30.8% between GPT-4V and human performance.
arXiv Detail & Related papers (2024-01-24T09:07:11Z)
- Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V [103.68138147783614]
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models.
We employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions, and overlay these regions with a set of marks.
Using the marked image as input, GPT-4V can answer questions that require visual grounding (a minimal sketch of the marking step follows this list).
arXiv Detail & Related papers (2023-10-17T17:51:31Z)
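The Set-of-Mark entry above describes partitioning an image into regions and overlaying numeric marks before querying GPT-4V; the sketch referenced in that entry follows. The region format (a list of dicts with a "bbox" key, e.g. derived from SEEM/SAM masks) is an assumption for illustration, not the released SoM code.

```python
# Minimal sketch of Set-of-Mark style image marking (assumed region format).
from PIL import Image, ImageDraw


def overlay_marks(image: Image.Image, regions) -> Image.Image:
    """Draw an outline and a numbered mark for each segmented region.

    `regions` is assumed to be a list of dicts with a "bbox" key holding
    (x1, y1, x2, y2) pixel boxes, e.g. computed from segmentation masks.
    """
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for i, region in enumerate(regions, start=1):
        x1, y1, x2, y2 = region["bbox"]
        draw.rectangle([x1, y1, x2, y2], outline="red", width=2)
        draw.text(((x1 + x2) // 2, (y1 + y2) // 2), str(i), fill="red")
    return marked
```

The marked image, plus a question phrased in terms of the mark numbers, is then sent to GPT-4V, which can answer by referring to marks instead of raw pixel coordinates.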