EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data
- URL: http://arxiv.org/abs/2410.19461v2
- Date: Sat, 02 Nov 2024 08:54:21 GMT
- Title: EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data
- Authors: Xuetian Chen, Hangcheng Li, Jiaqing Liang, Sihang Jiang, Deqing Yang,
- Abstract summary: This paper aims to enhance the GUI understanding and interacting capabilities of large vision-language models (LVLMs) through a data-driven approach.
We propose EDGE, a general data synthesis framework that automatically generates large-scale, multi-granularity training data from webpages across the Web.
Our approach significantly reduces the dependence on manual annotations, empowering researchers to harness the vast public resources available on the Web to advance their work.
- Score: 15.801018643716437
- License:
- Abstract: Autonomous agents operating on the graphical user interfaces (GUIs) of various applications hold immense practical value. Unlike the large language model (LLM)-based methods which rely on structured texts and customized backends, the approaches using large vision-language models (LVLMs) are more intuitive and adaptable as they can visually perceive and directly interact with screens, making them indispensable in general scenarios without text metadata and tailored backends. Given the lack of high-quality training data for GUI-related tasks in existing work, this paper aims to enhance the GUI understanding and interacting capabilities of LVLMs through a data-driven approach. We propose EDGE, a general data synthesis framework that automatically generates large-scale, multi-granularity training data from webpages across the Web. Evaluation results on various GUI and agent benchmarks demonstrate that the model trained with the dataset generated through EDGE exhibits superior webpage understanding capabilities, which can then be easily transferred to previously unseen desktop and mobile environments. Our approach significantly reduces the dependence on manual annotations, empowering researchers to harness the vast public resources available on the Web to advance their work. Our source code, the dataset and the model are available at https://anonymous.4open.science/r/EDGE-1CDB.
Related papers
- Graph Learning in the Era of LLMs: A Survey from the Perspective of Data, Models, and Tasks [25.720233631885726]
integration of Graph Neural Networks (GNNs) and Large Language Models (LLMs) has emerged as a promising technological paradigm.
We leverage graph description texts with rich semantic context to fundamentally enhance Data quality.
This work serves as a foundational reference for researchers and practitioners looking to advance graph learning methodologies.
arXiv Detail & Related papers (2024-12-17T01:41:17Z) - Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining [67.87810796668981]
Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL)
Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations.
These improvements translate to significant gains in both web and OS agent downstream tasks.
arXiv Detail & Related papers (2024-12-13T18:40:10Z) - Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction [69.57190742976091]
We introduce Aguvis, a unified vision-based framework for autonomous GUI agents.
Our approach leverages image-based observations, and grounding instructions in natural language to visual elements.
To address the limitations of previous work, we integrate explicit planning and reasoning within the model.
arXiv Detail & Related papers (2024-12-05T18:58:26Z) - Large Language Model-Brained GUI Agents: A Survey [42.82362907348966]
multimodal models have ushered in a new era of GUI automation.
They have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing.
These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands.
arXiv Detail & Related papers (2024-11-27T12:13:39Z) - ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop a vision-language-action model in digital world, namely ShowUI, which features the following innovations.
ShowUI, a lightweight 2B model using 256K data, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z) - Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.
Our findings are synthesized in Flex (Fly-lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.
We demonstrate the effectiveness of this approach on quadrotor fly-to-target tasks, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z) - Exploring Large Language Model for Graph Data Understanding in Online
Job Recommendations [63.19448893196642]
We present a novel framework that harnesses the rich contextual information and semantic representations provided by large language models to analyze behavior graphs.
By leveraging this capability, our framework enables personalized and accurate job recommendations for individual users.
arXiv Detail & Related papers (2023-07-10T11:29:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.