Dual-View Visual Contextualization for Web Navigation
- URL: http://arxiv.org/abs/2402.04476v2
- Date: Sat, 30 Mar 2024 05:18:05 GMT
- Title: Dual-View Visual Contextualization for Web Navigation
- Authors: Jihyung Kil, Chan Hee Song, Boyuan Zheng, Xiang Deng, Yu Su, Wei-Lun Chao
- Abstract summary: We propose to contextualize HTML elements through their "dual views" in webpage screenshots.
We build upon the insight that web developers tend to arrange task-related elements nearby on webpages to enhance user experiences.
The resulting representations of HTML elements are more informative for the agent to take action.
- Score: 36.41910428196889
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic web navigation aims to build a web agent that can follow language instructions to execute complex and diverse tasks on real-world websites. Existing work primarily takes HTML documents as input, which define the contents and action spaces (i.e., actionable elements and operations) of webpages. Nevertheless, HTML documents may not provide a clear task-related context for each element, making it hard to select the right (sequence of) actions. In this paper, we propose to contextualize HTML elements through their "dual views" in webpage screenshots: each HTML element has its corresponding bounding box and visual content in the screenshot. We build upon the insight -- web developers tend to arrange task-related elements nearby on webpages to enhance user experiences -- and propose to contextualize each element with its neighbor elements, using both textual and visual features. The resulting representations of HTML elements are more informative for the agent to take action. We validate our method on the recently released Mind2Web dataset, which features diverse navigation domains and tasks on real-world websites. Our method consistently outperforms the baseline in all the scenarios, including cross-task, cross-website, and cross-domain ones.
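To make the dual-view idea concrete, here is a minimal, illustrative Python sketch, not the authors' implementation: each element carries a text feature and a visual feature cropped from its screenshot bounding box, and is contextualized by its spatially nearest neighbors. All names (`Element`, `contextualize`, `k`) and the naive average-and-concatenate fusion are assumptions standing in for the learned fusion described in the paper.

```python
# Toy sketch of dual-view contextualization (hypothetical code, not the
# paper's): fuse an element's own text/visual features with those of its
# spatially nearest neighbors in the screenshot.
from dataclasses import dataclass
import math

@dataclass
class Element:
    text_feat: list[float]  # e.g., from a text encoder over the HTML snippet
    vis_feat: list[float]   # e.g., from an image encoder over the box crop
    box: tuple[float, float, float, float]  # (x0, y0, x1, y1) in the screenshot

def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def contextualize(target: Element, elements: list[Element], k: int = 4):
    """Fuse the target's dual-view features with its k nearest neighbors'."""
    neighbors = sorted(
        (e for e in elements if e is not target),
        key=lambda e: math.dist(center(target.box), center(e.box)),
    )[:k]
    # Simplest possible fusion: average the neighbors' features and
    # concatenate them with the target's own (the paper learns this instead).
    def avg(feats):
        return [sum(col) / len(col) for col in zip(*feats)]
    return (target.text_feat + target.vis_feat
            + avg([e.text_feat for e in neighbors])
            + avg([e.vis_feat for e in neighbors]))
```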
Related papers
- Infogent: An Agent-Based Framework for Web Information Aggregation [59.67710556177564]
We introduce Infogent, a novel framework for web information aggregation.
Experiments on different information access settings demonstrate that Infogent beats an existing SOTA multi-agent search framework by 7%.
arXiv Detail & Related papers (2024-10-24T18:01:28Z)
- Harnessing Webpage UIs for Text-Rich Visual Understanding [112.01029887404296]
We propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs)
These instructions are then paired with UI screenshots to train multimodal models.
We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts.
arXiv Detail & Related papers (2024-10-17T17:48:54Z)
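As a rough sketch of the kind of data-synthesis pipeline this entry describes: a text-only LLM reads the page's textual representation and writes instruction/answer pairs, which are then attached to the page screenshot. Here `call_text_llm`, the prompt, and the output format are all illustrative assumptions, not the actual MultiUI pipeline.

```python
# Schematic sketch of synthesizing multimodal instruction data from a
# webpage UI (hypothetical helper names; not the MultiUI code).
import json

def call_text_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to any text-based LLM and return its reply."""
    raise NotImplementedError

def synthesize_samples(ui_text: str, screenshot_path: str) -> list[dict]:
    prompt = (
        "Below is the accessibility-tree text of a webpage.\n"
        f"{ui_text}\n"
        "Write 3 instruction/answer pairs a user might ask about this page, "
        'as a JSON list of {"instruction": ..., "answer": ...} objects.'
    )
    pairs = json.loads(call_text_llm(prompt))
    # Each text pair is attached to the screenshot so the multimodal model
    # learns to answer from pixels rather than from the DOM text.
    return [{"image": screenshot_path, **p} for p in pairs]
```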
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs [112.89665642941814]
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio.
However, current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code.
We propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning.
arXiv Detail & Related papers (2024-06-28T17:59:46Z)
- A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding [66.6468787004067]
We introduce the Wikipedia Webpage suite (WikiWeb2M) containing 2M pages with all of the associated image, text, and structure data.
We design a novel attention mechanism Prefix Global, which selects the most relevant image and text content as global tokens to attend to the rest of the webpage for context.
arXiv Detail & Related papers (2023-05-05T16:38:05Z)
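The "global tokens attend to the rest of the webpage" idea can be pictured as an attention mask. The sketch below follows the generic prefix-global recipe (selected tokens moved to the front attend everywhere; the rest see only the globals plus a local window); Prefix Global's actual token selection and masking details are those in the paper.

```python
# Rough numpy sketch of a prefix-global attention mask (illustrative only).
import numpy as np

def prefix_global_mask(seq_len: int, n_global: int, window: int) -> np.ndarray:
    """Return a (seq_len, seq_len) boolean mask; True = query may attend to key."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    mask[:n_global, :] = True           # global tokens attend to everything
    mask[:, :n_global] = True           # every token attends to the globals
    for i in range(n_global, seq_len):  # remaining tokens: sliding window
        lo = max(n_global, i - window)
        hi = min(seq_len, i + window + 1)
        mask[i, lo:hi] = True
    return mask

# e.g., prefix_global_mask(seq_len=512, n_global=64, window=16)
```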
- WebFormer: The Web-page Transformer for Structure Information Extraction [44.46531405460861]
Structure information extraction refers to the task of extracting structured text fields from web pages.
Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction.
We introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents.
arXiv Detail & Related papers (2022-02-01T04:44:02Z)
- DOM-LM: Learning Generalizable Representations for HTML Documents [33.742833774918786]
We introduce a novel representation learning approach for web pages, dubbed DOM-LM, which addresses the limitations of existing approaches.
We evaluate DOM-LM on a variety of webpage understanding tasks, including Attribute Extraction, Open Information Extraction, and Question Answering.
arXiv Detail & Related papers (2022-01-25T20:10:32Z)
- CoVA: Context-aware Visual Attention for Webpage Information Extraction [65.11609398029783]
We propose to reformulate webpage information extraction (WIE) as a context-aware Webpage Object Detection task.
We develop a Context-aware Visual Attention-based (CoVA) detection pipeline which combines appearance features with syntactical structure from the DOM tree.
We show that the proposed CoVA approach establishes a new, challenging baseline that improves upon prior state-of-the-art methods.
arXiv Detail & Related papers (2021-10-24T00:21:46Z)
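A toy sketch of the fusion CoVA describes: each candidate element gets an appearance feature from its screenshot crop plus a feature derived from its DOM-tree position, and a simple dot-product attention over the other elements supplies visual context. The feature extractors are assumed to exist elsewhere, and everything here is an illustrative assumption rather than CoVA's actual detection pipeline.

```python
# Illustrative fusion of appearance and DOM features per element
# (hypothetical; not CoVA's architecture).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def contextual_repr(vis, dom, query_idx):
    """vis, dom: (n_elements, d) arrays of appearance / DOM-tree features."""
    feats = np.concatenate([vis, dom], axis=1)  # fuse the two views per element
    q = feats[query_idx]
    attn = softmax(feats @ q)                   # attend over all elements
    context = attn @ feats                      # context-aware summary vector
    return np.concatenate([q, context])         # input to a downstream head
```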
This list is automatically generated from the titles and abstracts of the papers on this site.