Dual-View Visual Contextualization for Web Navigation
- URL: http://arxiv.org/abs/2402.04476v2
- Date: Sat, 30 Mar 2024 05:18:05 GMT
- Title: Dual-View Visual Contextualization for Web Navigation
- Authors: Jihyung Kil, Chan Hee Song, Boyuan Zheng, Xiang Deng, Yu Su, Wei-Lun Chao,
- Abstract summary: We propose to contextualize HTML elements through their "dual views" in webpage screenshots.
We build upon the insight -- web developers tend to arrange task-related elements nearby on webpages to enhance user experiences.
The resulting representations of HTML elements are more informative for the agent to take action.
- Score: 36.41910428196889
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic web navigation aims to build a web agent that can follow language instructions to execute complex and diverse tasks on real-world websites. Existing work primarily takes HTML documents as input, which define the contents and action spaces (i.e., actionable elements and operations) of webpages. Nevertheless, HTML documents may not provide a clear task-related context for each element, making it hard to select the right (sequence of) actions. In this paper, we propose to contextualize HTML elements through their "dual views" in webpage screenshots: each HTML element has its corresponding bounding box and visual content in the screenshot. We build upon the insight -- web developers tend to arrange task-related elements nearby on webpages to enhance user experiences -- and propose to contextualize each element with its neighbor elements, using both textual and visual features. The resulting representations of HTML elements are more informative for the agent to take action. We validate our method on the recently released Mind2Web dataset, which features diverse navigation domains and tasks on real-world websites. Our method consistently outperforms the baseline in all the scenarios, including cross-task, cross-website, and cross-domain ones.
Related papers
- WebRPG: Automatic Web Rendering Parameters Generation for Visual Presentation [24.99791278208309]
We introduce Web Rendering Parameters Generation (WebRPG), a new task that aims at automating the generation for visual presentation of web pages based on their HTML code.
We present baseline models, utilizing VAE to manage numerous elements and rendering parameters, along with custom HTML embedding for capturing essential semantic and hierarchical information from HTML.
arXiv Detail & Related papers (2024-07-22T09:35:43Z) - Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs [112.89665642941814]
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio.
Current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code.
We propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning.
arXiv Detail & Related papers (2024-06-28T17:59:46Z) - Towards Improving Document Understanding: An Exploration on
Text-Grounding via MLLMs [96.54224331778195]
We present a text-grounding document understanding model, termed TGDoc, which enhances MLLMs with the ability to discern the spatial positioning of text within images.
We formulate instruction tuning tasks including text detection, recognition, and spotting to facilitate the cohesive alignment between the visual encoder and large language model.
Our method achieves state-of-the-art performance across multiple text-rich benchmarks, validating the effectiveness of our method.
arXiv Detail & Related papers (2023-11-22T06:46:37Z) - A Suite of Generative Tasks for Multi-Level Multimodal Webpage
Understanding [66.6468787004067]
We introduce the Wikipedia Webpage suite (WikiWeb2M) containing 2M pages with all of the associated image, text, and structure data.
We design a novel attention mechanism Prefix Global, which selects the most relevant image and text content as global tokens to attend to the rest of the webpage for context.
arXiv Detail & Related papers (2023-05-05T16:38:05Z) - TRIE++: Towards End-to-End Information Extraction from Visually Rich
Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained in an end-to-end trainable manner, achieving global optimization.
arXiv Detail & Related papers (2022-07-14T08:52:07Z) - WebFormer: The Web-page Transformer for Structure Information Extraction [44.46531405460861]
Structure information extraction refers to the task of extracting structured text fields from web pages.
Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction.
We introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents.
arXiv Detail & Related papers (2022-02-01T04:44:02Z) - DOM-LM: Learning Generalizable Representations for HTML Documents [33.742833774918786]
We introduce a novel representation learning approach for web pages, dubbed DOM-LM, which addresses the limitations of existing approaches.
We evaluate DOM-LM on a variety of webpage understanding tasks, including Attribute Extraction, Open Information Extraction, and Question Answering.
arXiv Detail & Related papers (2022-01-25T20:10:32Z) - CoVA: Context-aware Visual Attention for Webpage Information Extraction [65.11609398029783]
We propose to reformulate WIE as a context-aware Webpage Object Detection task.
We develop a Context-aware Visual Attention-based (CoVA) detection pipeline which combines appearance features with syntactical structure from the DOM tree.
We show that the proposed CoVA approach is a new challenging baseline which improves upon prior state-of-the-art methods.
arXiv Detail & Related papers (2021-10-24T00:21:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.