DOM-LM: Learning Generalizable Representations for HTML Documents
- URL: http://arxiv.org/abs/2201.10608v1
- Date: Tue, 25 Jan 2022 20:10:32 GMT
- Title: DOM-LM: Learning Generalizable Representations for HTML Documents
- Authors: Xiang Deng, Prashant Shiralkar, Colin Lockard, Binxuan Huang, Huan Sun
- Abstract summary: We introduce a novel representation learning approach for web pages, dubbed DOM-LM, which addresses the limitations of existing approaches.
We evaluate DOM-LM on a variety of webpage understanding tasks, including Attribute Extraction, Open Information Extraction, and Question Answering.
- Score: 33.742833774918786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: HTML documents are an important medium for disseminating information on the
Web for human consumption. An HTML document presents information in multiple
text formats including unstructured text, structured key-value pairs, and
tables. Effective representation of these documents is essential for machine
understanding to enable a wide range of applications, such as Question
Answering, Web Search, and Personalization. Existing work has either
represented these documents using visual features extracted by rendering them
in a browser, which is typically computationally expensive, or has simply
treated them as plain text documents, thereby failing to capture useful
information presented in their HTML structure. We argue that the text and HTML
structure together convey important semantics of the content and therefore
warrant special treatment in representation learning. In this paper,
we introduce a novel representation learning approach for web pages, dubbed
DOM-LM, which addresses the limitations of existing approaches by encoding both
text and DOM tree structure with a transformer-based encoder and learning
generalizable representations for HTML documents via self-supervised
pre-training. We evaluate DOM-LM on a variety of webpage understanding tasks,
including Attribute Extraction, Open Information Extraction, and Question
Answering. Our extensive experiments show that DOM-LM consistently outperforms
all baselines designed for these tasks. In particular, DOM-LM demonstrates
better generalization performance both in few-shot and zero-shot settings,
making it attractive for real-world application settings with limited labeled
data.
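The abstract's core recipe is to feed text and DOM tree structure into a single transformer encoder. Below is a minimal sketch of that idea, not the authors' implementation: node text, HTML tag, and tree depth are embedded and summed per node, and the node sequence is then contextualized with self-attention. The hash-based tokenization and all hyperparameters are placeholders.

```python
# Minimal sketch of encoding text plus DOM structure (illustrative only,
# not the DOM-LM codebase).
import torch
import torch.nn as nn
from lxml import html

def dom_nodes(page: str):
    """Yield (text, tag, depth) for each non-empty text node in the DOM."""
    for el in html.fromstring(page).iter():
        if not isinstance(el.tag, str):          # skip comments etc.
            continue
        text = (el.text or "").strip()
        if text:
            yield text, el.tag, sum(1 for _ in el.iterancestors())

class DomEncoder(nn.Module):
    def __init__(self, vocab=30522, tags=128, max_depth=64, d=128):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, d)       # node text (toy: one id)
        self.tag_emb = nn.Embedding(tags, d)         # HTML tag of the node
        self.depth_emb = nn.Embedding(max_depth, d)  # depth in the DOM tree
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_ids, tag_ids, depth_ids):
        x = self.text_emb(text_ids) + self.tag_emb(tag_ids) + self.depth_emb(depth_ids)
        return self.encoder(x)                       # (batch, nodes, d)

page = "<html><body><h1>Phone</h1><div><span>Price</span><span>$99</span></div></body></html>"
nodes = list(dom_nodes(page))
batch = lambda xs: torch.tensor([xs])
vecs = DomEncoder()(
    batch([hash(t) % 30522 for t, _, _ in nodes]),   # toy tokenization
    batch([hash(g) % 128 for _, g, _ in nodes]),
    batch([min(d, 63) for _, _, d in nodes]),
)
print(vecs.shape)   # torch.Size([1, 3, 128])
```

Self-supervised pre-training in this spirit would mask some node inputs and train the encoder to reconstruct them from the surrounding text and structure.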
Related papers
- HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems [62.36019283532854]
Retrieval-Augmented Generation (RAG) has been shown to improve the knowledge capabilities of large language models.
HtmlRAG uses HTML instead of plain text as the format of retrieved knowledge in RAG.
We propose HTML cleaning, compression, and pruning strategies to shorten the HTML while minimizing the loss of information (a rough cleaning sketch follows below).
arXiv Detail & Related papers (2024-11-05T09:58:36Z)
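The HtmlRAG summary names three operations: cleaning, compression, and pruning. A rough sketch of the cleaning step, under my own assumptions rather than the paper's exact rules, might drop script/style content, comments, attributes, and empty elements with BeautifulSoup:

```python
# Illustrative HTML cleaning in the spirit of HtmlRAG (not the authors'
# code): drop non-content markup so the retrieved document is shorter
# while keeping its readable structure.
from bs4 import BeautifulSoup, Comment

def clean_html(raw: str) -> str:
    soup = BeautifulSoup(raw, "html.parser")
    # Elements that never contribute retrievable text.
    for tag in soup(["script", "style", "noscript", "iframe", "svg"]):
        tag.decompose()
    # HTML comments.
    for node in soup.find_all(string=lambda s: isinstance(s, Comment)):
        node.extract()
    # Attributes (classes, inline styles, tracking ids) mostly add length.
    for tag in soup.find_all(True):
        tag.attrs = {}
    # Prune elements left with no text at all.
    for tag in soup.find_all(True):
        if not tag.decomposed and not tag.get_text(strip=True):
            tag.decompose()
    return str(soup)

print(clean_html('<div class="x"><script>1</script><p>Keep me</p><p></p></div>'))
# <div><p>Keep me</p></div>
```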
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs [112.89665642941814]
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio.
Current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code.
We propose a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning.
arXiv Detail & Related papers (2024-06-28T17:59:46Z)
- Dual-View Visual Contextualization for Web Navigation [36.41910428196889]
We propose to contextualize HTML elements through their "dual views" in webpage screenshots.
We build on the insight that web developers tend to arrange task-related elements near each other on webpages to enhance user experience.
The resulting representations of HTML elements are more informative for the agent to take action (a toy cropping sketch follows below).
arXiv Detail & Related papers (2024-02-06T23:52:10Z)
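To make the Dual-View idea concrete: one simple reading is that each HTML element is paired with a screenshot crop that also covers its spatial neighbours. Everything below (boxes, sizes, padding) is invented for illustration; real geometry would come from a rendered page, e.g. via a browser driver.

```python
from PIL import Image

def dual_view_crop(screenshot: Image.Image, box, neighbours, pad=16):
    """Expand an element's bounding box to cover nearby elements, then crop."""
    l, t, r, b = box
    for nl, nt, nr, nb in neighbours:
        l, t, r, b = min(l, nl), min(t, nt), max(r, nr), max(b, nb)
    l, t = max(l - pad, 0), max(t - pad, 0)
    r, b = min(r + pad, screenshot.width), min(b + pad, screenshot.height)
    return screenshot.crop((l, t, r, b))

# Synthetic screenshot and made-up element geometry (left, top, right, bottom):
page = Image.new("RGB", (1280, 800), "white")
button = (100, 200, 220, 240)                           # target element
nearby = [(100, 160, 300, 190), (240, 200, 400, 240)]   # neighbouring elements
patch = dual_view_crop(page, button, nearby)
print(patch.size)   # (332, 112)
```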
- A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding [66.6468787004067]
We introduce the Wikipedia Webpage suite (WikiWeb2M) containing 2M pages with all of the associated image, text, and structure data.
We design a novel attention mechanism, Prefix Global, which selects the most relevant image and text content as global tokens that attend to the rest of the webpage for context (a toy attention mask is sketched below).
arXiv Detail & Related papers (2023-05-05T16:38:05Z)
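The Prefix Global mechanism designates selected content as global tokens. A hedged toy version of such an attention mask, with my own window size and layout rather than the paper's exact scheme: the first g positions attend everywhere and are visible to all positions, while the rest attend only within a local window plus the prefix.

```python
import numpy as np

def prefix_global_mask(n: int, g: int, window: int = 4) -> np.ndarray:
    """Boolean (n, n) mask; True means position i may attend to position j."""
    mask = np.zeros((n, n), dtype=bool)
    mask[:g, :] = True                 # global prefix attends everywhere
    mask[:, :g] = True                 # every position attends to the prefix
    for i in range(g, n):              # remaining tokens: sliding window
        mask[i, max(i - window, 0):min(i + window + 1, n)] = True
    return mask

print(prefix_global_mask(8, g=2, window=1).astype(int))
```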
- Understanding HTML with Large Language Models [73.92747433749271]
Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks.
We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks.
We show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks.
arXiv Detail & Related papers (2022-10-08T07:27:17Z)
- WebFormer: The Web-page Transformer for Structure Information Extraction [44.46531405460861]
Structure information extraction refers to the task of extracting structured text fields from web pages.
Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction.
We introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents (a generic tagging setup is sketched below).
arXiv Detail & Related papers (2022-02-01T04:44:02Z)
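Structure information extraction, as in the WebFormer entry, is commonly framed as token-level tagging: each token of the serialized page gets a field label. The sketch below shows only that generic task setup with a small encoder and a per-token classification head; it is not WebFormer's actual architecture, and the field inventory is made up.

```python
import torch
import torch.nn as nn

FIELDS = ["none", "title", "price"]          # hypothetical target fields

class FieldTagger(nn.Module):
    """Generic encoder + per-token field classifier (illustrative only)."""
    def __init__(self, vocab=30522, d=128, n_fields=len(FIELDS)):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, n_fields)   # logits per token per field

    def forward(self, token_ids):
        return self.head(self.enc(self.emb(token_ids)))

logits = FieldTagger()(torch.randint(0, 30522, (1, 32)))
print(logits.shape)   # torch.Size([1, 32, 3])
```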
- MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding [35.35388421383703]
Multimodal pre-training with text, layout, and image has made significant progress for Visually-rich Document Understanding (VrDU).
We propose MarkupLM for document understanding tasks with markup languages as the backbone.
Experimental results show that the pre-trained MarkupLM significantly outperforms existing strong baselines on several document understanding tasks (an XPath toy example follows below).
arXiv Detail & Related papers (2021-10-16T09:17:28Z)
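One concrete handle on "markup as backbone" is the XPath of each text node, which MarkupLM pairs with the text during pre-training; lxml can compute such paths directly. A toy example, not the MarkupLM codebase:

```python
# Print the XPath of every non-empty text node in a small document.
from lxml import html

page = html.fromstring(
    "<html><body><div><h1>DOM-LM</h1><p>HTML encoder</p></div></body></html>")
tree = page.getroottree()
for el in page.iter():
    text = (el.text or "").strip()
    if text:
        print(tree.getpath(el), "->", text)
# /html/body/div/h1 -> DOM-LM
# /html/body/div/p -> HTML encoder
```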
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)