DOM-LM: Learning Generalizable Representations for HTML Documents
- URL: http://arxiv.org/abs/2201.10608v1
- Date: Tue, 25 Jan 2022 20:10:32 GMT
- Title: DOM-LM: Learning Generalizable Representations for HTML Documents
- Authors: Xiang Deng, Prashant Shiralkar, Colin Lockard, Binxuan Huang, Huan Sun
- Abstract summary: We introduce a novel representation learning approach for web pages, dubbed DOM-LM, which addresses the limitations of existing approaches.
We evaluate DOM-LM on a variety of webpage understanding tasks, including Attribute Extraction, Open Information Extraction, and Question Answering.
- Score: 33.742833774918786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: HTML documents are an important medium for disseminating information on the
Web for human consumption. An HTML document presents information in multiple
text formats including unstructured text, structured key-value pairs, and
tables. Effective representation of these documents is essential for machine
understanding to enable a wide range of applications, such as Question
Answering, Web Search, and Personalization. Existing work has either
represented these documents using visual features extracted by rendering them
in a browser, which is typically computationally expensive, or has simply
treated them as plain text documents, thereby failing to capture useful
information presented in their HTML structure. We argue that the text and HTML
structure together convey important semantics of the content and therefore
warrant a special treatment for their representation learning. In this paper,
we introduce a novel representation learning approach for web pages, dubbed
DOM-LM, which addresses the limitations of existing approaches by encoding both
text and DOM tree structure with a transformer-based encoder and learning
generalizable representations for HTML documents via self-supervised
pre-training. We evaluate DOM-LM on a variety of webpage understanding tasks,
including Attribute Extraction, Open Information Extraction, and Question
Answering. Our extensive experiments show that DOM-LM consistently outperforms
all baselines designed for these tasks. In particular, DOM-LM demonstrates
better generalization performance both in few-shot and zero-shot settings,
making it suitable for real-world application settings with limited labeled
data.
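The abstract describes encoding both the text and the DOM tree structure of a page. As an illustrative sketch only (not DOM-LM's actual code), the joint input such an encoder consumes can be thought of as (text node, DOM path) pairs, which a stdlib parser can extract:

```python
# Illustrative sketch (not DOM-LM's implementation): pair each text node in an
# HTML document with its DOM path, the kind of joint text+structure feature a
# DOM-aware transformer encoder could take as input.
from html.parser import HTMLParser

class DOMFeatureExtractor(HTMLParser):
    """Collects (text, dom_path) pairs while parsing HTML."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, root to current node
        self.features = []  # (text, "html/body/...") pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # pop up to and including the matching open tag
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.features.append((text, "/".join(self.stack)))

extractor = DOMFeatureExtractor()
extractor.feed(
    "<html><body><h1>DOM-LM</h1>"
    "<p>Encodes <b>text</b> and structure.</p></body></html>"
)
print(extractor.features)
# → [('DOM-LM', 'html/body/h1'), ('Encodes', 'html/body/p'),
#    ('text', 'html/body/p/b'), ('and structure.', 'html/body/p')]
```

In the paper's framing, the DOM path (plain-text models discard it) is exactly the structural signal that distinguishes, say, a table header from body prose.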
Related papers
- HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems [62.36019283532854]
Retrieval-Augmented Generation (RAG) has been shown to improve the knowledge capabilities of large language models.
HtmlRAG uses HTML instead of plain text as the format of retrieved knowledge.
We propose HTML cleaning, compression, and pruning strategies to shorten the HTML while minimizing the loss of information.
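The cleaning step above can be illustrated with a generic sketch (assumed details, not HtmlRAG's actual pipeline): drop non-content tags and collapse whitespace, shortening the HTML while keeping its structure.

```python
# Generic HTML-cleaning sketch (illustrative; HtmlRAG's real strategies also
# include compression and pruning): strip scripts, styles, and comments, then
# collapse whitespace runs.
import re

NOISE = re.compile(r"<(script|style)\b[^>]*>.*?</\1>", re.DOTALL | re.IGNORECASE)
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
WS = re.compile(r"\s+")

def clean_html(html: str) -> str:
    html = NOISE.sub("", html)        # remove scripts and stylesheets
    html = COMMENT.sub("", html)      # remove comments
    return WS.sub(" ", html).strip()  # collapse runs of whitespace

print(clean_html("<div> <script>x()</script> <p>Keep  this.</p> <!-- note --> </div>"))
# → <div> <p>Keep this.</p> </div>
```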
arXiv Detail & Related papers (2024-11-05T09:58:36Z)
- Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs [112.89665642941814]
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio.
Current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code.
We propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning.
arXiv Detail & Related papers (2024-06-28T17:59:46Z)
- Dual-View Visual Contextualization for Web Navigation [36.41910428196889]
We propose to contextualize HTML elements through their "dual views" in webpage screenshots.
We build on the insight that web developers tend to arrange task-related elements nearby on webpages to enhance the user experience.
The resulting representations of HTML elements are more informative for the agent to take action.
arXiv Detail & Related papers (2024-02-06T23:52:10Z)
- A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding [66.6468787004067]
We introduce the Wikipedia Webpage suite (WikiWeb2M) containing 2M pages with all of the associated image, text, and structure data.
We design a novel attention mechanism Prefix Global, which selects the most relevant image and text content as global tokens to attend to the rest of the webpage for context.
arXiv Detail & Related papers (2023-05-05T16:38:05Z)
- Understanding HTML with Large Language Models [73.92747433749271]
Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks.
We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks.
We show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks.
arXiv Detail & Related papers (2022-10-08T07:27:17Z)
- WebFormer: The Web-page Transformer for Structure Information Extraction [44.46531405460861]
Structure information extraction refers to the task of extracting structured text fields from web pages.
Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction.
We introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents.
arXiv Detail & Related papers (2022-02-01T04:44:02Z)
- MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding [35.35388421383703]
Multimodal pre-training with text, layout, and image has made significant progress for Visually-rich Document Understanding (VrDU)
We propose MarkupLM for document understanding tasks with markup languages as the backbone.
Experiment results show that the pre-trained MarkupLM significantly outperforms the existing strong baseline models on several document understanding tasks.
arXiv Detail & Related papers (2021-10-16T09:17:28Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed and is not responsible for any consequences of its use.