HTML-LSTM: Information Extraction from HTML Tables in Web Pages using Tree-Structured LSTM
- URL: http://arxiv.org/abs/2409.19445v1
- Date: Sat, 28 Sep 2024 19:58:29 GMT
- Title: HTML-LSTM: Information Extraction from HTML Tables in Web Pages using Tree-Structured LSTM
- Authors: Kazuki Kawamura, Akihiro Yamamoto
- Abstract summary: We aim to integrate multiple HTML tables into a single table for retrieval of information contained in various Web pages.
The method is designed by extending tree-structured LSTM, a neural network for tree-structured data, in order to extract both the linguistic and the structural information of HTML data.
- Score: 1.104960878651584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel method for extracting information from HTML tables with similar contents but different structures. We aim to integrate multiple HTML tables into a single table for retrieval of information contained in various Web pages. The method is designed by extending tree-structured LSTM, a neural network for tree-structured data, in order to extract both the linguistic and the structural information of HTML data. We evaluate the proposed method through experiments using real data published on the WWW.
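As a rough illustration of what running a tree-structured LSTM over an HTML table involves, the sketch below parses a table into a tree and applies the standard Child-Sum Tree-LSTM recurrence bottom-up. It is not the HTML-LSTM model from the paper: the names `Node`, `TableTreeBuilder`, `ChildSumTreeLSTM`, and `make_embedder` are hypothetical, the weights are random and untrained, and each node is embedded from its tag name only, ignoring the cell text.

```python
import math
import random
from html.parser import HTMLParser


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.text = ""
        self.children = []


class TableTreeBuilder(HTMLParser):
    """Builds a Node tree; assumes every tag is explicitly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def tanh_vec(v):
    return [math.tanh(a) for a in v]


def matvec(m, v):
    return [sum(w * a for w, a in zip(row, v)) for row in m]


def add(*vs):
    return [sum(t) for t in zip(*vs)]


def mul(a, b):
    return [x * y for x, y in zip(a, b)]


class ChildSumTreeLSTM:
    """Child-Sum Tree-LSTM cell with random, untrained weights."""

    def __init__(self, in_dim, hid_dim, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.hid_dim = hid_dim
        self.W = {g: mat(hid_dim, in_dim) for g in "ifou"}
        self.U = {g: mat(hid_dim, hid_dim) for g in "ifou"}
        self.b = {g: [0.0] * hid_dim for g in "ifou"}

    def gate(self, g, x, h):
        return add(matvec(self.W[g], x), matvec(self.U[g], h), self.b[g])

    def forward(self, node, embed):
        x = embed(node)
        # Children are processed first, so information flows leaf to root.
        states = [self.forward(child, embed) for child in node.children]
        h_sum = add([0.0] * self.hid_dim, *[h for h, _ in states])
        i = sigmoid(self.gate("i", x, h_sum))
        o = sigmoid(self.gate("o", x, h_sum))
        u = tanh_vec(self.gate("u", x, h_sum))
        c = mul(i, u)
        for h_k, c_k in states:
            # One forget gate per child, computed from that child's hidden state.
            f_k = sigmoid(self.gate("f", x, h_k))
            c = add(c, mul(f_k, c_k))
        h = mul(o, tanh_vec(c))
        return h, c


def make_embedder(dim, seed=1):
    """Toy embedding: one random vector per tag name (text is ignored)."""
    rng = random.Random(seed)
    table = {}

    def embed(node):
        if node.tag not in table:
            table[node.tag] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        return table[node.tag]

    return embed


html = "<table><tr><th>Name</th><th>Price</th></tr><tr><td>Tea</td><td>3</td></tr></table>"
builder = TableTreeBuilder()
builder.feed(html)
table = builder.root.children[0]
model = ChildSumTreeLSTM(in_dim=4, hid_dim=3)
h, c = model.forward(table, make_embedder(4))
```

Here `h` is a fixed-size vector summarizing the whole table subtree. A trained model would also need text features for each cell and a classifier on top of the per-node states.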
Related papers
- SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning [48.376164461507244]
We introduce SCRIBES (SCRIpt-Based Semi-Structured Content Extraction at Web-Scale), a novel reinforcement learning framework. Instead of processing each page individually, SCRIBES generates reusable extraction scripts that can be applied to groups of structurally similar webpages. Experiments show that our approach outperforms strong baselines by over 13% in script quality and boosts downstream question answering accuracy by more than 4% for GPT-4o.
arXiv Detail & Related papers (2025-10-02T09:27:15Z)
- HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems [62.36019283532854]
Retrieval-Augmented Generation (RAG) has been shown to improve knowledge capabilities.
HtmlRAG uses HTML instead of plain text as the format of retrieved knowledge in RAG.
We propose HTML cleaning, compression, and pruning strategies to shorten the HTML while minimizing the loss of information.
arXiv Detail & Related papers (2024-11-05T09:58:36Z)
- AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts.
Existing wrapper-based methods suffer from limited adaptability and scalability when faced with a new website.
We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
arXiv Detail & Related papers (2024-04-19T09:59:44Z)
- Understanding HTML with Large Language Models [73.92747433749271]
Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks.
We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks.
We show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks.
arXiv Detail & Related papers (2022-10-08T07:27:17Z)
- TSR-DSAW: Table Structure Recognition via Deep Spatial Association of Words [20.59970119209079]
We propose to train a deep network to capture the spatial associations between different word pairs present in the table image for unravelling the table structure.
We present an end-to-end pipeline, named TSR-DSAW: TSR via Deep Spatial Association of Words, which outputs a digital representation of a table image in a structured format such as HTML.
arXiv Detail & Related papers (2022-03-14T06:02:28Z)
- Modelling the semantics of text in complex document layouts using graph transformer networks [0.0]
We propose a model that approximates the human reading pattern of a document and outputs a unique semantic representation for every text span.
We base our architecture on a graph representation of the structured text, and we demonstrate that not only can we retrieve semantically similar information across documents but also that the embedding space we generate captures useful semantic information.
arXiv Detail & Related papers (2022-02-18T11:49:06Z)
- WebFormer: The Web-page Transformer for Structure Information Extraction [44.46531405460861]
Structure information extraction refers to the task of extracting structured text fields from web pages.
Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction.
We introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents.
arXiv Detail & Related papers (2022-02-01T04:44:02Z)
- DOM-LM: Learning Generalizable Representations for HTML Documents [33.742833774918786]
We introduce a novel representation learning approach for web pages, dubbed DOM-LM, which addresses the limitations of existing approaches.
We evaluate DOM-LM on a variety of webpage understanding tasks, including Attribute Extraction, Open Information Extraction, and Question Answering.
arXiv Detail & Related papers (2022-01-25T20:10:32Z)
- A Graph Representation of Semi-structured Data for Web Question Answering [96.46484690047491]
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations.
Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
arXiv Detail & Related papers (2020-10-14T04:01:54Z)
- TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data [113.29476656550342]
We present TaBERT, a pretrained LM that jointly learns representations for NL sentences and tables.
TaBERT is trained on a large corpus of 26 million tables and their English contexts.
Implementation of the model will be available at http://fburl.com/TaBERT.
arXiv Detail & Related papers (2020-05-17T17:26:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.