Multi-Record Web Page Information Extraction From News Websites
- URL: http://arxiv.org/abs/2502.14625v1
- Date: Thu, 20 Feb 2025 15:05:00 GMT
- Title: Multi-Record Web Page Information Extraction From News Websites
- Authors: Alexander Kustenkov, Maksim Varlamov, Alexander Yatskov
- Abstract summary: In this paper, we focus on the problem of extracting information from web pages containing many records.
To address this gap, we created a large-scale, open-access dataset specifically designed for list pages.
Our dataset contains 13,120 web pages with news lists, significantly exceeding existing datasets in both scale and complexity.
- Score: 83.88591755871734
- Abstract: In this paper, we focused on the problem of extracting information from web pages containing many records, a task of growing importance in the era of massive web data. Recently, the development of neural network methods has improved the quality of information extraction from web pages. Nevertheless, most of the research and datasets are aimed at studying detailed pages. This has left multi-record "list pages" relatively understudied, despite their widespread presence and practical significance. To address this gap, we created a large-scale, open-access dataset specifically designed for list pages. This is the first dataset for this task in the Russian language. Our dataset contains 13,120 web pages with news lists, significantly exceeding existing datasets in both scale and complexity. Our dataset contains attributes of various types, including optional and multi-valued, providing a realistic representation of real-world list pages. These features make our dataset a valuable resource for studying information extraction from pages containing many records. Furthermore, we proposed our own multi-stage information extraction methods. In this work, we explore and demonstrate several strategies for applying MarkupLM to the specific challenges of multi-record web pages. Our experiments validate the advantages of our methods. By releasing our dataset to the public, we aim to advance the field of information extraction from multi-record pages.
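The abstract mentions records whose attributes may be optional or multi-valued. A minimal sketch of what such a record could look like once extracted is given below. The field names (`title`, `date`, `tags`), the label set, and the rule that a new record starts at each title node are illustrative assumptions; the abstract does not specify the dataset schema or the authors' multi-stage method.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class NewsRecord:
    """One record on a list page. All field names are hypothetical."""
    title: str                                     # required, single-valued
    date: Optional[str] = None                     # optional: may be missing
    tags: List[str] = field(default_factory=list)  # multi-valued


def group_into_records(labeled_nodes: List[Tuple[str, str]]) -> List[NewsRecord]:
    """Group (label, text) pairs, given in document order, into records.

    Assumption: every record begins with a 'title' node. This boundary
    rule is a simplification for illustration, not the paper's method.
    """
    records: List[NewsRecord] = []
    for label, text in labeled_nodes:
        if label == "title":
            records.append(NewsRecord(title=text))
        elif not records:
            continue  # ignore attributes that appear before any title
        elif label == "date":
            records[-1].date = text
        elif label == "tag":
            records[-1].tags.append(text)
    return records


# Toy input standing in for per-node predictions on a news list page.
nodes = [
    ("title", "First headline"),
    ("date", "2025-02-20"),
    ("tag", "politics"),
    ("tag", "economy"),
    ("title", "Second headline"),  # no date and no tags
]
records = group_into_records(nodes)
```

Here the second record keeps `date` as `None` and `tags` as an empty list, which is how an optional and a multi-valued attribute can be represented without inventing placeholder values.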
Related papers
- Multilingual Attribute Extraction from News Web Pages [44.99833362998488]
This paper addresses the challenge of automatically extracting attributes from news article web pages across multiple languages.
We prepared a multilingual dataset comprising 3,172 marked-up news web pages across six languages (English, German, Russian, Chinese, Korean, and Arabic).
We fine-tuned the pre-trained state-of-the-art model, MarkupLM, to extract news attributes from these pages and evaluated the impact of translating pages into English on extraction quality.
arXiv Detail & Related papers (2025-02-04T09:43:40Z)
- MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels [95.48844474720798]
We introduce MS MARCO Web Search, the first large-scale information-rich web dataset.
This dataset mimics real-world web document and query distribution.
MS MARCO Web Search offers a retrieval benchmark with three web retrieval challenge tasks.
arXiv Detail & Related papers (2024-05-13T07:46:44Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- PLAtE: A Large-scale Dataset for List Page Web Extraction [19.92099953576541]
PLAtE is composed of 52,898 items collected from 6,694 pages and 156,014 attributes, making it the first large-scale list page web extraction dataset.
We use a multi-stage approach to collect and annotate the dataset and adapt three state-of-the-art web extraction models to the two tasks, comparing their strengths and weaknesses both quantitatively and qualitatively.
arXiv Detail & Related papers (2022-05-24T22:26:58Z)
- Web Page Content Extraction Based on Multi-feature Fusion [20.214440785390984]
This paper proposes a web page text extraction algorithm based on multi-feature fusion.
It takes multiple features of DOM nodes as input, predicts whether the nodes contain text information, and adapts to more types of pages.
Experimental results show that this method extracts web page text well and avoids the need to set a threshold manually.
arXiv Detail & Related papers (2022-03-21T04:26:51Z)
- The Klarna Product Page Dataset: Web Element Nomination with Graph Neural Networks and Large Language Models [51.39011092347136]
We introduce the Klarna Product Page dataset, a collection of webpages that surpasses existing datasets in richness and variety.
We empirically benchmark a range of Graph Neural Networks (GNNs) on the web element nomination task.
Second, we introduce a training refinement procedure that involves identifying a small number of relevant elements from each page.
Third, we introduce the Challenge Nomination Training Procedure, a novel training approach that further boosts nomination accuracy.
arXiv Detail & Related papers (2021-11-03T12:13:52Z)
- A Large Visual, Qualitative and Quantitative Dataset of Web Pages [4.5002924206836]
We have created a large dataset of 49,438 Web pages.
It consists of visual, textual and numerical data types, includes all countries worldwide, and considers a broad range of topics.
arXiv Detail & Related papers (2021-05-15T01:31:25Z)
- A Graph Representation of Semi-structured Data for Web Question Answering [96.46484690047491]
We propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations.
Our method improves F1 score by 3.90 points over the state-of-the-art baselines.
arXiv Detail & Related papers (2020-10-14T04:01:54Z)
- A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal [10.553314461761968]
Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries.
This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters.
arXiv Detail & Related papers (2020-05-20T14:33:33Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document-level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)