PLAtE: A Large-scale Dataset for List Page Web Extraction
- URL: http://arxiv.org/abs/2205.12386v2
- Date: Thu, 15 Jun 2023 17:06:49 GMT
- Title: PLAtE: A Large-scale Dataset for List Page Web Extraction
- Authors: Aidan San, Yuan Zhuang, Jan Bakus, Colin Lockard, David Ciemiewicz,
Sandeep Atluri, Yangfeng Ji, Kevin Small, Heba Elfardy
- Abstract summary: PLAtE is composed of 52,898 items collected from 6,694 pages and 156,014 attributes, making it the first large-scale list page web extraction dataset.
We use a multi-stage approach to collect and annotate the dataset and adapt three state-of-the-art web extraction models to the two tasks, comparing their strengths and weaknesses both quantitatively and qualitatively.
- Score: 19.92099953576541
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, neural models have been leveraged to significantly improve the
performance of information extraction from semi-structured websites. However, a
barrier for continued progress is the small number of datasets large enough to
train these models. In this work, we introduce the PLAtE (Pages of Lists
Attribute Extraction) benchmark dataset as a challenging new web extraction
task. PLAtE focuses on shopping data, specifically extractions from product
review pages with multiple items encompassing the tasks of: (1) finding
product-list segmentation boundaries and (2) extracting attributes for each
product. PLAtE is composed of 52,898 items collected from 6,694 pages and
156,014 attributes, making it the first large-scale list page web extraction
dataset. We use a multi-stage approach to collect and annotate the dataset and
adapt three state-of-the-art web extraction models to the two tasks, comparing
their strengths and weaknesses both quantitatively and qualitatively.
Related papers
- Multi-Record Web Page Information Extraction From News Websites [83.88591755871734]
In this paper, we focus on the problem of extracting information from web pages containing many records.
To address this gap, we created a large-scale, open-access dataset specifically designed for list pages.
Our dataset contains 13,120 web pages with news lists, significantly exceeding existing datasets in both scale and complexity.
arXiv Detail & Related papers (2025-02-20T15:05:00Z) - Multilingual Attribute Extraction from News Web Pages [44.99833362998488]
This paper addresses the challenge of automatically extracting attributes from news article web pages across multiple languages.
We prepared a multilingual dataset comprising 3,172 marked-up news web pages across six languages (English, German, Russian, Chinese, Korean, and Arabic).
We fine-tuned the pre-trained state-of-the-art model, MarkupLM, to extract news attributes from these pages and evaluated the impact of translating pages into English on extraction quality.
arXiv Detail & Related papers (2025-02-04T09:43:40Z) - CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation [51.2289822267563]
We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets.
We use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents.
We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks.
arXiv Detail & Related papers (2024-09-03T17:54:40Z) - Product Information Extraction using ChatGPT [69.12244027050454]
This paper explores the potential of ChatGPT for extracting attribute/value pairs from product descriptions.
Our results show that ChatGPT achieves a performance similar to a pre-trained language model but requires much smaller amounts of training data and computation for fine-tuning.
arXiv Detail & Related papers (2023-06-23T09:30:01Z) - ReSel: N-ary Relation Extraction from Scientific Text and Tables by
Learning to Retrieve and Select [53.071352033539526]
We study the problem of extracting N-ary relations from scientific articles.
Our proposed method ReSel decomposes this task into a two-stage procedure.
Our experiments on three scientific information extraction datasets show that ReSel outperforms state-of-the-art baselines significantly.
arXiv Detail & Related papers (2022-10-26T02:28:02Z) - Jointly Learning Span Extraction and Sequence Labeling for Information
Extraction from Business Documents [1.6249267147413522]
This paper introduces a new information extraction model for business documents.
It takes advantage of both span extraction and sequence labeling.
The model is trained end-to-end to jointly optimize the two tasks.
arXiv Detail & Related papers (2022-05-26T15:37:24Z) - WebFormer: The Web-page Transformer for Structure Information Extraction [44.46531405460861]
Structure information extraction refers to the task of extracting structured text fields from web pages.
Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction.
We introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents.
arXiv Detail & Related papers (2022-02-01T04:44:02Z) - The Klarna Product Page Dataset: Web Element Nomination with Graph
Neural Networks and Large Language Models [51.39011092347136]
We introduce the Klarna Product Page dataset, a collection of webpages that surpasses existing datasets in richness and variety.
We empirically benchmark a range of Graph Neural Networks (GNNs) on the web element nomination task.
Second, we introduce a training refinement procedure that involves identifying a small number of relevant elements from each page.
Third, we introduce the Challenge Nomination Training Procedure, a novel training approach that further boosts nomination accuracy.
arXiv Detail & Related papers (2021-11-03T12:13:52Z) - Webly Supervised Fine-Grained Recognition: Benchmark Datasets and An
Approach [115.91099791629104]
We construct two new benchmark webly supervised fine-grained datasets, WebFG-496 and WebiNat-5089, respectively.
WebiNat-5089 contains 5,089 sub-categories and more than 1.1 million web training images, making it the largest webly supervised fine-grained dataset to date.
As a minor contribution, we also propose a novel webly supervised method (termed "Peer-learning") for benchmarking these datasets.
arXiv Detail & Related papers (2021-08-05T06:28:32Z) - A Large-Scale Multi-Document Summarization Dataset from the Wikipedia
Current Events Portal [10.553314461761968]
Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries.
This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters.
arXiv Detail & Related papers (2020-05-20T14:33:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.