CRAWLDoc: A Dataset for Robust Ranking of Bibliographic Documents
- URL: http://arxiv.org/abs/2506.03822v1
- Date: Wed, 04 Jun 2025 10:52:55 GMT
- Title: CRAWLDoc: A Dataset for Robust Ranking of Bibliographic Documents
- Authors: Fabian Karl, Ansgar Scherp
- Abstract summary: CRAWLDoc is a new method for contextual ranking of linked web documents. It retrieves the landing page and all linked web resources, including PDFs, profiles, and supplementary materials. It embeds these resources, along with anchor texts and the URLs, into a unified representation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Publication databases rely on accurate metadata extraction from diverse web sources, yet variations in web layouts and data formats present challenges for metadata providers. This paper introduces CRAWLDoc, a new method for contextual ranking of linked web documents. Starting with a publication's URL, such as a digital object identifier, CRAWLDoc retrieves the landing page and all linked web resources, including PDFs, ORCID profiles, and supplementary materials. It embeds these resources, along with anchor texts and the URLs, into a unified representation. For evaluating CRAWLDoc, we have created a new, manually labeled dataset of 600 publications from six top publishers in computer science. Our method CRAWLDoc demonstrates a robust and layout-independent ranking of relevant documents across publishers and data formats. It lays the foundation for improved metadata extraction from web documents with various layouts and formats. Our source code and dataset can be accessed at https://github.com/FKarl/CRAWLDoc.
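The abstract describes a pipeline that starts from a publication URL, collects the linked resources on the landing page, and pairs each resource's URL with its anchor text before embedding. The following is a minimal sketch of that collection step, not the authors' implementation: the `[SEP]` joining convention and the helper names (`LinkCollector`, `unified_inputs`) are assumptions for illustration, and the actual CRAWLDoc system additionally fetches and embeds the resource contents.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (resolved URL, anchor text) pairs from a landing page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # href of the <a> currently being parsed
        self._text = []     # text fragments inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the landing-page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


def unified_inputs(base_url, html):
    """Build one 'URL [SEP] anchor text' string per linked resource,
    a rough stand-in for the unified representation in the abstract."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return [f"{url} [SEP] {anchor}" for url, anchor in parser.links]
```

In a full pipeline, each of these strings (together with the fetched resource content) would be passed to an embedding model, and the resulting vectors scored against the landing-page context to rank the resources.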
Related papers
- Multi-Record Web Page Information Extraction From News Websites [83.88591755871734]
In this paper, we focus on the problem of extracting information from web pages containing many records. To address this gap, we created a large-scale, open-access dataset specifically designed for list pages. Our dataset contains 13,120 web pages with news lists, significantly exceeding existing datasets in both scale and complexity.
arXiv Detail & Related papers (2025-02-20T15:05:00Z) - Docs2KG: Unified Knowledge Graph Construction from Heterogeneous Documents Assisted by Large Language Models [11.959445364035734]
80% of enterprise data reside in unstructured files, stored in data lakes that accommodate heterogeneous formats.
We introduce Docs2KG, a novel framework designed to extract multimodal information from diverse and heterogeneous documents.
Docs2KG generates a unified knowledge graph that represents the extracted key information.
arXiv Detail & Related papers (2024-06-05T05:35:59Z) - BuDDIE: A Business Document Dataset for Multi-task Information Extraction [18.440587946049845]
BuDDIE is the first multi-task dataset of 1,665 real-world business documents.
Our dataset consists of publicly available business entity documents from US state government websites.
arXiv Detail & Related papers (2024-04-05T10:26:42Z) - PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these documents with rich structure.
We propose PDFTriage that enables models to retrieve the context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z) - DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z) - CED: Catalog Extraction from Documents [12.037861186708799]
We propose a transition-based framework for parsing documents into catalog trees.
We believe the CED task could fill the gap between raw text segments and information extraction tasks on extremely long documents.
arXiv Detail & Related papers (2023-04-28T07:32:00Z) - SIMARA: a database for key-value information extraction from full pages [0.1835211348413763]
We propose a new database for information extraction from historical handwritten documents.
The corpus includes 5,393 finding aids from six different series, dating from the 18th-20th centuries.
Finding aids are handwritten documents that contain metadata describing older archives.
arXiv Detail & Related papers (2023-04-26T15:00:04Z) - DocOIE: A Document-level Context-Aware Dataset for OpenIE [22.544165148622422]
Open Information Extraction (OpenIE) aims to extract structured relational tuples from sentences.
Existing solutions perform extraction at sentence level, without referring to any additional contextual information.
We propose DocIE, a novel document-level context-aware OpenIE model.
arXiv Detail & Related papers (2021-05-10T11:14:30Z) - DocBank: A Benchmark Dataset for Document Layout Analysis [114.81155155508083]
We present DocBank, a benchmark dataset that contains 500K document pages with fine-grained token-level annotations for document layout analysis.
Experiment results show that models trained on DocBank accurately recognize the layout information for a variety of documents.
arXiv Detail & Related papers (2020-06-01T16:04:30Z) - SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z) - Cross-Domain Document Object Detection: Benchmark Suite and Method [71.4339949510586]
Document object detection (DOD) is fundamental for downstream tasks like intelligent document editing and understanding.
We investigate cross-domain DOD, where the goal is to learn a detector for the target domain using labeled data from the source domain and only unlabeled data from the target domain.
For each dataset, we provide the page images, bounding box annotations, PDF files, and the rendering layers extracted from the PDF files.
arXiv Detail & Related papers (2020-03-30T03:04:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.