Identifying Documents In-Scope of a Collection from Web Archives
- URL: http://arxiv.org/abs/2009.00611v1
- Date: Wed, 2 Sep 2020 16:22:23 GMT
- Title: Identifying Documents In-Scope of a Collection from Web Archives
- Authors: Krutarth Patel, Cornelia Caragea, Mark Phillips, Nathaniel Fox
- Abstract summary: We study both machine learning and deep learning models and "bag of words" (BoW) features extracted from the entire document or from specific portions of the document.
We focus our evaluation on three datasets that we created from three different Web archives.
Our experimental results show that the BoW classifiers that focus only on specific portions of the documents (rather than the full text) outperform all compared methods on all three datasets.
- Score: 37.34941845795198
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Web archive data usually contains high-quality documents that are very useful
for creating specialized collections of documents, e.g., scientific digital
libraries and repositories of technical reports. In doing so, there is a
substantial need for automatic approaches that can distinguish the documents of
interest for a collection out of the huge number of documents collected by web
archiving institutions. In this paper, we explore different learning models and
feature representations to determine the best performing ones for identifying
the documents of interest from the web archived data. Specifically, we study
both machine learning and deep learning models and "bag of words" (BoW)
features extracted from the entire document or from specific portions of the
document, as well as structural features that capture the structure of
documents. We focus our evaluation on three datasets that we created from three
different Web archives. Our experimental results show that the BoW classifiers
that focus only on specific portions of the documents (rather than the full
text) outperform all compared methods on all three datasets.
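As a rough illustration of the idea in the abstract (bag-of-words features taken from only a portion of each document), here is a minimal sketch. It is not the authors' implementation. The choice of "first N tokens" as the portion, the multinomial naive Bayes classifier, the names `portion` and `NaiveBayesBoW`, and the toy documents and labels are all assumptions made for this example.

```python
import math
from collections import Counter


def portion(text, n_tokens=50):
    """Return the first n_tokens lowercased tokens (assumed stand-in for a 'specific portion')."""
    return text.lower().split()[:n_tokens]


class NaiveBayesBoW:
    """Multinomial naive Bayes over bag-of-words counts (hypothetical helper, not from the paper)."""

    def fit(self, docs, labels, n_tokens=50):
        self.n_tokens = n_tokens
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.class_counts}
        for doc, label in zip(docs, labels):
            # Only the selected portion of each document contributes features.
            self.word_counts[label].update(portion(doc, n_tokens))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, doc):
        # Tokens unseen during training are ignored.
        tokens = [t for t in portion(doc, self.n_tokens) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for c, wc in self.word_counts.items():
            denom = sum(wc.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(self.class_counts[c] / total_docs)
            score += sum(math.log((wc[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best, best_score = c, score
        return best


# Toy data invented for this sketch; real collections would be far larger.
docs = [
    "technical report abstract introduction methods results references",
    "abstract we study thesis dissertation university references",
    "meeting agenda minutes lunch schedule parking",
    "invoice payment due amount order shipping",
]
labels = ["in_scope", "in_scope", "out_of_scope", "out_of_scope"]

model = NaiveBayesBoW().fit(docs, labels, n_tokens=5)
print(model.predict("abstract introduction of a technical report on water quality"))
print(model.predict("meeting minutes and lunch schedule for staff"))
```

Changing `n_tokens` changes how much of each document feeds the classifier, which is the knob the abstract's comparison between portion-based and full-text features turns on.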
Related papers
- DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models [63.466265039007816]
We present DocGenome, a structured document benchmark constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community.
We conduct extensive experiments to demonstrate the advantages of DocGenome and objectively evaluate the performance of large models on our benchmark.
arXiv Detail & Related papers (2024-06-17T15:13:52Z)
- Leveraging Collection-Wide Similarities for Unsupervised Document Structure Extraction [61.998789448260005]
We propose to identify the typical structure of documents within a collection.
We abstract over arbitrary header paraphrases, and ground each topic to respective document locations.
We develop an unsupervised graph-based method which leverages both inter- and intra-document similarities.
arXiv Detail & Related papers (2024-02-21T16:22:21Z)
- DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z)
- Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z)
- Evaluation of a Region Proposal Architecture for Multi-task Document Layout Analysis [0.685316573653194]
The Mask-RCNN architecture is designed to address the problem of baseline detection and region segmentation.
We present experimental results on two handwritten text datasets and one handwritten music dataset.
The analyzed architecture yields promising results, outperforming state-of-the-art techniques in all three datasets.
arXiv Detail & Related papers (2021-06-22T14:07:27Z)
- docExtractor: An off-the-shelf historical document element extraction [18.828438308738495]
We present docExtractor, a generic approach for extracting visual elements such as text lines or illustrations from historical documents.
We demonstrate that it provides high-quality performance as an off-the-shelf system across a wide variety of datasets.
We introduce a new public dataset dubbed IlluHisDoc dedicated to the fine evaluation of illustration segmentation in historical documents.
arXiv Detail & Related papers (2020-12-15T10:19:18Z)
- DocBank: A Benchmark Dataset for Document Layout Analysis [114.81155155508083]
We present DocBank, a benchmark dataset that contains 500K document pages with fine-grained token-level annotations for document layout analysis.
Experiment results show that models trained on DocBank accurately recognize the layout information for a variety of documents.
arXiv Detail & Related papers (2020-06-01T16:04:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.