HRDoc: Dataset and Baseline Method Toward Hierarchical Reconstruction of
Document Structures
- URL: http://arxiv.org/abs/2303.13839v1
- Date: Fri, 24 Mar 2023 07:23:56 GMT
- Title: HRDoc: Dataset and Baseline Method Toward Hierarchical Reconstruction of
Document Structures
- Authors: Jiefeng Ma, Jun Du, Pengfei Hu, Zhenrong Zhang, Jianshu Zhang, Huihui
Zhu, Cong Liu
- Abstract summary: This paper introduces hierarchical reconstruction of document structures as a novel task suitable for NLP and CV fields.
We built a large-scale dataset named HRDoc, which consists of 2,500 multi-page documents with nearly 2 million semantic units.
We propose an encoder-decoder-based hierarchical document structure parsing system (DSPS) to tackle this problem.
- Score: 31.868926876151342
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The problem of document structure reconstruction refers to converting digital
or scanned documents into corresponding semantic structures. Most existing
works mainly focus on splitting the boundary of each element in a single
document page, neglecting the reconstruction of semantic structure in
multi-page documents. This paper introduces hierarchical reconstruction of
document structures as a novel task suitable for NLP and CV fields. To better
evaluate the system performance on the new task, we built a large-scale dataset
named HRDoc, which consists of 2,500 multi-page documents with nearly 2 million
semantic units. Every document in HRDoc has line-level annotations including
categories and relations obtained from rule-based extractors and human
annotators. Moreover, we proposed an encoder-decoder-based hierarchical
document structure parsing system (DSPS) to tackle this problem. By adopting a
multi-modal bidirectional encoder and a structure-aware GRU decoder with
soft-mask operation, the DSPS model surpasses the baseline method by a large
margin. All scripts and datasets will be made publicly available at
https://github.com/jfma-USTC/HRDoc.
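The abstract describes line-level annotations that carry a category and a relation for each semantic unit. As a rough illustration of what hierarchical reconstruction means, the sketch below rebuilds a document tree from such annotations. The field names (`line_id`, `category`, `parent_id`), the `-1` root convention, and the sample lines are assumptions made for this example; they are not taken from the HRDoc release, and the code does not reproduce the DSPS model.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """One annotated text line (hypothetical schema, not the HRDoc format)."""
    line_id: int
    text: str
    category: str
    parent_id: int  # assumed convention: -1 means "attached to the document root"
    children: list = field(default_factory=list)


def build_tree(units):
    """Link every unit to its parent and return the top-level units."""
    by_id = {u.line_id: u for u in units}
    roots = []
    for u in units:
        parent = by_id.get(u.parent_id)
        if parent is None:
            roots.append(u)  # no known parent: treat as top-level
        else:
            parent.children.append(u)
    return roots


def render(nodes, depth=0):
    """Flatten the tree into indented outline lines, depth-first."""
    lines = []
    for n in nodes:
        lines.append("  " * depth + f"[{n.category}] {n.text}")
        lines.extend(render(n.children, depth + 1))
    return lines


# Made-up sample lines, in reading order.
units = [
    Unit(0, "HRDoc", "title", -1),
    Unit(1, "1 Introduction", "section", 0),
    Unit(2, "Some body text.", "paragraph", 1),
    Unit(3, "2 Dataset", "section", 0),
]
outline = render(build_tree(units))
print("\n".join(outline))
```

In this simplified view, a parsing system predicts the category and the parent of every line, and the tree follows from those predictions. Multi-page documents fit the same pattern as long as line identifiers are unique across pages.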
Related papers
- Multi-Field Adaptive Retrieval [39.38972160512916]
We introduce Multi-Field Adaptive Retrieval (MFAR), a flexible framework that accommodates any number of document indices on structured data.
Our framework consists of two main steps: (1) the decomposition of an existing document into fields, each indexed independently through dense and lexical methods, and (2) learning a model which adaptively predicts the importance of a field by conditioning on the document query.
We find that our approach allows for the optimized use of dense versus lexical representations across field types, significantly improves in document ranking over a number of existing retrievers, and achieves state-of-the-art performance for multi-field structured
arXiv Detail & Related papers (2024-10-26T03:07:22Z)
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
- HDT: Hierarchical Document Transformer [70.2271469410557]
HDT exploits document structure by introducing auxiliary anchor tokens and redesigning the attention mechanism into a sparse multi-level hierarchy.
We develop a novel sparse attention kernel that considers the hierarchical structure of documents.
arXiv Detail & Related papers (2024-07-11T09:28:04Z)
- Leveraging Collection-Wide Similarities for Unsupervised Document Structure Extraction [61.998789448260005]
We propose to identify the typical structure of documents within a collection.
We abstract over arbitrary header paraphrases, and ground each topic to respective document locations.
We develop an unsupervised graph-based method which leverages both inter- and intra-document similarities.
arXiv Detail & Related papers (2024-02-21T16:22:21Z)
- Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis [9.340346869932434]
We propose a tree construction based approach that addresses multiple subtasks concurrently.
We present an effective end-to-end solution based on this framework to demonstrate its performance.
Our end-to-end system achieves state-of-the-art performance on two large-scale document layout analysis datasets.
arXiv Detail & Related papers (2024-01-22T12:00:37Z)
- DSG: An End-to-End Document Structure Generator [32.040520771901996]
Document Structure Generator (DSG) is a novel system for document parsing that is fully end-to-end trainable.
Our results demonstrate that our DSG outperforms commercial OCR tools and, on top of that, achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-10-13T14:03:01Z)
- PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these documents with rich structure.
We propose PDFTriage that enables models to retrieve the context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z)
- DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z)
- TransDocAnalyser: A Framework for Offline Semi-structured Handwritten
Document Analysis in the Legal Domain [3.5018563401895455]
We build the first semi-structured document analysis dataset in the legal domain.
This dataset combines a wide variety of handwritten text with printed text.
We propose an end-to-end framework for offline processing of handwritten semi-structured documents.
arXiv Detail & Related papers (2023-06-03T15:56:30Z)
- Multilevel Text Alignment with Cross-Document Attention [59.76351805607481]
Existing alignment methods operate at a single, predefined level.
We propose a new learning approach that equips previously established hierarchical attention encoders for representing documents with a cross-document attention component.
arXiv Detail & Related papers (2020-10-03T02:52:28Z)