Extracting Variable-Depth Logical Document Hierarchy from Long
Documents: Method, Evaluation, and Application
- URL: http://arxiv.org/abs/2105.09297v1
- Date: Fri, 14 May 2021 06:26:22 GMT
- Title: Extracting Variable-Depth Logical Document Hierarchy from Long
Documents: Method, Evaluation, and Application
- Authors: Rongyu Cao and Yixuan Cao and Ganbin Zhou and Ping Luo
- Abstract summary: We develop a framework, namely Hierarchy Extraction from Long Document (HELD), where we "sequentially" insert each physical object at the proper position of the current tree.
Experiments are based on thousands of long documents from the Chinese and English financial markets and from English scientific publications.
We show that logical document hierarchy can be employed to significantly improve the performance of the downstream passage retrieval task.
- Score: 21.270184491603864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study the problem of extracting variable-depth "logical
document hierarchy" from long documents, namely organizing the recognized
"physical document objects" into hierarchical structures. The discovery of
logical document hierarchy is a vital step in supporting many downstream
applications. However, long documents, containing hundreds or even thousands of
pages and variable-depth hierarchy, challenge the existing methods. To address
these challenges, we develop a framework, namely Hierarchy Extraction from Long
Document (HELD), where we "sequentially" insert each physical object at the
proper position of the current tree. Determining whether each possible position is
proper or not can be formulated as a binary classification problem. To further
improve its effectiveness and efficiency, we study the design variants in HELD,
including traversal orders of the insertion positions, heading extraction
explicitly or implicitly, tolerance to insertion errors in predecessor steps,
and so on. The empirical experiments based on thousands of long documents from
the Chinese and English financial markets and from English scientific publications show that
the HELD model with the "root-to-leaf" traversal order and explicit heading
extraction is the best choice to achieve the tradeoff between effectiveness and
efficiency with the accuracy of 0.9726, 0.7291 and 0.9578 in Chinese financial,
English financial and arXiv datasets, respectively. Finally, we show that
logical document hierarchy can be employed to significantly improve the
performance of the downstream passage retrieval task. In summary, we conduct a
systematic study on this task in terms of methods, evaluations, and
applications.
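
The sequential insertion idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that candidate insertion positions are the nodes on the rightmost path of the current tree (so reading order is preserved), and it replaces the learned binary classifier with a hypothetical rule, `toy_is_proper`, that compares explicit heading levels. The names `Node`, `rightmost_path`, `build_hierarchy`, and `PARAGRAPH_LEVEL` are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Toy convention (an assumption): body text sits below every heading level.
PARAGRAPH_LEVEL = 99


@dataclass
class Node:
    text: str
    level: int  # 0 = root, 1..n = heading depth, PARAGRAPH_LEVEL = body text
    children: List["Node"] = field(default_factory=list)


def rightmost_path(root: Node) -> List[Node]:
    """Nodes from the root down to the most recently inserted node."""
    path = [root]
    while path[-1].children:
        path.append(path[-1].children[-1])
    return path


def toy_is_proper(parent: Node, last_child: Optional[Node], obj: Node) -> bool:
    """Hypothetical stand-in for the binary classifier.

    Accepts a position when the parent is shallower than the new object and
    the parent's current last child is not shallower than the new object.
    """
    if parent.level >= obj.level:
        return False
    return last_child is None or last_child.level >= obj.level


def build_hierarchy(
    objects: List[Tuple[str, int]],
    is_proper: Callable[[Node, Optional[Node], Node], bool] = toy_is_proper,
) -> Node:
    """Insert objects one by one, testing positions in root-to-leaf order."""
    root = Node("ROOT", 0)
    for text, level in objects:
        obj = Node(text, level)
        for parent in rightmost_path(root):  # root-to-leaf traversal
            last_child = parent.children[-1] if parent.children else None
            if is_proper(parent, last_child, obj):
                parent.children.append(obj)
                break
        else:
            # Fallback if the classifier rejects every position.
            root.children.append(obj)
    return root


def outline(node: Node, depth: int = 0) -> List[str]:
    """Render the tree as indented lines for inspection."""
    lines = ["  " * depth + node.text]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    doc = [
        ("1 Intro", 1),
        ("para a", PARAGRAPH_LEVEL),
        ("1.1 Background", 2),
        ("para b", PARAGRAPH_LEVEL),
        ("2 Method", 1),
        ("para c", PARAGRAPH_LEVEL),
    ]
    print("\n".join(outline(build_hierarchy(doc))))
```

In a real system the heading levels are not given, and a trained model would replace `toy_is_proper`. The sketch only shows how a per-position yes/no decision, applied from the root downward, yields a tree of variable depth.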
Related papers
- HDT: Hierarchical Document Transformer [70.2271469410557]
HDT exploits document structure by introducing auxiliary anchor tokens and redesigning the attention mechanism into a sparse multi-level hierarchy.
We develop a novel sparse attention kernel that considers the hierarchical structure of documents.
arXiv Detail & Related papers (2024-07-11T09:28:04Z)
- Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis [9.340346869932434]
We propose a tree construction based approach that addresses multiple subtasks concurrently.
We present an effective end-to-end solution based on this framework to demonstrate its performance.
Our end-to-end system achieves state-of-the-art performance on two large-scale document layout analysis datasets.
arXiv Detail & Related papers (2024-01-22T12:00:37Z)
- Unveiling Document Structures with YOLOv5 Layout Detection [0.0]
This research investigates the utilization of YOLOv5, a cutting-edge computer vision model, for the purpose of rapidly identifying document layouts and extracting unstructured data.
The main objective is to create an autonomous system that can effectively recognize document layouts and extract unstructured data.
arXiv Detail & Related papers (2023-09-29T07:45:10Z)
- PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these documents with rich structure.
We propose PDFTriage that enables models to retrieve the context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z)
- DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
- CED: Catalog Extraction from Documents [12.037861186708799]
We propose a transition-based framework for parsing documents into catalog trees.
We believe the CED task could fill the gap between raw text segments and information extraction tasks on extremely long documents.
arXiv Detail & Related papers (2023-04-28T07:32:00Z)
- Fine-Grained Distillation for Long Document Retrieval [86.39802110609062]
Long document retrieval aims to fetch query-relevant documents from a large-scale collection.
Knowledge distillation has become the de facto way to improve a retriever by mimicking a heterogeneous yet powerful cross-encoder.
We propose a new learning framework, fine-grained distillation (FGD), for long-document retrievers.
arXiv Detail & Related papers (2022-12-20T17:00:36Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de-facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidences in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z)