Multimodal Tree Decoder for Table of Contents Extraction in Document
Images
- URL: http://arxiv.org/abs/2212.02896v1
- Date: Tue, 6 Dec 2022 11:38:31 GMT
- Title: Multimodal Tree Decoder for Table of Contents Extraction in Document
Images
- Authors: Pengfei Hu, Zhenrong Zhang, Jianshu Zhang, Jun Du, Jiajia Wu
- Abstract summary: Table of contents (ToC) extraction aims to extract headings of different levels in documents to better understand the outline of the contents.
We first introduce a standard dataset, HierDoc, including image samples from 650 documents of scientific papers with their content labels.
We propose a novel end-to-end model, built on a multimodal tree decoder (MTD), for ToC extraction as a benchmark for HierDoc.
- Score: 32.46909366312659
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Table of contents (ToC) extraction aims to extract headings of different
levels in documents to better understand the outline of the contents, which can
be widely used for document understanding and information retrieval. Existing
works often use hand-crafted features and predefined rule-based functions to
detect headings and resolve the hierarchical relationship between headings.
Both the benchmark and research based on deep learning are still limited.
Accordingly, in this paper, we first introduce a standard dataset, HierDoc,
including image samples from 650 documents of scientific papers with their
content labels. Then we propose a novel end-to-end model by using the
multimodal tree decoder (MTD) for ToC as a benchmark for HierDoc. The MTD model
is mainly composed of three parts, namely encoder, classifier, and decoder. The
encoder fuses the multimodality features of vision, text, and layout
information for each entity of the document. Then the classifier recognizes and
selects the heading entities. Next, to parse the hierarchical relationship
between the heading entities, a tree-structured decoder is designed. To
evaluate the performance, both the metric of tree-edit-distance similarity
(TEDS) and F1-Measure are adopted. Finally, our MTD approach achieves an
average TEDS of 87.2% and an average F1-Measure of 88.1% on the test set of
HierDoc. The code and dataset will be released at:
https://github.com/Pengfei-Hu/MTD.
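The TEDS metric mentioned in the abstract can be illustrated with a small sketch. It is not the authors' evaluation code: it assumes the commonly used definition TEDS = 1 - EditDist(Ta, Tb) / max(|Ta|, |Tb|) with unit costs for insert, delete, and relabel, and it uses a simple memoized recursion for ordered tree edit distance that is only practical for small trees. The names `build_toc_tree`, `forest_dist`, and `teds`, and the example headings, are hypothetical.

```python
from functools import lru_cache


def build_toc_tree(headings):
    """Turn [(level, title), ...] in reading order into a nested tuple tree.

    Assumes levels start at 1; a synthetic "ROOT" node sits at level 0.
    Each node is (label, children), where children is a tuple of nodes.
    """
    root = ["ROOT", []]          # mutable [label, children] while building
    stack = [(0, root)]          # path from the root to the latest heading
    for level, title in headings:
        node = [title, []]
        while stack[-1][0] >= level:   # climb to a strictly shallower parent
            stack.pop()
        stack[-1][1][1].append(node)
        stack.append((level, node))

    def freeze(n):
        return (n[0], tuple(freeze(c) for c in n[1]))

    return freeze(root)


def size(forest):
    """Number of nodes in a forest (a tuple of trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_dist(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f:
        return size(g)           # insert everything in g
    if not g:
        return size(f)           # delete everything in f
    (label_v, kids_v), (label_w, kids_w) = f[-1], g[-1]
    return min(
        # delete the rightmost root of f; its children move up
        forest_dist(f[:-1] + kids_v, g) + 1,
        # insert the rightmost root of g
        forest_dist(f, g[:-1] + kids_w) + 1,
        # match the two rightmost roots, relabeling if needed
        forest_dist(kids_v, kids_w)
        + forest_dist(f[:-1], g[:-1])
        + (label_v != label_w),
    )


def teds(tree_a, tree_b):
    """Tree-edit-distance similarity in [0, 1]; 1 means identical trees."""
    dist = forest_dist((tree_a,), (tree_b,))
    return 1.0 - dist / max(size((tree_a,)), size((tree_b,)))


# Hypothetical example: the prediction puts "Decoder" one level too high.
gold = build_toc_tree([(1, "Introduction"), (1, "Method"), (2, "Encoder"),
                       (2, "Decoder"), (1, "Experiments")])
pred = build_toc_tree([(1, "Introduction"), (1, "Method"), (2, "Encoder"),
                       (1, "Decoder"), (1, "Experiments")])
print(f"TEDS = {teds(gold, pred):.3f}")
```

Because the score depends on tree structure as well as node labels, a heading that is detected correctly but attached at the wrong level still lowers TEDS, which is why it complements a flat F1-Measure over detected headings.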
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
- A Scalable Framework for Table of Contents Extraction from Complex ESG Annual Reports [19.669390380593843]
We propose a new dataset, ESGDoc, comprising 1,093 ESG annual reports from 563 companies spanning from 2001 to 2022.
These reports pose significant challenges due to their diverse structures and extensive length.
We propose a new framework for ToC extraction, consisting of three steps.
arXiv Detail & Related papers (2023-10-27T11:40:32Z)
- Enhancing Document Information Analysis with Multi-Task Pre-training: A Robust Approach for Information Extraction in Visually-Rich Documents [8.49076413640561]
The model is pre-trained and subsequently fine-tuned for various document image analysis tasks.
The proposed model achieved impressive results across all tasks, with an accuracy of 95.87% on the RVL-CDIP dataset for document classification.
arXiv Detail & Related papers (2023-10-25T10:22:30Z)
- PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these documents with rich structure.
We propose PDFTriage that enables models to retrieve the context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z)
- HRDoc: Dataset and Baseline Method Toward Hierarchical Reconstruction of Document Structures [31.868926876151342]
This paper introduces hierarchical reconstruction of document structures as a novel task suitable for NLP and CV fields.
We built a large-scale dataset named HRDoc, which consists of 2,500 multi-page documents with nearly 2 million semantic units.
We propose an encoder-decoder-based hierarchical document structure parsing system (DSPS) to tackle this problem.
arXiv Detail & Related papers (2023-03-24T07:23:56Z)
- Lbl2Vec: An Embedding-Based Approach for Unsupervised Document Retrieval on Predefined Topics [0.6767885381740952]
We introduce a method that learns jointly embedded document and word vectors solely from the unlabeled document dataset.
The proposed method requires almost no text preprocessing but is simultaneously effective at retrieving relevant documents with high probability.
For easy replication of our approach, we make the developed Lbl2Vec code publicly available as a ready-to-use tool under the 3-Clause BSD license.
arXiv Detail & Related papers (2022-10-12T08:57:01Z)
- TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained end to end, achieving global optimization.
arXiv Detail & Related papers (2022-07-14T08:52:07Z)
- One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task need to feed a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- DocBank: A Benchmark Dataset for Document Layout Analysis [114.81155155508083]
We present DocBank, a benchmark dataset that contains 500K document pages with fine-grained token-level annotations for document layout analysis.
Experiment results show that models trained on DocBank accurately recognize the layout information for a variety of documents.
arXiv Detail & Related papers (2020-06-01T16:04:30Z)
- Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles [5.40541521227338]
We model the problem of finding the relationship between two documents as a pairwise document classification task.
To find semantic relations between documents, we apply a series of techniques, such as GloVe, Paragraph Vectors, BERT, and XLNet.
We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations.
arXiv Detail & Related papers (2020-03-22T12:52:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.