Multimodal Tree Decoder for Table of Contents Extraction in Document Images
- URL: http://arxiv.org/abs/2212.02896v1
- Date: Tue, 6 Dec 2022 11:38:31 GMT
- Title: Multimodal Tree Decoder for Table of Contents Extraction in Document Images
- Authors: Pengfei Hu, Zhenrong Zhang, Jianshu Zhang, Jun Du, Jiajia Wu
- Abstract summary: Table of contents (ToC) extraction aims to extract headings of different levels in documents to better understand the outline of the contents.
We first introduce a standard dataset, HierDoc, comprising page images from 650 scientific papers with their content labels.
We propose a novel end-to-end model, the multimodal tree decoder (MTD), as a benchmark for ToC extraction on HierDoc.
- Score: 32.46909366312659
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Table of contents (ToC) extraction aims to extract headings of different
levels in documents to better understand the outline of the contents, which can
be widely used for document understanding and information retrieval. Existing
works often use hand-crafted features and predefined rule-based functions to
detect headings and resolve the hierarchical relationship between headings.
Both benchmarks and deep-learning-based research remain limited.
Accordingly, in this paper, we first introduce a standard dataset, HierDoc,
comprising page images from 650 scientific papers with their content labels.
Then we propose a novel end-to-end model, the multimodal tree decoder (MTD),
as a benchmark for ToC extraction on HierDoc. The MTD model is composed of
three parts: an encoder, a classifier, and a decoder. The encoder fuses the
multimodal features of vision, text, and layout for each entity of the
document. The classifier then recognizes and selects the heading entities.
selects the heading entities. Next, to parse the hierarchical relationship
between the heading entities, a tree-structured decoder is designed. To
evaluate the performance, both tree-edit-distance similarity (TEDS) and the
F1-measure are adopted. Finally, our MTD approach achieves an average TEDS of
87.2% and an average F1-measure of 88.1% on the test set of
HierDoc. The code and dataset will be released at:
https://github.com/Pengfei-Hu/MTD.
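As a concrete reference for the evaluation metric, here is a minimal sketch of TEDS computed with the open-source `zss` (Zhang-Shasha) tree-edit-distance library. It uses unit edit costs and toy ToC trees of my own construction; the paper's exact cost function and tree encoding may differ.

```python
# Hedged sketch: TEDS = 1 - edit_distance / max(|T_pred|, |T_gold|),
# computed with unit edit costs via the zss (Zhang-Shasha) library.
from zss import Node, simple_distance

def tree_size(node):
    """Count all nodes in a zss tree."""
    return 1 + sum(tree_size(c) for c in Node.get_children(node))

def teds(pred, gold):
    """Tree-edit-distance similarity in [0, 1]; 1 means identical trees."""
    dist = simple_distance(pred, gold)
    return 1.0 - dist / max(tree_size(pred), tree_size(gold))

# Toy ToC trees: a virtual root whose descendants are headings.
gold = Node("root").addkid(Node("1 Introduction")).addkid(
    Node("2 Method").addkid(Node("2.1 Encoder")).addkid(Node("2.2 Decoder")))
pred = Node("root").addkid(Node("1 Introduction")).addkid(
    Node("2 Method").addkid(Node("2.1 Encoder")))  # one subsection missed

print(f"TEDS = {teds(pred, gold):.3f}")  # 1 - 1/5 = 0.800
```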
Related papers
- ReTreever: Tree-based Coarse-to-Fine Representations for Retrieval [64.44265315244579]
We propose a tree-based method for organizing and representing reference documents at various granular levels.
Our method, called ReTreever, jointly learns a routing function per internal node of a binary tree such that query and reference documents are assigned to similar tree branches.
Our evaluations show that ReTreever generally preserves full representation accuracy.
arXiv Detail & Related papers (2025-02-11T21:35:13Z)
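As a rough illustration of the routing idea in this entry, the sketch below sends an embedding through a binary tree with one linear router per internal node; the sigmoid router, greedy hard routing, and toy sizes are my assumptions, not details taken from the paper.

```python
# Hedged sketch of tree-based routing: each internal node holds a (here
# randomly initialized, in ReTreever jointly learned) linear router.
import numpy as np

rng = np.random.default_rng(0)
DIM, DEPTH = 16, 3                                # toy sizes (assumed)
routers = rng.normal(size=(2 ** DEPTH - 1, DIM))  # one router per internal node

def route(x: np.ndarray) -> int:
    """Descend the tree greedily; return the leaf bucket for embedding x."""
    node = 0
    for _ in range(DEPTH):
        go_right = 1.0 / (1.0 + np.exp(-routers[node] @ x)) > 0.5
        node = 2 * node + (2 if go_right else 1)  # heap-style child indices
    return node - (2 ** DEPTH - 1)                # leaf id in [0, 2**DEPTH)

query, doc = rng.normal(size=DIM), rng.normal(size=DIM)
print(route(query), route(doc))  # shared leaf => coarse retrieval match
```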
- MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents [26.39534684408116]
This work introduces a new benchmark, named MMDocIR, encompassing two distinct tasks: page-level and layout-level retrieval.
The MMDocIR benchmark comprises a rich dataset featuring expertly annotated labels for 1,685 questions and bootstrapped labels for 173,843 questions.
arXiv Detail & Related papers (2025-01-15T14:30:13Z)
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
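The first of the two methods in this entry suggests a loss along these lines: an InfoNCE-style objective in which each document's corpus neighbors are scored alongside the positive. This is a hedged reading of "neighbors in the intra-batch loss", not the authors' code.

```python
# Hedged sketch: contrastive loss with a document's neighbors as extra
# (hard) negatives next to the positive document.
import numpy as np

def contextual_info_nce(q, pos, neighbors, tau=0.05):
    """q: (d,) query; pos: (d,) its document; neighbors: (k, d) corpus
    neighbors of that document, used as additional negatives."""
    cands = np.vstack([pos[None, :], neighbors])  # index 0 is the positive
    logits = cands @ q / tau                      # scaled dot-product scores
    logits -= logits.max()                        # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum())
    return -log_softmax[0]                        # NLL of the positive

rng = np.random.default_rng(0)
d, k = 8, 4
q = rng.normal(size=d)
pos = q + 0.1 * rng.normal(size=d)                # a near-duplicate positive
print(contextual_info_nce(q, pos, rng.normal(size=(k, d))))
```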
- Unifying Multimodal Retrieval via Document Screenshot Embedding [92.03571344075607]
Document Screenshot Embedding (DSE) is a novel retrieval paradigm that regards document screenshots as a unified input format.
We first craft Wiki-SS, a corpus of 1.3M Wikipedia web page screenshots, to answer questions from the Natural Questions dataset.
For example, DSE outperforms BM25 by 17 points in top-1 retrieval accuracy. Additionally, in a mixed-modality task of slide retrieval, DSE significantly outperforms OCR text retrieval methods by over 15 points in nDCG@10.
arXiv Detail & Related papers (2024-06-17T06:27:35Z)
- Enhancing Document Information Analysis with Multi-Task Pre-training: A Robust Approach for Information Extraction in Visually-Rich Documents [8.49076413640561]
The model is pre-trained and subsequently fine-tuned for various document image analysis tasks.
The proposed model achieved impressive results across all tasks, with an accuracy of 95.87% on the RVL-CDIP dataset for document classification.
arXiv Detail & Related papers (2023-10-25T10:22:30Z)
- HRDoc: Dataset and Baseline Method Toward Hierarchical Reconstruction of Document Structures [31.868926876151342]
This paper introduces hierarchical reconstruction of document structures as a novel task suitable for NLP and CV fields.
We built a large-scale dataset named HRDoc, which consists of 2,500 multi-page documents with nearly 2 million semantic units.
We propose an encoder-decoder-based hierarchical document structure parsing system (DSPS) to tackle this problem.
arXiv Detail & Related papers (2023-03-24T07:23:56Z)
- Lbl2Vec: An Embedding-Based Approach for Unsupervised Document Retrieval on Predefined Topics [0.6767885381740952]
We introduce a method that learns jointly embedded document and word vectors solely from the unlabeled document dataset.
The proposed method requires almost no text preprocessing but is simultaneously effective at retrieving relevant documents with high probability.
For easy replication of our approach, we make the developed Lbl2Vec code publicly available as a ready-to-use tool under the 3-Clause BSD license.
arXiv Detail & Related papers (2022-10-12T08:57:01Z)
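A conceptual sketch of the Lbl2Vec idea follows, written against gensim's Doc2Vec rather than the released tool's own API: learn word and document vectors jointly from unlabeled text, average predefined topic keywords into a label vector, and rank documents by cosine similarity to it.

```python
# Hedged sketch of the Lbl2Vec idea (not the published library's API).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

corpus = [  # toy unlabeled documents
    "the striker scored a late goal in the match".split(),
    "the court ruled on the appeal of the defendant".split(),
]
docs = [TaggedDocument(words, [i]) for i, words in enumerate(corpus)]
# dm=0 + dbow_words=1 trains document and word vectors jointly (DBOW + skip-gram).
model = Doc2Vec(docs, vector_size=16, min_count=1, epochs=50, dm=0, dbow_words=1)

topic_vec = np.mean([model.wv[w] for w in ["goal", "match"]], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Higher score = document more likely about the predefined topic.
ranked = sorted(((cosine(model.dv[i], topic_vec), i) for i in range(len(corpus))),
                reverse=True)
print(ranked)
```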
- TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained in an end-to-end manner, achieving global optimization.
arXiv Detail & Related papers (2022-07-14T08:52:07Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- DocBank: A Benchmark Dataset for Document Layout Analysis [114.81155155508083]
We present DocBank, a benchmark dataset that contains 500K document pages with fine-grained token-level annotations for document layout analysis.
Experiment results show that models trained on DocBank accurately recognize the layout information for a variety of documents.
arXiv Detail & Related papers (2020-06-01T16:04:30Z)
- Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles [5.40541521227338]
We model the problem of finding the relationship between two documents as a pairwise document classification task.
To find semantic relations between documents, we apply a series of techniques, such as GloVe, paragraph vectors, BERT, and XLNet.
We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations.
arXiv Detail & Related papers (2020-03-22T12:52:56Z)