Robust PDF Document Conversion Using Recurrent Neural Networks
- URL: http://arxiv.org/abs/2102.09395v1
- Date: Thu, 18 Feb 2021 14:39:54 GMT
- Title: Robust PDF Document Conversion Using Recurrent Neural Networks
- Authors: Nikolaos Livathinos (1), Cesar Berrospi (1), Maksym Lysak (1), Viktor
Kuropiatnyk (1), Ahmed Nassar (1), Andre Carvalho (1), Michele Dolfi (1),
Christoph Auer (1), Kasper Dinkla (1), Peter Staar (1) ((1) IBM Research)
- Abstract summary: We present a novel approach to document structure recovery in PDF using recurrent neural networks.
We show how a sequence of PDF printing commands can be used as input into a neural network.
We implement a model that yields a weighted average F1 score of 97% across 17 distinct structural labels.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The number of published PDF documents has increased exponentially in recent
decades. There is a growing need to make their rich content discoverable to
information retrieval tools. In this paper, we present a novel approach to
document structure recovery in PDF using recurrent neural networks to process
the low-level PDF data representation directly, instead of relying on a visual
re-interpretation of the rendered PDF page, as has been proposed in previous
literature. We demonstrate how a sequence of PDF printing commands can be used
as input into a neural network and how the network can learn to classify each
printing command according to its structural function in the page. This
approach has three advantages: First, it can distinguish among more
fine-grained labels (typically 10-20 labels as opposed to 1-5 with visual
methods), which results in a more accurate and detailed document structure
resolution. Second, it can take into account the text flow across pages more
naturally compared to visual methods because it can concatenate the printing
commands of sequential pages. Last, our proposed method needs less memory and
it is computationally less expensive than visual methods. This allows us to
deploy such models in production environments at a much lower cost. Through
extensive architectural search in combination with advanced feature
engineering, we were able to implement a model that yields a weighted average
F1 score of 97% across 17 distinct structural labels. The best model we
achieved is currently served in production environments on our Corpus
Conversion Service (CCS), which was presented at KDD18 (arXiv:1806.02284). This
model enhances the capabilities of CCS significantly, as it eliminates the need
for human-annotated ground-truth labels for every unseen document layout. This
proved particularly useful when applied to a huge corpus of PDF articles
related to COVID-19.
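The input representation described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration: the specific fields (position, font size, bold flag) and normalizations are assumptions standing in for the paper's "advanced feature engineering", not its actual feature set.

```python
# Hypothetical sketch: encoding low-level PDF text-printing commands as
# numeric feature vectors for per-command sequence labeling. Field names
# and features are illustrative assumptions, not the paper's actual design.

def encode_command(cmd, page_width=612.0, page_height=792.0):
    """Turn one text-printing command into a feature vector:
    normalized position, font size, and simple style flags."""
    return [
        cmd["x"] / page_width,            # horizontal position on the page
        cmd["y"] / page_height,           # vertical position on the page
        cmd["font_size"] / 24.0,          # font size, roughly normalized
        1.0 if cmd["bold"] else 0.0,      # bold style flag
        len(cmd["text"]) / 100.0,         # text length as a weak signal
    ]

def encode_document(pages):
    """Concatenate the printing commands of sequential pages into one
    sequence, so text flow across page breaks is preserved — the cross-page
    advantage the abstract describes."""
    return [encode_command(c) for page in pages for c in page]

# Two toy pages: a heading command followed by body-text commands,
# with the body text continuing onto the second page.
pages = [
    [{"x": 72, "y": 720, "font_size": 18, "bold": True,  "text": "1. Introduction"},
     {"x": 72, "y": 690, "font_size": 10, "bold": False, "text": "PDF documents..."}],
    [{"x": 72, "y": 720, "font_size": 10, "bold": False, "text": "...continued text."}],
]
seq = encode_document(pages)
print(len(seq), len(seq[0]))  # one feature vector per printing command
```

A recurrent network (the paper's approach) would consume such a sequence and emit one of the 17 structural labels per command; concatenating pages before encoding, as above, is what lets the model follow text flow across page breaks without any visual rendering.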
Related papers
- Focus Anywhere for Fine-grained Multi-page Document Understanding [24.76897786595502]
This paper proposes Fox, an effective pipeline, hybrid data, and tuning strategy, that catalyzes LVLMs to focus anywhere on single/multi-page documents.
We employ multiple vision vocabularies to extract visual hybrid knowledge for interleaved document pages.
We render cross-vocabulary vision data as the foreground to achieve a full reaction of multiple visual vocabularies and in-document figure understanding.
arXiv Detail & Related papers (2024-05-23T08:15:49Z)
- Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism [12.289101189321181]
Document Visual Question Answering (Document VQA) has garnered significant interest from both the document understanding and natural language processing communities.
The state-of-the-art single-page Document VQA methods show impressive performance, yet in multi-page scenarios, these methods struggle.
We propose a novel method and efficient training strategy for multi-page Document VQA tasks.
arXiv Detail & Related papers (2024-04-29T18:07:47Z)
- GRAM: Global Reasoning for Multi-Page VQA [14.980413646626234]
We present GRAM, a method that seamlessly extends pre-trained single-page models to the multi-page setting.
To do so, we leverage a single-page encoder for local page-level understanding, and enhance it with document-level designated layers and learnable tokens.
For additional computational savings during decoding, we introduce an optional compression stage.
arXiv Detail & Related papers (2024-01-07T08:03:06Z)
- CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data [2.7843134136364265]
This paper proposes an efficient pipeline for creating a large-scale, diverse, multilingual corpus of PDF files from across the Internet using Common Crawl.
We also share the CCpdf corpus in the form of an index of PDF files along with a script for downloading them, which produces a collection useful for language model pretraining.
arXiv Detail & Related papers (2023-04-28T16:12:18Z)
- XDoc: Unified Pre-training for Cross-Format Document Understanding [84.63416346227176]
XDoc is a unified pre-trained model which deals with different document formats in a single model.
XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models.
arXiv Detail & Related papers (2022-10-06T12:07:18Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de-facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)
- One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task need to feed a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
- NP-DRAW: A Non-Parametric Structured Latent Variable Model for Image Generation [139.8037697822064]
We present a non-parametric structured latent variable model for image generation, called NP-DRAW.
It sequentially draws on a latent canvas in a part-by-part fashion and then decodes the image from the canvas.
arXiv Detail & Related papers (2021-06-25T05:17:55Z)
- Document Domain Randomization for Deep Learning Document Layout Extraction [37.97092983885967]
We present document domain randomization (DDR), the first successful transfer of convolutional neural networks (CNNs) trained only on graphically rendered pseudo-paper pages.
DDR renders pseudo-document pages by modeling randomized textual and non-textual contents of interest.
We show that high-fidelity semantic information is not necessary to label semantic classes, but that a style mismatch between training and test data can lower model accuracy.
arXiv Detail & Related papers (2021-05-20T19:16:04Z)
- LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding [49.941806975280045]
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks.
We present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework.
arXiv Detail & Related papers (2020-12-29T13:01:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.