Robust PDF Document Conversion Using Recurrent Neural Networks
- URL: http://arxiv.org/abs/2102.09395v1
- Date: Thu, 18 Feb 2021 14:39:54 GMT
- Title: Robust PDF Document Conversion Using Recurrent Neural Networks
- Authors: Nikolaos Livathinos (1), Cesar Berrospi (1), Maksym Lysak (1), Viktor
Kuropiatnyk (1), Ahmed Nassar (1), Andre Carvalho (1), Michele Dolfi (1),
Christoph Auer (1), Kasper Dinkla (1), Peter Staar (1) ((1) IBM Research)
- Abstract summary: We present a novel approach to document structure recovery in PDF using recurrent neural networks.
We show how a sequence of PDF printing commands can be used as input into a neural network.
We implement a model that yields a weighted average F1 score of 97% across 17 distinct structural labels.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The number of published PDF documents has increased exponentially in recent
decades. There is a growing need to make their rich content discoverable to
information retrieval tools. In this paper, we present a novel approach to
document structure recovery in PDF using recurrent neural networks to process
the low-level PDF data representation directly, instead of relying on a visual
re-interpretation of the rendered PDF page, as has been proposed in previous
literature. We demonstrate how a sequence of PDF printing commands can be used
as input into a neural network and how the network can learn to classify each
printing command according to its structural function in the page. This
approach has three advantages: First, it can distinguish among more
fine-grained labels (typically 10-20 labels as opposed to 1-5 with visual
methods), which results in a more accurate and detailed document structure
resolution. Second, it can take into account the text flow across pages more
naturally compared to visual methods because it can concatenate the printing
commands of sequential pages. Last, our proposed method needs less memory and
is computationally less expensive than visual methods. This allows us to
deploy such models in production environments at a much lower cost. Through
extensive architectural search in combination with advanced feature
engineering, we were able to implement a model that yields a weighted average
F1 score of 97% across 17 distinct structural labels. The best model we
achieved is currently served in production environments on our Corpus
Conversion Service (CCS), which was presented at KDD18 (arXiv:1806.02284). This
model enhances the capabilities of CCS significantly, as it eliminates the need
for human-annotated ground-truth labels for every unseen document layout. This
proved particularly useful when applied to a huge corpus of PDF articles
related to COVID-19.
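The approach described above can be sketched in miniature: each printing command is featurized and fed through a recurrent network that emits one structural label per command. The feature set, dimensions, and label subset below are illustrative assumptions, not the authors' actual feature engineering; this is a toy Elman RNN forward pass with random weights, shown only to make the input/output shape concrete.

```python
import numpy as np

# Hypothetical subset of the paper's 17 structural labels.
LABELS = ["title", "author", "paragraph", "table", "caption"]

def featurize(cmd):
    # Toy features for one printing command: normalized font size,
    # normalized x/y position on a US-letter page, and a bold flag.
    return np.array([cmd["font_size"] / 24.0, cmd["x"] / 612.0,
                     cmd["y"] / 792.0, float(cmd["bold"])])

rng = np.random.default_rng(0)
D, H, K = 4, 8, len(LABELS)
Wx = rng.normal(scale=0.1, size=(H, D))   # input-to-hidden weights
Wh = rng.normal(scale=0.1, size=(H, H))   # hidden-to-hidden recurrence
Wy = rng.normal(scale=0.1, size=(K, H))   # hidden-to-label projection

def classify_page(commands):
    """Run the RNN over the command sequence; one label per command."""
    h = np.zeros(H)
    labels = []
    for cmd in commands:
        h = np.tanh(Wx @ featurize(cmd) + Wh @ h)        # recurrent update
        logits = Wy @ h
        p = np.exp(logits - logits.max()); p /= p.sum()  # softmax
        labels.append(LABELS[int(p.argmax())])
    return labels

page = [{"font_size": 18, "x": 72, "y": 720, "bold": True},
        {"font_size": 10, "x": 72, "y": 680, "bold": False}]
print(classify_page(page))
```

The production model is of course far richer in recurrence, features, and label scheme; the point here is only the shape of the task: a sequence of low-level printing commands in, one structural label per command out. Because the recurrent state carries across commands, concatenating the commands of consecutive pages lets the model follow text flow across page boundaries, as the abstract notes.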
Related papers
- PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Document understanding is a challenging task that requires processing and comprehending large amounts of textual and visual information.
Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task.
We introduce PDF-WuKong, a multimodal large language model (MLLM) which is designed to enhance multimodal question-answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z)
- DocMamba: Efficient Document Pre-training with State Space Model [56.84200017560988]
We present DocMamba, a novel framework based on the state space model.
It is designed to reduce computational complexity to linear while preserving global modeling capabilities.
Experiments on the HRDoc dataset confirm DocMamba's potential for length extrapolation.
arXiv Detail & Related papers (2024-09-18T11:34:28Z)
- mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding [103.05835688963947]
We propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens.
DocOwl2 sets a new state-of-the-art across multi-page document understanding benchmarks and reduces first token latency by more than 50%.
Compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens.
arXiv Detail & Related papers (2024-09-05T11:09:00Z)
- Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism [12.289101189321181]
Document Visual Question Answering (Document VQA) has garnered significant interest from both the document understanding and natural language processing communities.
The state-of-the-art single-page Document VQA methods show impressive performance, yet in multi-page scenarios, these methods struggle.
We propose a novel method and efficient training strategy for multi-page Document VQA tasks.
arXiv Detail & Related papers (2024-04-29T18:07:47Z)
- GRAM: Global Reasoning for Multi-Page VQA [14.980413646626234]
We present GRAM, a method that seamlessly extends pre-trained single-page models to the multi-page setting.
To do so, we leverage a single-page encoder for local page-level understanding, and enhance it with document-level designated layers and learnable tokens.
For additional computational savings during decoding, we introduce an optional compression stage.
arXiv Detail & Related papers (2024-01-07T08:03:06Z)
- CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data [2.7843134136364265]
This paper proposes an efficient pipeline for creating a large-scale, diverse, multilingual corpus of PDF files from all over the Internet using Common Crawl.
We also share the CCpdf corpus in the form of an index of PDF files along with a script for downloading them, which produces a collection useful for language model pretraining.
arXiv Detail & Related papers (2023-04-28T16:12:18Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure on the search space: using all n-grams in a passage as its possible identifiers.
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)
- Document Domain Randomization for Deep Learning Document Layout Extraction [37.97092983885967]
We present document domain randomization (DDR), the first successful transfer of convolutional neural networks (CNNs) trained only on graphically rendered pseudo-paper pages to real document layout extraction.
DDR renders pseudo-document pages by modeling randomized textual and non-textual contents of interest.
We show that high-fidelity semantic information is not necessary to label semantic classes, but that a style mismatch between training and test data can lower model accuracy.
arXiv Detail & Related papers (2021-05-20T19:16:04Z)
- LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding [49.941806975280045]
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks.
We present LayoutLMv2, which pre-trains text, layout and image in a multi-modal framework.
arXiv Detail & Related papers (2020-12-29T13:01:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.