Donut: Document Understanding Transformer without OCR
- URL: http://arxiv.org/abs/2111.15664v1
- Date: Tue, 30 Nov 2021 18:55:19 GMT
- Title: Donut: Document Understanding Transformer without OCR
- Authors: Geewook Kim, Teakgyu Hong, Moonbin Yim, Jinyoung Park, Jinyeong Yim,
Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park
- Abstract summary: We propose a novel VDU model that is end-to-end trainable without underpinning OCR framework.
Our approach achieves state-of-the-art performance on various document understanding tasks in public benchmark datasets and private industrial service datasets.
- Score: 17.397447819420695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding document images (e.g., invoices) has been an important research
topic and has many applications in document processing automation. Through the
latest advances in deep learning-based Optical Character Recognition (OCR),
current Visual Document Understanding (VDU) systems have come to be designed
based on OCR. Although such OCR-based approach promise reasonable performance,
they suffer from critical problems induced by the OCR, e.g., (1) expensive
computational costs and (2) performance degradation due to the OCR error
propagation. In this paper, we propose a novel VDU model that is end-to-end
trainable without underpinning OCR framework. To this end, we propose a new
task and a synthetic document image generator to pre-train the model to
mitigate the dependencies on large-scale real document images. Our approach
achieves state-of-the-art performance on various document understanding tasks
in public benchmark datasets and private industrial service datasets. Through
extensive experiments and analysis, we demonstrate the effectiveness of the
proposed model especially with consideration for a real-world application.
Related papers
- VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents [66.42579289213941]
Retrieval-augmented generation (RAG) is an effective technique that enables large language models to utilize external knowledge sources for generation.
In this paper, we introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline.
In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded using a VLM as an image and then retrieved to enhance the generation of a VLM.
arXiv Detail & Related papers (2024-10-14T15:04:18Z) - Enhancing Legal Case Retrieval via Scaling High-quality Synthetic Query-Candidate Pairs [67.54302101989542]
Legal case retrieval aims to provide similar cases as references for a given fact description.
Existing works mainly focus on case-to-case retrieval using lengthy queries.
Data scale is insufficient to satisfy the training requirements of existing data-hungry neural models.
arXiv Detail & Related papers (2024-10-09T06:26:39Z) - Large Language Models for Page Stream Segmentation [0.03495246564946555]
Page Stream (PSS) is an essential prerequisite for automated document processing at scale.
This paper introduces TABME++, an enhanced benchmark featuring commercial Optical Character Recognition (OCR) annotations.
We evaluate the performance of large language models (LLMs) on PSS, focusing on decoder-based models fine-tuned with parameter-efficient methods.
arXiv Detail & Related papers (2024-08-21T20:28:42Z) - DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer [12.966765239586994]
Multi- fonts, mixed scenes and complex layouts seriously affect the recognition accuracy of traditional OCR models.
We propose a parameter-efficient mixed text recognition method based on pre-trained OCR Transformer, namely DLoRA-TrOCR.
arXiv Detail & Related papers (2024-04-19T09:28:16Z) - mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document
Understanding [55.4806974284156]
Document understanding refers to automatically extract, analyze and comprehend information from digital documents, such as a web page.
Existing Multi-model Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z) - User-Centric Evaluation of OCR Systems for Kwak'wala [92.73847703011353]
We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z) - OCR-IDL: OCR Annotations for Industry Document Library Dataset [8.905920197601171]
We make public the OCR annotations for IDL documents using commercial OCR engine.
The contributed dataset (OCR-IDL) has an estimated monetary value over 20K US$.
arXiv Detail & Related papers (2022-02-25T21:30:48Z) - DocScanner: Robust Document Image Rectification with Progressive
Learning [162.03694280524084]
This work presents DocScanner, a new deep network architecture for document image rectification.
DocScanner maintains a single estimate of the rectified image, which is progressively corrected with a recurrent architecture.
The iterative refinements make DocScanner converge to a robust and superior performance, and the lightweight recurrent architecture ensures the running efficiency.
arXiv Detail & Related papers (2021-10-28T09:15:02Z) - One-shot Key Information Extraction from Document with Deep Partial
Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task need to feed a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z) - Lights, Camera, Action! A Framework to Improve NLP Accuracy over OCR
documents [2.6201102730518606]
We demonstrate an effective framework for mitigating OCR errors for any downstream NLP task.
We first address the data scarcity problem for model training by constructing a document synthesis pipeline.
For the benefit of the community, we have made the document synthesis pipeline available as an open-source project.
arXiv Detail & Related papers (2021-08-06T00:32:54Z) - Unknown-box Approximation to Improve Optical Character Recognition
Performance [7.805544279853116]
A novel approach is presented for creating a customized preprocessor for a given OCR engine.
Experiments with two datasets and two OCR engines show that the presented preprocessor is able to improve the accuracy of the OCR up to 46% from the baseline.
arXiv Detail & Related papers (2021-05-17T16:09:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.