Donut: Document Understanding Transformer without OCR
- URL: http://arxiv.org/abs/2111.15664v1
- Date: Tue, 30 Nov 2021 18:55:19 GMT
- Title: Donut: Document Understanding Transformer without OCR
- Authors: Geewook Kim, Teakgyu Hong, Moonbin Yim, Jinyoung Park, Jinyeong Yim,
Wonseok Hwang, Sangdoo Yun, Dongyoon Han, Seunghyun Park
- Abstract summary: We propose a novel VDU model that is end-to-end trainable without underpinning OCR framework.
Our approach achieves state-of-the-art performance on various document understanding tasks in public benchmark datasets and private industrial service datasets.
- Score: 17.397447819420695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding document images (e.g., invoices) has been an important research
topic and has many applications in document processing automation. Through the
latest advances in deep learning-based Optical Character Recognition (OCR),
current Visual Document Understanding (VDU) systems have come to be designed
based on OCR. Although such OCR-based approach promise reasonable performance,
they suffer from critical problems induced by the OCR, e.g., (1) expensive
computational costs and (2) performance degradation due to the OCR error
propagation. In this paper, we propose a novel VDU model that is end-to-end
trainable without underpinning OCR framework. To this end, we propose a new
task and a synthetic document image generator to pre-train the model to
mitigate the dependencies on large-scale real document images. Our approach
achieves state-of-the-art performance on various document understanding tasks
in public benchmark datasets and private industrial service datasets. Through
extensive experiments and analysis, we demonstrate the effectiveness of the
proposed model especially with consideration for a real-world application.
Related papers
- DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer [12.966765239586994]
Multi- fonts, mixed scenes and complex layouts seriously affect the recognition accuracy of traditional OCR models.
We propose a parameter-efficient mixed text recognition method based on pre-trained OCR Transformer, namely DLoRA-TrOCR.
arXiv Detail & Related papers (2024-04-19T09:28:16Z) - EfficientOCR: An Extensible, Open-Source Package for Efficiently
Digitizing World Knowledge [1.8434042562191815]
EffOCR is a novel open-source optical character recognition (OCR) package.
It meets both the computational and sample efficiency requirements for liberating texts at scale.
EffOCR is cheap and sample efficient to train, as the model only needs to learn characters' visual appearance and not how they are used in sequence to form language.
arXiv Detail & Related papers (2023-10-16T04:20:16Z) - mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document
Understanding [55.4806974284156]
Document understanding refers to automatically extract, analyze and comprehend information from digital documents, such as a web page.
Existing Multi-model Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z) - Zero-shot Composed Text-Image Retrieval [72.43790281036584]
We consider the problem of composed image retrieval (CIR)
It aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's expression ability.
arXiv Detail & Related papers (2023-06-12T17:56:01Z) - Visual Information Extraction in the Wild: Practical Dataset and
End-to-end Solution [48.693941280097974]
We propose a large-scale dataset consisting of camera images for visual information extraction (VIE)
We propose a novel framework for end-to-end VIE that combines the stages of OCR and information extraction in an end-to-end learning fashion.
We evaluate the existing end-to-end methods for VIE on the proposed dataset and observe that the performance of these methods has a distinguishable drop from SROIE to our proposed dataset due to the larger variance of layout and entities.
arXiv Detail & Related papers (2023-05-12T14:11:47Z) - User-Centric Evaluation of OCR Systems for Kwak'wala [92.73847703011353]
We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z) - OCR-IDL: OCR Annotations for Industry Document Library Dataset [8.905920197601171]
We make public the OCR annotations for IDL documents using commercial OCR engine.
The contributed dataset (OCR-IDL) has an estimated monetary value over 20K US$.
arXiv Detail & Related papers (2022-02-25T21:30:48Z) - DocScanner: Robust Document Image Rectification with Progressive
Learning [162.03694280524084]
This work presents DocScanner, a new deep network architecture for document image rectification.
DocScanner maintains a single estimate of the rectified image, which is progressively corrected with a recurrent architecture.
The iterative refinements make DocScanner converge to a robust and superior performance, and the lightweight recurrent architecture ensures the running efficiency.
arXiv Detail & Related papers (2021-10-28T09:15:02Z) - One-shot Key Information Extraction from Document with Deep Partial
Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task need to feed a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z) - Lights, Camera, Action! A Framework to Improve NLP Accuracy over OCR
documents [2.6201102730518606]
We demonstrate an effective framework for mitigating OCR errors for any downstream NLP task.
We first address the data scarcity problem for model training by constructing a document synthesis pipeline.
For the benefit of the community, we have made the document synthesis pipeline available as an open-source project.
arXiv Detail & Related papers (2021-08-06T00:32:54Z) - Unknown-box Approximation to Improve Optical Character Recognition
Performance [7.805544279853116]
A novel approach is presented for creating a customized preprocessor for a given OCR engine.
Experiments with two datasets and two OCR engines show that the presented preprocessor is able to improve the accuracy of the OCR up to 46% from the baseline.
arXiv Detail & Related papers (2021-05-17T16:09:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.