DocFormer: End-to-End Transformer for Document Understanding
- URL: http://arxiv.org/abs/2106.11539v1
- Date: Tue, 22 Jun 2021 04:28:07 GMT
- Title: DocFormer: End-to-End Transformer for Document Understanding
- Authors: Srikar Appalaraju and Bhavan Jasani and Bhargava Urala Kota and
Yusheng Xie and R. Manmatha
- Abstract summary: We present DocFormer -- a multi-modal transformer-based architecture for the task of Visual Document Understanding (VDU).
VDU is a challenging problem that aims to understand documents in their varied formats (forms, receipts, etc.) and layouts.
- Score: 6.412887519128816
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present DocFormer -- a multi-modal transformer-based architecture for
the task of Visual Document Understanding (VDU). VDU is a challenging problem
that aims to understand documents in their varied formats (forms, receipts, etc.)
and layouts. In addition, DocFormer is pre-trained in an unsupervised fashion
using carefully designed tasks that encourage multi-modal interaction. DocFormer
uses text, vision, and spatial features and combines them using a novel
multi-modal self-attention layer. DocFormer also shares learned spatial
embeddings across modalities, which makes it easy for the model to correlate
text with visual tokens and vice versa. DocFormer is evaluated on 4 different
datasets, each with strong baselines. DocFormer achieves state-of-the-art
results on all of them, sometimes beating models 4x its size (in number of
parameters).
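As a rough illustration of the architecture described above (not the paper's code; all class and parameter names below are made up), a minimal PyTorch sketch of joint self-attention over text and visual tokens that share one learned spatial-embedding table could look like this. The actual DocFormer attention mechanism differs in its details.

```python
# Illustrative sketch only, not DocFormer's implementation: text and visual
# tokens are fused by joint self-attention, and one set of learned 2-D spatial
# embeddings (from bounding boxes) is shared by both modalities.
import torch
import torch.nn as nn


class SharedSpatialEmbedding(nn.Module):
    """Embeds quantized (x0, y0, x1, y1) boxes; reused for text and vision."""

    def __init__(self, d_model: int, num_buckets: int = 1024):
        super().__init__()
        self.x_emb = nn.Embedding(num_buckets, d_model)
        self.y_emb = nn.Embedding(num_buckets, d_model)

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (batch, seq, 4) with coordinates already quantized to buckets
        x0, y0, x1, y1 = boxes.unbind(-1)
        return (self.x_emb(x0) + self.y_emb(y0)
                + self.x_emb(x1) + self.y_emb(y1))


class MultiModalSelfAttentionBlock(nn.Module):
    """Joint self-attention over [text tokens ; visual tokens]."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.spatial = SharedSpatialEmbedding(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_feats, text_boxes, vis_feats, vis_boxes):
        # The same spatial embedding table is applied to both modalities,
        # which, per the abstract, helps correlate text and visual tokens.
        text = text_feats + self.spatial(text_boxes)
        vis = vis_feats + self.spatial(vis_boxes)
        tokens = torch.cat([text, vis], dim=1)          # (B, T+V, D)
        out, _ = self.attn(tokens, tokens, tokens)      # joint attention
        return self.norm(tokens + out)


# Toy usage: 16 OCR tokens and 9 visual patches for one document page.
B, T, V, D = 1, 16, 9, 768
block = MultiModalSelfAttentionBlock()
text_feats = torch.randn(B, T, D)
vis_feats = torch.randn(B, V, D)
text_boxes = torch.randint(0, 1024, (B, T, 4))
vis_boxes = torch.randint(0, 1024, (B, V, 4))
print(block(text_feats, text_boxes, vis_feats, vis_boxes).shape)  # (1, 25, 768)
```

The point the sketch tries to capture is the shared spatial embedding: one table serves both text and visual tokens, which the abstract credits with making cross-modal correlation easier.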
Related papers
- M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding [63.33447665725129]
We introduce M3DocRAG, a novel multi-modal RAG framework that flexibly accommodates various document contexts.
M3DocRAG can efficiently handle single or many documents while preserving visual information.
We also present M3DocVQA, a new benchmark for evaluating open-domain DocVQA over 3,000+ PDF documents with 40,000+ pages.
arXiv Detail & Related papers (2024-11-07T18:29:38Z) - Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
- Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z) - DocLLM: A layout-aware generative language model for multimodal document
understanding [12.093889265216205]
We present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents.
Our model focuses exclusively on bounding box information to incorporate the spatial layout structure.
We demonstrate that our solution outperforms SotA LLMs on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets.
arXiv Detail & Related papers (2023-12-31T22:37:52Z) - DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding [91.17151775296234]
- DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding [91.17151775296234]
This work presents DocPedia, a novel large multimodal model (LMM) for versatile OCR-free document understanding.
Unlike existing work, which either struggles with high-resolution documents or gives up the large language model (constraining vision or language ability), DocPedia directly processes visual input in the frequency domain rather than in pixel space.
arXiv Detail & Related papers (2023-11-20T14:42:25Z) - DocFormerv2: Local Features for Document Understanding [15.669112678509522]
- DocFormerv2: Local Features for Document Understanding [15.669112678509522]
We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU).
The VDU domain entails understanding documents (beyond mere OCR predictions), e.g., extracting information from a form.
Our approach, termed DocFormerv2, is an encoder-decoder transformer that takes vision, language, and spatial features as input.
arXiv Detail & Related papers (2023-06-02T17:58:03Z) - XDoc: Unified Pre-training for Cross-Format Document Understanding [84.63416346227176]
XDoc is a unified pre-trained model that handles different document formats within a single model.
XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models.
arXiv Detail & Related papers (2022-10-06T12:07:18Z) - Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z) - DocBank: A Benchmark Dataset for Document Layout Analysis [114.81155155508083]
We present DocBank, a benchmark dataset that contains 500K document pages with fine-grained token-level annotations for document layout analysis.
Experiment results show that models trained on DocBank accurately recognize the layout information for a variety of documents.
arXiv Detail & Related papers (2020-06-01T16:04:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.