DocFormerv2: Local Features for Document Understanding
- URL: http://arxiv.org/abs/2306.01733v1
- Date: Fri, 2 Jun 2023 17:58:03 GMT
- Title: DocFormerv2: Local Features for Document Understanding
- Authors: Srikar Appalaraju, Peng Tang, Qi Dong, Nishant Sankaran, Yichu Zhou,
R. Manmatha
- Abstract summary: We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU).
The VDU domain entails understanding documents (beyond mere OCR predictions), e.g., extracting information from a form.
Our approach, termed DocFormerv2, is an encoder-decoder transformer which takes vision, language, and spatial features as input.
- Score: 15.669112678509522
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We propose DocFormerv2, a multi-modal transformer for Visual Document
Understanding (VDU). The VDU domain entails understanding documents (beyond
mere OCR predictions) e.g., extracting information from a form, VQA for
documents and other tasks. VDU is challenging as it needs a model to make sense
of multiple modalities (visual, language and spatial) to make a prediction. Our
approach, termed DocFormerv2, is an encoder-decoder transformer which takes
vision, language and spatial features as input. DocFormerv2 is pre-trained with
unsupervised tasks employed asymmetrically, i.e., two novel document tasks on
the encoder and one on the auto-regressive decoder. The unsupervised tasks have
been carefully designed to ensure that the pre-training encourages
local-feature alignment between multiple modalities. When evaluated on nine
datasets, DocFormerv2 shows state-of-the-art performance over strong baselines,
e.g. TabFact (4.3%), InfoVQA (1.4%), FUNSD (1%). Furthermore, to show
generalization capabilities, on three VQA tasks involving scene-text,
DocFormerv2 outperforms previous comparably-sized models and even does better
than much larger models (such as GIT2, PaLi and Flamingo) on some tasks.
Extensive ablations show that, due to its pre-training, DocFormerv2 understands
multiple modalities better than prior art in VDU.
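The abstract describes an encoder-decoder transformer fed with vision, language and spatial features. As a rough illustration only, the sketch below shows one way such inputs could be fused before a standard encoder-decoder; the class name, feature dimensions, coordinate bucketing and sum-based fusion are assumptions made for this sketch and are not taken from the DocFormerv2 paper.

```python
# Minimal sketch (PyTorch): fusing text, vision and spatial (bounding-box)
# features for a multi-modal document model. All names, sizes and the
# fusion-by-summation choice are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class MultiModalDocInputs(nn.Module):
    def __init__(self, vocab_size=30522, d_model=768, visual_dim=2048, num_buckets=1024):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)    # OCR word tokens
        self.visual_proj = nn.Linear(visual_dim, d_model)     # per-token visual features
        # Separate embeddings for the quantized x0, y0, x1, y1 box coordinates.
        self.x_emb = nn.Embedding(num_buckets, d_model)
        self.y_emb = nn.Embedding(num_buckets, d_model)

    def forward(self, token_ids, visual_feats, boxes):
        # token_ids:    (batch, seq)              long tensor of OCR token ids
        # visual_feats: (batch, seq, visual_dim)  visual features aligned to tokens
        # boxes:        (batch, seq, 4)           long tensor of bucketed box coordinates
        spatial = (self.x_emb(boxes[..., 0]) + self.y_emb(boxes[..., 1])
                   + self.x_emb(boxes[..., 2]) + self.y_emb(boxes[..., 3]))
        return self.token_emb(token_ids) + self.visual_proj(visual_feats) + spatial

# The fused embeddings would then feed a standard transformer encoder-decoder
# (e.g. nn.Transformer), with the decoder generating answers auto-regressively.
```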
Related papers
- Visually Guided Generative Text-Layout Pre-training for Document Intelligence [51.09853181377696]
We propose visually guided generative text-layout pre-training, named ViTLP.
Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence.
ViTLP can function as a native OCR model to localize and recognize texts of document images.
arXiv Detail & Related papers (2024-03-25T08:00:43Z) - DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding [91.17151775296234]
This work presents DocPedia, a novel large multimodal model (LMM) for versatile OCR-free document understanding.
Unlike existing work, which either struggles with high-resolution documents or gives up the large language model (constraining vision or language ability), DocPedia directly processes visual input in the frequency domain rather than in pixel space.
arXiv Detail & Related papers (2023-11-20T14:42:25Z) - Edit As You Wish: Video Caption Editing with Multi-grained User Control [61.76233268900959]
We propose a novel Video Caption Editing (VCE) task to automatically revise an existing video description guided by multi-grained user requests.
Inspired by human writing-revision habits, we design the user command as a pivotal triplet (operation, position, attribute) to cover diverse user needs from coarse-grained to fine-grained.
arXiv Detail & Related papers (2023-05-15T07:12:19Z) - Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich
Document Understanding [72.95838931445498]
Multi-modal document pre-trained models have proven to be very effective in a variety of visually-rich document understanding (VrDU) tasks.
However, the way they model and exploit the interactions between vision and language in documents has hindered them from achieving better generalization and higher accuracy.
In this work, we investigate the problem of vision-language joint representation learning for VrDU mainly from the perspective of supervisory signals.
arXiv Detail & Related papers (2022-06-27T09:58:34Z) - Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z) - LayoutLMv3: Pre-training for Document AI with Unified Text and Image
Masking [83.09001231165985]
We propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking.
The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks (a rough sketch of this masking idea appears after this list).
arXiv Detail & Related papers (2022-04-18T16:19:52Z) - DocFormer: End-to-End Transformer for Document Understanding [6.412887519128816]
We present DocFormer -- a multi-modal transformer-based architecture for the task of Visual Document Understanding (VDU).
VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts, etc.).
arXiv Detail & Related papers (2021-06-22T04:28:07Z)
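The LayoutLMv3 entry above mentions pre-training with unified text and image masking. The sketch below illustrates only the general masked-modeling idea (hide a random fraction of text tokens and image patches, then train the model to reconstruct them); the masking ratios, mask id, tensor shapes and the helper itself are assumptions for illustration, not LayoutLMv3's implementation.

```python
# Illustrative sketch of unified text-and-image masking for pre-training.
# Ratios, mask ids and shapes are assumptions, not taken from LayoutLMv3.
import torch

def mask_text_and_patches(token_ids, patch_feats, mask_token_id=103,
                          text_ratio=0.3, patch_ratio=0.4):
    # token_ids:   (batch, seq)          long tensor of text token ids
    # patch_feats: (batch, n_patches, d) image patch embeddings
    token_ids = token_ids.clone()
    patch_feats = patch_feats.clone()

    text_mask = torch.rand(token_ids.shape, device=token_ids.device) < text_ratio
    token_ids[text_mask] = mask_token_id          # hide the masked text tokens

    patch_mask = torch.rand(patch_feats.shape[:2], device=patch_feats.device) < patch_ratio
    patch_feats[patch_mask] = 0.0                 # zero out the masked patches

    # A pre-training objective would then ask the model to recover the original
    # tokens at text_mask positions and the original patches at patch_mask positions.
    return token_ids, patch_feats, text_mask, patch_mask
```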