DUBLIN -- Document Understanding By Language-Image Network
- URL: http://arxiv.org/abs/2305.14218v4
- Date: Fri, 27 Oct 2023 15:08:31 GMT
- Title: DUBLIN -- Document Understanding By Language-Image Network
- Authors: Kriti Aggarwal, Aditi Khandelwal, Kumar Tanmay, Owais Mohammed Khan,
Qiang Liu, Monojit Choudhury, Hardik Hansrajbhai Chauhan, Subhojit Som,
Vishrav Chaudhary, Saurabh Tiwary
- Abstract summary: We propose DUBLIN, which is pretrained on web pages using three novel objectives.
We show that DUBLIN is the first pixel-based model to achieve an EM of 77.75 and F1 of 84.25 on the WebSRC dataset.
We also achieve competitive performance on RVL-CDIP document classification.
- Score: 37.42637168606938
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Visual document understanding is a complex task that involves analyzing both
the text and the visual elements in document images. Existing models often rely
on manual feature engineering or domain-specific pipelines, which limit their
generalization ability across different document types and languages. In this
paper, we propose DUBLIN, which is pretrained on web pages using three novel
objectives: Masked Document Text Generation Task, Bounding Box Task, and
Rendered Question Answering Task, which leverage both the spatial and semantic
information in the document images. Our model achieves competitive or
state-of-the-art results on several benchmarks, such as Web-Based Structural
Reading Comprehension, Document Visual Question Answering, Key Information
Extraction, Diagram Understanding, and Table Question Answering. In particular,
we show that DUBLIN is the first pixel-based model to achieve an EM of 77.75
and F1 of 84.25 on the WebSRC dataset. We also show that our model outperforms
the current pixel-based SOTA models on DocVQA, InfographicsVQA, OCR-VQA and
AI2D datasets by 4.6%, 6.5%, 2.6% and 21%, respectively. We also achieve
competitive performance on RVL-CDIP document classification. Moreover, we
create new baselines for text-based datasets by rendering them as document
images to promote research in this direction.
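To make the last point concrete, below is a minimal, hypothetical sketch of how a text-only example could be rendered as a document image so a pixel-based model can consume it. It assumes Pillow; the function name, page size, font handling, and wrapping width are illustrative choices, not details taken from the paper.
```python
# Hypothetical sketch (not the authors' code): render plain text onto a white
# page so a text-only dataset can be treated as document images.
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_text_as_document_image(text: str,
                                  page_size=(1024, 1024),
                                  margin=32,
                                  line_height=24,
                                  wrap_chars=80) -> Image.Image:
    """Draw wrapped text onto a blank page, mimicking a document screenshot."""
    page = Image.new("RGB", page_size, color="white")
    draw = ImageDraw.Draw(page)
    font = ImageFont.load_default()  # swap in a TrueType font for realism
    y = margin
    for line in textwrap.wrap(text, width=wrap_chars):
        draw.text((margin, y), line, fill="black", font=font)
        y += line_height  # simple fixed line spacing
    return page

# Usage: render a QA-style example and save it as an image for a pixel model.
example = ("Question: Who proposed DUBLIN?  Context: DUBLIN is a pixel-based "
           "document understanding model pretrained on web pages.")
render_text_as_document_image(example).save("rendered_example.png")
```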
Related papers
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we craft a new VEGA dataset tailored for the IITC task on scientific content and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- Enhancing Document Information Analysis with Multi-Task Pre-training: A Robust Approach for Information Extraction in Visually-Rich Documents [8.49076413640561]
The model is pre-trained and subsequently fine-tuned for various document image analysis tasks.
The proposed model achieved impressive results across all tasks, with an accuracy of 95.87% on the RVL-CDIP dataset for document classification.
arXiv Detail & Related papers (2023-10-25T10:22:30Z)
- OpenLEAF: Open-Domain Interleaved Image-Text Generation and Evaluation [151.57313182844936]
We propose a new interleaved generation framework based on prompting large language models (LLMs) and pre-trained text-to-image (T2I) models, namely OpenLEAF.
For model assessment, we first propose to use large multi-modal models (LMMs) to evaluate the entity and style consistencies of open-domain interleaved image-text sequences.
arXiv Detail & Related papers (2023-10-11T17:58:33Z) - OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text
Documents [122.55393759474181]
We introduce OBELICS, an open web-scale filtered dataset of interleaved image-text documents.
We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content.
We train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks.
arXiv Detail & Related papers (2023-06-21T14:01:01Z) - AToMiC: An Image/Text Retrieval Test Collection to Support Multimedia
Content Creation [42.35572014527354]
The AToMiC dataset is designed to advance research in image/text cross-modal retrieval.
We leverage hierarchical structures and diverse domains of texts, styles, and types of images, as well as large-scale image-document associations embedded in Wikipedia.
AToMiC offers a testbed for scalable, diverse, and reproducible multimedia retrieval research.
arXiv Detail & Related papers (2023-04-04T17:11:34Z) - Unifying Vision, Text, and Layout for Universal Document Processing [105.36490575974028]
We propose a Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation.
Our method sets the state-of-the-art on 9 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites.
arXiv Detail & Related papers (2022-12-05T22:14:49Z) - Spatial Dual-Modality Graph Reasoning for Key Information Extraction [31.04597531115209]
We propose an end-to-end Spatial Dual-Modality Graph Reasoning method (SDMG-R) to extract key information from unstructured document images.
We release a new dataset named WildReceipt, which is collected and annotated for the evaluation of key information extraction from document images of unseen templates in the wild.
arXiv Detail & Related papers (2021-03-26T13:46:00Z) - LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document
Understanding [49.941806975280045]
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks.
We present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework.
arXiv Detail & Related papers (2020-12-29T13:01:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.