Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism
- URL: http://arxiv.org/abs/2404.19024v1
- Date: Mon, 29 Apr 2024 18:07:47 GMT
- Title: Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism
- Authors: Lei Kang, Rubèn Tito, Ernest Valveny, Dimosthenis Karatzas
- Abstract summary: Document Visual Question Answering (Document VQA) has garnered significant interest from both the document understanding and natural language processing communities.
The state-of-the-art single-page Document VQA methods show impressive performance, yet in multi-page scenarios, these methods struggle.
We propose a novel method and efficient training strategy for multi-page Document VQA tasks.
- Score: 12.289101189321181
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Documents are 2-dimensional carriers of written communication, and as such their interpretation requires a multi-modal approach where textual and visual information are efficiently combined. Document Visual Question Answering (Document VQA), due to this multi-modal nature, has garnered significant interest from both the document understanding and natural language processing communities. The state-of-the-art single-page Document VQA methods show impressive performance, yet in multi-page scenarios, these methods struggle. They have to concatenate all pages into one large page for processing, demanding substantial GPU resources, even for evaluation. In this work, we propose a novel method and efficient training strategy for multi-page Document VQA tasks. In particular, we employ a visual-only document representation, leveraging the encoder from a document understanding model, Pix2Struct. Our approach utilizes a self-attention scoring mechanism to generate relevance scores for each document page, enabling the retrieval of pertinent pages. This adaptation allows us to extend single-page Document VQA models to multi-page scenarios without constraints on the number of pages during evaluation, all with minimal demand for GPU resources. Our extensive experiments demonstrate not only achieving state-of-the-art performance without the need for Optical Character Recognition (OCR), but also sustained performance in scenarios extending to documents of nearly 800 pages compared to a maximum of 20 pages in the MP-DocVQA dataset. Our code is publicly available at https://github.com/leitro/SelfAttnScoring-MPDocVQA.
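The core mechanism described in the abstract, scoring each page's relevance with self-attention over page-level embeddings from a visual encoder such as Pix2Struct's, can be sketched roughly as follows. This is a minimal illustration with hypothetical module and variable names, not the authors' released code.

```python
# Minimal sketch (not the authors' implementation): score pages with
# self-attention over per-page embeddings from a visual encoder such as
# Pix2Struct's. Module and variable names here are hypothetical.
import torch
import torch.nn as nn

class PageScorer(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)  # one relevance logit per page

    def forward(self, page_emb: torch.Tensor) -> torch.Tensor:
        # page_emb: (batch, num_pages, dim), e.g. pooled encoder outputs per page
        ctx, _ = self.attn(page_emb, page_emb, page_emb)  # pages attend to each other
        return self.score(ctx).squeeze(-1)                # (batch, num_pages) logits

# Usage: pick the most relevant page, then answer with a single-page VQA model.
scorer = PageScorer()
pages = torch.randn(1, 20, 768)          # embeddings for a 20-page document
probs = scorer(pages).softmax(dim=-1)    # per-page relevance scores
best_page = probs.argmax(dim=-1)         # index of the page to answer from
```

Because scoring operates on one embedding per page rather than on a concatenation of all page tokens, the number of pages at evaluation time is not constrained by GPU memory in this setup.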
Related papers
- M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding [63.33447665725129]
We introduce M3DocRAG, a novel multi-modal RAG framework that flexibly accommodates various document contexts.
M3DocRAG can efficiently handle single or many documents while preserving visual information.
We also present M3DocVQA, a new benchmark for evaluating open-domain DocVQA over 3,000+ PDF documents with 40,000+ pages.
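As a rough illustration of the retrieve-then-read flow that such a multi-modal RAG framework implies, the sketch below embeds page images, keeps the top-k pages most similar to the question, and passes only those to a multi-modal LLM. The function names are placeholders, not M3DocRAG's API.

```python
# Generic multi-modal retrieve-then-read sketch (placeholder callables,
# not the M3DocRAG API).
from typing import Callable, List
import numpy as np

def answer_over_pages(question: str,
                      page_images: List[np.ndarray],
                      embed_text: Callable[[str], np.ndarray],
                      embed_image: Callable[[np.ndarray], np.ndarray],
                      mllm_answer: Callable[[str, List[np.ndarray]], str],
                      k: int = 4) -> str:
    """Embed the question and every page image, retrieve the k most similar
    pages, and let a multi-modal LLM answer from those pages only."""
    q = embed_text(question)                                  # query embedding
    p = np.stack([embed_image(img) for img in page_images])   # page embeddings
    sims = p @ q / (np.linalg.norm(p, axis=1) * np.linalg.norm(q) + 1e-8)
    top_k = np.argsort(-sims)[:k]                             # most similar pages
    return mllm_answer(question, [page_images[i] for i in top_k])
```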
arXiv Detail & Related papers (2024-11-07T18:29:38Z)
- Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
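A minimal sketch of how interleaved segments might be pooled into a single document-level retrieval vector is shown below, assuming a generic vision-language encoder; it is illustrative only and not the paper's actual model.

```python
# Illustrative sketch (generic encoder, not the paper's model): embed a
# document's interleaved text/image/table segments with one vision-language
# encoder and mean-pool them into a single retrieval vector.
import numpy as np

def embed_document(segments, vlm_encode):
    """segments: list of (modality, content) pairs in reading order;
    vlm_encode: callable mapping (modality, content) to a fixed-size vector."""
    vecs = np.stack([vlm_encode(modality, content) for modality, content in segments])
    doc_vec = vecs.mean(axis=0)                       # holistic document embedding
    return doc_vec / (np.linalg.norm(doc_vec) + 1e-8) # unit-normalized for retrieval
```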
arXiv Detail & Related papers (2024-10-03T17:49:09Z)
- mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding [103.05835688963947]
We propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens.
DocOwl2 sets a new state-of-the-art across multi-page document understanding benchmarks and reduces first token latency by more than 50%.
Compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens.
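The compression idea, reducing many high-resolution visual tokens to a fixed budget of 324 per page, can be approximated with learned query tokens that cross-attend to the full token sequence. The sketch below is a generic cross-attention compressor, not the DocOwl2 DocCompressor implementation.

```python
# Schematic only (not the DocOwl2 module): compress a long sequence of
# high-resolution visual tokens into a fixed budget (324 here) by letting a
# small set of learned query tokens cross-attend to them.
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    def __init__(self, dim: int = 1024, num_queries: int = 324, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, hi_res_tokens: torch.Tensor) -> torch.Tensor:
        # hi_res_tokens: (batch, many_tokens, dim) from a high-resolution page crop
        q = self.queries.unsqueeze(0).expand(hi_res_tokens.size(0), -1, -1)
        compressed, _ = self.cross_attn(q, hi_res_tokens, hi_res_tokens)
        return compressed                       # (batch, 324, dim) per page

compressed = TokenCompressor()(torch.randn(2, 4096, 1024))  # 4096 -> 324 tokens
```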
arXiv Detail & Related papers (2024-09-05T11:09:00Z)
- μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context [26.820913216377903]
This work focuses on Regesta Pontificum Romanum, a large collection of papal registers.
Regesta are catalogs of summaries of other documents and, in some cases, are the only source of information about the content of such full-length documents.
arXiv Detail & Related papers (2024-08-28T09:01:18Z)
- Focus Anywhere for Fine-grained Multi-page Document Understanding [24.76897786595502]
This paper proposes Fox, an effective pipeline, hybrid data, and tuning strategy, that catalyzes LVLMs to focus anywhere on single/multi-page documents.
We employ multiple vision vocabularies to extract visual hybrid knowledge for interleaved document pages.
We render cross-vocabulary vision data as the foreground to fully engage the multiple visual vocabularies and to support in-document figure understanding.
arXiv Detail & Related papers (2024-05-23T08:15:49Z)
- Visually Guided Generative Text-Layout Pre-training for Document Intelligence [51.09853181377696]
We propose visually guided generative text-layout pre-training, named ViTLP.
Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence.
ViTLP can function as a native OCR model to localize and recognize texts of document images.
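To make the interleaved text-and-layout target concrete, the sketch below pairs each word with quantized bounding-box location tokens; this token format is assumed for illustration and is not necessarily ViTLP's exact tokenization.

```python
# Illustrative interleaved text-and-layout sequence (format assumed for
# illustration, not ViTLP's exact tokenization): each word is followed by its
# bounding box quantized to a discrete grid.
def interleave_text_layout(words, boxes, bins: int = 1000):
    """words: list of strings; boxes: list of (x0, y0, x1, y1) normalized to [0, 1]."""
    seq = []
    for word, (x0, y0, x1, y1) in zip(words, boxes):
        seq.append(word)
        seq.extend(f"<loc_{int(v * (bins - 1))}>" for v in (x0, y0, x1, y1))
    return seq

print(interleave_text_layout(["Invoice", "Total"],
                             [(0.1, 0.05, 0.3, 0.09), (0.1, 0.80, 0.25, 0.84)]))
```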
arXiv Detail & Related papers (2024-03-25T08:00:43Z)
- GRAM: Global Reasoning for Multi-Page VQA [14.980413646626234]
We present GRAM, a method that seamlessly extends pre-trained single-page models to the multi-page setting.
To do so, we leverage a single-page encoder for local page-level understanding, and enhance it with document-level designated layers and learnable tokens.
For additional computational savings during decoding, we introduce an optional compression stage.
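One way to realize "single-page encoder plus document-level layers and learnable tokens" is sketched below: each page carries a few learnable document tokens that are gathered and mixed across pages. This is a hedged approximation, not the GRAM code.

```python
# Hedged approximation (not the GRAM code): per-page tokens go through a shared
# page-level layer, while a few learnable "document tokens" per page are
# gathered and mixed by a document-level layer, allowing cross-page reasoning.
import torch
import torch.nn as nn

class PageDocBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8, doc_tokens: int = 4):
        super().__init__()
        self.doc_tok = nn.Parameter(torch.randn(doc_tokens, dim) * 0.02)
        self.page_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.doc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, pages: torch.Tensor) -> torch.Tensor:
        # pages: (num_pages, tokens_per_page, dim) for a single document
        n, d = pages.size(0), pages.size(-1)
        x = torch.cat([self.doc_tok.unsqueeze(0).expand(n, -1, -1), pages], dim=1)
        x = self.page_layer(x)                            # local, per-page attention
        docs = x[:, : self.doc_tok.size(0)]               # learnable tokens per page
        docs = self.doc_layer(docs.reshape(1, -1, d))     # cross-page mixing
        return x[:, self.doc_tok.size(0):]                # updated page tokens

updated = PageDocBlock()(torch.randn(20, 128, 768))  # 20 pages, 128 tokens each
```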
arXiv Detail & Related papers (2024-01-07T08:03:06Z)
- Multi-View Document Representation Learning for Open-Domain Dense Retrieval [87.11836738011007]
This paper proposes a multi-view document representation learning framework.
It aims to produce multi-view embeddings that represent a document and align with different queries.
Experiments show our method outperforms recent works and achieves state-of-the-art results.
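A common way to use multi-view document embeddings at retrieval time is to score a query against its best-matching view; the snippet below shows that generic max-over-views scoring, not the paper's exact training objective.

```python
# Generic max-over-views scoring sketch (not the paper's exact model): each
# document keeps several view embeddings, and a query is scored against the
# best-matching one.
import numpy as np

def multi_view_score(query_vec: np.ndarray, doc_views: np.ndarray) -> float:
    """query_vec: (dim,); doc_views: (num_views, dim) embeddings of one document."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-8)
    v = doc_views / (np.linalg.norm(doc_views, axis=1, keepdims=True) + 1e-8)
    return float((v @ q).max())   # different queries can align with different views

doc_views = np.random.randn(8, 768)          # e.g. 8 view embeddings per document
print(multi_view_score(np.random.randn(768), doc_views))
```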
arXiv Detail & Related papers (2022-03-16T03:36:38Z)
- One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task require a large number of labeled samples and learn separate models for each document type.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
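To illustrate the one-shot matching idea in generic terms, the sketch below aligns labeled field nodes from a single support document to candidate nodes of a query document with a simple bipartite assignment over node similarities; this stands in for, and is much simpler than, the paper's deep partial graph matching.

```python
# Generic one-shot matching sketch (a simple bipartite assignment, not the
# paper's partial graph matching): assign each labeled support-document field
# to its most compatible node in the query document.
import numpy as np
from scipy.optimize import linear_sum_assignment

def one_shot_match(support_feats: np.ndarray, query_feats: np.ndarray):
    """support_feats: (num_fields, dim) from one labeled document;
    query_feats: (num_candidates, dim) from the unlabeled document."""
    s = support_feats / (np.linalg.norm(support_feats, axis=1, keepdims=True) + 1e-8)
    q = query_feats / (np.linalg.norm(query_feats, axis=1, keepdims=True) + 1e-8)
    cost = -(s @ q.T)                          # maximize cosine similarity
    rows, cols = linear_sum_assignment(cost)   # support field i -> query node cols[i]
    return dict(zip(rows.tolist(), cols.tolist()))
```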
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
- SelfDoc: Self-Supervised Document Representation Learning [46.22910270334824]
SelfDoc is a task-agnostic pre-training framework for document image understanding.
Our framework exploits the positional, textual, and visual information of every semantically meaningful component in a document.
It achieves superior performance on multiple downstream tasks with significantly fewer document images used in the pre-training stage compared to previous works.
arXiv Detail & Related papers (2021-06-07T04:19:49Z)