GRAM: Global Reasoning for Multi-Page VQA
- URL: http://arxiv.org/abs/2401.03411v2
- Date: Mon, 18 Mar 2024 09:47:24 GMT
- Title: GRAM: Global Reasoning for Multi-Page VQA
- Authors: Tsachi Blau, Sharon Fogel, Roi Ronen, Alona Golts, Roy Ganz, Elad Ben Avraham, Aviad Aberdam, Shahar Tsiper, Ron Litman
- Abstract summary: We present GRAM, a method that seamlessly extends pre-trained single-page models to the multi-page setting.
To do so, we leverage a single-page encoder for local page-level understanding, and enhance it with document-level designated layers and learnable tokens.
For additional computational savings during decoding, we introduce an optional compression stage.
- Score: 14.980413646626234
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The increasing use of transformer-based large language models brings forward the challenge of processing long sequences. In document visual question answering (DocVQA), leading methods focus on the single-page setting, while documents can span hundreds of pages. We present GRAM, a method that seamlessly extends pre-trained single-page models to the multi-page setting, without requiring computationally heavy pretraining. To do so, we leverage a single-page encoder for local page-level understanding, and enhance it with document-level designated layers and learnable tokens, facilitating the flow of information across pages for global reasoning. To ensure that our model utilizes the newly introduced document tokens, we propose a tailored bias adaptation method. For additional computational savings during decoding, we introduce an optional compression stage using our compression-transformer (C-Former), reducing the encoded sequence length, thereby allowing a tradeoff between quality and latency. Extensive experiments showcase GRAM's state-of-the-art performance on the benchmarks for multi-page DocVQA, demonstrating the effectiveness of our approach.
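The abstract's split between page-level understanding and document-level layers with learnable tokens can be pictured as two attention patterns. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the function names (`build_layout`, `page_level_mask`, `doc_level_mask`), the placement of document tokens at the start of each page, and the exact masking rules are all hypothetical choices made for this example.

```python
from typing import List, Tuple

# Each sequence position is described by (page_index, is_doc_token).
Layout = List[Tuple[int, bool]]


def build_layout(num_pages: int, tokens_per_page: int, doc_tokens_per_page: int) -> Layout:
    """Concatenate pages; each page starts with its (assumed) learnable doc tokens."""
    layout: Layout = []
    for page in range(num_pages):
        layout += [(page, True)] * doc_tokens_per_page
        layout += [(page, False)] * tokens_per_page
    return layout


def page_level_mask(layout: Layout) -> List[List[bool]]:
    """Assumed local pattern: position i may attend to j only if both lie on the same page."""
    return [[pi == pj for (pj, _) in layout] for (pi, _) in layout]


def doc_level_mask(layout: Layout) -> List[List[bool]]:
    """Assumed global pattern: doc tokens attend to doc tokens of every page."""
    return [[di and dj for (_, dj) in layout] for (_, di) in layout]


if __name__ == "__main__":
    layout = build_layout(num_pages=2, tokens_per_page=3, doc_tokens_per_page=1)
    for name, mask in (("page", page_level_mask(layout)), ("doc", doc_level_mask(layout))):
        print(name)
        for row in mask:
            print("".join("x" if allowed else "." for allowed in row))
```

In this toy layout, information can cross pages only through the document tokens: they exchange information in the document-level pattern, and the page-level pattern then lets each page's ordinary tokens read from that page's document tokens.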
Related papers
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding [54.532578213126065]
Most document understanding methods preserve all tokens within sub-images and treat them equally.
This neglects their different informativeness and leads to a significant increase in the number of image tokens.
We propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing.
arXiv Detail & Related papers (2024-07-19T16:11:15Z)
- PostDoc: Generating Poster from a Long Multimodal Document Using Deep Submodular Optimization [15.90651992769166]
A poster from a long input document can be considered as a one-page easy-to-read multimodal (text and images) summary presented on a nice template with good design elements.
We propose a novel deep submodular function which can be trained on ground truth summaries to extract multimodal content from the document.
arXiv Detail & Related papers (2024-05-30T16:16:25Z)
- Focus Anywhere for Fine-grained Multi-page Document Understanding [24.76897786595502]
This paper proposes Fox, an effective pipeline, hybrid data, and tuning strategy, that catalyzes LVLMs to focus anywhere on single/multi-page documents.
We employ multiple vision vocabularies to extract visual hybrid knowledge for interleaved document pages.
We render cross-vocabulary vision data as the foreground to achieve a full reaction of multiple visual vocabularies and in-document figure understanding.
arXiv Detail & Related papers (2024-05-23T08:15:49Z)
- Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism [12.289101189321181]
Document Visual Question Answering (Document VQA) has garnered significant interest from both the document understanding and natural language processing communities.
The state-of-the-art single-page Document VQA methods show impressive performance, yet in multi-page scenarios, these methods struggle.
We propose a novel method and efficient training strategy for multi-page Document VQA tasks.
arXiv Detail & Related papers (2024-04-29T18:07:47Z)
- Visually Guided Generative Text-Layout Pre-training for Document Intelligence [51.09853181377696]
We propose visually guided generative text-layout pre-training, named ViTLP.
Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence.
ViTLP can function as a native OCR model to localize and recognize texts of document images.
arXiv Detail & Related papers (2024-03-25T08:00:43Z)
- CFRet-DVQA: Coarse-to-Fine Retrieval and Efficient Tuning for Document Visual Question Answering [3.8065968624597324]
Document Visual Question Answering (DVQA) is a task that involves responding to queries based on the content of images.
Existing work is limited to locating information within a single page and does not facilitate cross-page question-and-answer interaction.
We introduce CFRet-DVQA, which focuses on retrieval and efficient tuning to address this critical issue effectively.
arXiv Detail & Related papers (2024-02-26T01:17:50Z)
- Readout Guidance: Learning Control from Diffusion Features [96.22155562120231]
We present Readout Guidance, a method for controlling text-to-image diffusion models with learned signals.
Readout Guidance uses readout heads, lightweight networks trained to extract signals from the features of a pre-trained, frozen diffusion model at every timestep.
These readouts can encode single-image properties, such as pose, depth, and edges; or higher-order properties that relate multiple images, such as correspondence and appearance similarity.
arXiv Detail & Related papers (2023-12-04T18:59:32Z)
- Context-Aware Classification of Legal Document Pages [7.306025535482021]
We present a simple but effective approach that overcomes the constraint on input length.
Specifically, we enhance the input with extra tokens carrying sequential information about previous pages.
Our experiments on two legal datasets, in English and Portuguese respectively, show that the proposed approach can significantly improve the performance of document page classification.
arXiv Detail & Related papers (2023-04-05T23:14:58Z)
- ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding [52.3895498789521]
We propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement.
We first rearrange input sequences in the serialization stage, then present a correlative pre-training task, reading order prediction, and learn the proper reading order of documents.
Experimental results show ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction and document question answering.
arXiv Detail & Related papers (2022-10-12T12:59:24Z)
- XDoc: Unified Pre-training for Cross-Format Document Understanding [84.63416346227176]
XDoc is a unified pre-trained model which deals with different document formats in a single model.
XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models.
arXiv Detail & Related papers (2022-10-06T12:07:18Z)
- Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.