CFRet-DVQA: Coarse-to-Fine Retrieval and Efficient Tuning for Document
Visual Question Answering
- URL: http://arxiv.org/abs/2403.00816v1
- Date: Mon, 26 Feb 2024 01:17:50 GMT
- Title: CFRet-DVQA: Coarse-to-Fine Retrieval and Efficient Tuning for Document
Visual Question Answering
- Authors: Jinxu Zhang, Yongqi Yu, Yu Zhang
- Abstract summary: Document Visual Question Answering (DVQA) is a task that involves responding to queries based on the content of images.
Existing work is limited to locating information within a single page and does not facilitate cross-page question-and-answer interaction.
We introduce CFRet-DVQA, which focuses on retrieval and efficient tuning to address this critical issue effectively.
- Score: 3.8065968624597324
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Document Visual Question Answering (DVQA) is a task that involves responding
to queries based on the content of images. Existing work is limited to locating
information within a single page and does not facilitate cross-page
question-and-answer interaction. Furthermore, the token length limitation
imposed on inputs to the model may lead to truncation of segments pertinent to
the answer. In this study, we introduce a simple but effective methodology
called CFRet-DVQA, which focuses on retrieval and efficient tuning to address
these issues. To that end, we first retrieve multiple
segments from the document that are relevant to the question at hand.
Subsequently, we leverage the advanced reasoning abilities of the large
language model (LLM), further augmenting its performance through instruction
tuning. This approach enables the generation of answers that align with the
style of the document labels. The experiments demonstrate that our methodology
achieved state-of-the-art or competitive results with both single-page and
multi-page documents in various fields.
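
The retrieval step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes plain bag-of-words cosine similarity in place of a learned embedding model, and the function names (`coarse_to_fine_retrieve`, `build_prompt`), the segment size, and the top-k values are all hypothetical choices made for illustration. The coarse stage ranks whole pages against the question, the fine stage ranks short segments inside the surviving pages, and the selected segments are placed in a prompt that could be passed to an instruction-tuned LLM.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_segments(page_text, size=12):
    """Cut a page into consecutive segments of at most `size` words."""
    words = page_text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def coarse_to_fine_retrieve(question, pages, top_pages=2, top_segments=2, size=12):
    """Return (page_index, segment) pairs most similar to the question.

    Assumption: similarity is bag-of-words cosine, a stand-in for whatever
    retriever a real system would use.
    """
    q = Counter(tokenize(question))

    # Coarse stage: keep only the pages that best match the question.
    page_ids = sorted(
        range(len(pages)),
        key=lambda i: cosine(q, Counter(tokenize(pages[i]))),
        reverse=True,
    )[:top_pages]

    # Fine stage: score the segments of the surviving pages.
    candidates = []
    for i in page_ids:
        for segment in split_segments(pages[i], size):
            score = cosine(q, Counter(tokenize(segment)))
            candidates.append((score, i, segment))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [(page, segment) for _, page, segment in candidates[:top_segments]]


def build_prompt(question, segments):
    """Assemble retrieved segments and the question into a prompt string."""
    context = "\n".join(f"[page {page + 1}] {text}" for page, text in segments)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

Because only the retrieved segments enter the prompt, the input stays short even for multi-page documents, which is the motivation the abstract gives for retrieving before generating.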
Related papers
- Multi-Page Document Visual Question Answering using Self-Attention Scoring Mechanism [12.289101189321181]
Document Visual Question Answering (Document VQA) has garnered significant interest from both the document understanding and natural language processing communities.
The state-of-the-art single-page Document VQA methods show impressive performance, yet in multi-page scenarios, these methods struggle.
We propose a novel method and efficient training strategy for multi-page Document VQA tasks.
arXiv Detail & Related papers (2024-04-29T18:07:47Z)
- GRAM: Global Reasoning for Multi-Page VQA [14.980413646626234]
We present GRAM, a method that seamlessly extends pre-trained single-page models to the multi-page setting.
To do so, we leverage a single-page encoder for local page-level understanding, and enhance it with document-level designated layers and learnable tokens.
For additional computational savings during decoding, we introduce an optional compression stage.
arXiv Detail & Related papers (2024-01-07T08:03:06Z)
- Enhancing BERT-Based Visual Question Answering through Keyword-Driven Sentence Selection [8.586466827855016]
The Document-based Visual Question Answering competition addresses the automatic detection of parent-child relationships in documents.
This paper describes PoliTo's approach to this task; in particular, our best solution explores a text-only approach.
Thanks to the effectiveness of this approach, we are able to achieve high performance compared to baselines.
arXiv Detail & Related papers (2023-10-13T22:43:55Z)
- Information Extraction from Documents: Question Answering vs Token Classification in real-world setups [0.0]
We compare the Question Answering approach with the classical token classification approach for document key information extraction.
Our research showed that when dealing with clean and relatively short entities, it is still best to use a token classification-based approach.
arXiv Detail & Related papers (2023-04-21T14:43:42Z)
- CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion [68.19934563919192]
We propose a curriculum sampling strategy that utilizes pseudo queries during training and progressively enhances the relevance between the generated query and the real query.
Experimental results on both in-domain and out-of-domain datasets demonstrate that our approach outperforms previous dense retrieval models.
arXiv Detail & Related papers (2022-12-18T15:57:46Z)
- Multi-View Document Representation Learning for Open-Domain Dense Retrieval [87.11836738011007]
This paper proposes a multi-view document representation learning framework.
It aims to produce multi-view embeddings to represent documents and enforce them to align with different queries.
Experiments show our method outperforms recent works and achieves state-of-the-art results.
arXiv Detail & Related papers (2022-03-16T03:36:38Z)
- Modeling Endorsement for Multi-Document Abstractive Summarization [10.166639983949887]
A crucial difference between single- and multi-document summarization is how salient content manifests itself in the document(s).
In this paper, we model the cross-document endorsement effect and its utilization in multiple document summarization.
Our method generates a synopsis from each document, which serves as an endorser to identify salient content from other documents.
arXiv Detail & Related papers (2021-10-15T03:55:42Z)
- Tradeoffs in Sentence Selection Techniques for Open-Domain Question Answering [54.541952928070344]
We describe two groups of models for sentence selection: QA-based approaches, which run a full-fledged QA system to identify answer candidates, and retrieval-based models, which find parts of each passage specifically related to each question.
We show that very lightweight QA models can do well at this task, but retrieval-based models are faster still.
arXiv Detail & Related papers (2020-09-18T23:39:15Z)
- Answering Any-hop Open-domain Questions with Iterative Document Reranking [62.76025579681472]
We propose a unified QA framework to answer any-hop open-domain questions.
Our method consistently achieves performance comparable to or better than the state-of-the-art on both single-hop and multi-hop open-domain QA datasets.
arXiv Detail & Related papers (2020-09-16T04:31:38Z)
- Knowledge-Aided Open-Domain Question Answering [58.712857964048446]
We propose a knowledge-aided open-domain QA (KAQA) method that aims to improve relevant document retrieval and answer reranking.
During document retrieval, a candidate document is scored by considering its relationship to the question and other documents.
During answer reranking, a candidate answer is reranked using not only its own context but also the clues from other documents.
arXiv Detail & Related papers (2020-06-09T13:28:57Z)
- Query Focused Multi-Document Summarization with Distant Supervision [88.39032981994535]
Existing work relies heavily on retrieval-style methods for estimating the relevance between queries and text segments.
We propose a coarse-to-fine modeling framework which introduces separate modules for estimating whether segments are relevant to the query.
We demonstrate that our framework outperforms strong comparison systems on standard QFS benchmarks.
arXiv Detail & Related papers (2020-04-06T22:35:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.