AdaDocVQA: Adaptive Framework for Long Document Visual Question Answering in Low-Resource Settings
- URL: http://arxiv.org/abs/2508.13606v1
- Date: Tue, 19 Aug 2025 08:12:45 GMT
- Title: AdaDocVQA: Adaptive Framework for Long Document Visual Question Answering in Low-Resource Settings
- Authors: Haoxuan Li, Wei Song, Aofan Liu, Peiwu Qin
- Abstract summary: Document Visual Question Answering (Document VQA) faces significant challenges when processing long documents in low-resource environments. This paper presents AdaDocVQA, a unified adaptive framework addressing these challenges through three core innovations. Experiments on Japanese document VQA benchmarks demonstrate substantial improvements, with 83.04% accuracy on Yes/No questions.
- Score: 8.22650587342049
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document Visual Question Answering (Document VQA) faces significant challenges when processing long documents in low-resource environments due to context limitations and insufficient training data. This paper presents AdaDocVQA, a unified adaptive framework addressing these challenges through three core innovations: a hybrid text retrieval architecture for effective document segmentation, an intelligent data augmentation pipeline that automatically generates high-quality reasoning question-answer pairs with multi-level verification, and adaptive ensemble inference with dynamic configuration generation and early stopping mechanisms. Experiments on Japanese document VQA benchmarks demonstrate substantial improvements, with 83.04% accuracy on Yes/No questions, 52.66% on factual questions, and 44.12% on numerical questions in JDocQA, and 59% accuracy on the LAVA dataset. Ablation studies confirm meaningful contributions from each component, and our framework establishes new state-of-the-art results for Japanese document VQA while providing a scalable foundation for other low-resource languages and specialized domains. Our code is available at: https://github.com/Haoxuanli-Thu/AdaDocVQA.
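The abstract does not spell out the hybrid retrieval internals, but a common realization of hybrid text retrieval fuses a lexical score with a dense embedding score before selecting document segments. The sketch below is a minimal illustration under that assumption; the token-overlap scorers and the `alpha` fusion weight are stand-ins, not the paper's implementation.

```python
from collections import Counter

def lexical_scores(query: str, segments: list[str]) -> list[float]:
    """Term-frequency overlap, a crude stand-in for BM25."""
    q_tokens = query.lower().split()
    return [sum(Counter(seg.lower().split())[t] for t in q_tokens)
            for seg in segments]

def dense_scores(query: str, segments: list[str]) -> list[float]:
    """Placeholder for cosine similarity over sentence embeddings."""
    q = set(query.lower().split())
    return [len(q & set(s.lower().split())) / (len(q) or 1) for s in segments]

def _normalize(xs: list[float]) -> list[float]:
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]

def hybrid_rank(query: str, segments: list[str],
                alpha: float = 0.5, top_k: int = 3):
    """Fuse normalized lexical and dense scores, return the top-k segments."""
    lex = _normalize(lexical_scores(query, segments))
    den = _normalize(dense_scores(query, segments))
    fused = [alpha * l + (1 - alpha) * d for l, d in zip(lex, den)]
    order = sorted(range(len(segments)), key=fused.__getitem__, reverse=True)
    return [(segments[i], fused[i]) for i in order[:top_k]]
```

In a real system, `lexical_scores` would typically be BM25 and `dense_scores` a cosine similarity over sentence embeddings, with `alpha` tuned on validation data.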
Related papers
- ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering [54.72902502486611]
ReAG is a Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages. ReAG significantly outperforms prior methods, improving answer accuracy and providing interpretable reasoning grounded in retrieved evidence.
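The summary describes a critic model that filters retrieved passages before generation. A minimal sketch of that filtering step, assuming a hypothetical `critic(question, passage)` scorer returning a relevance score in [0, 1]; ReAG's actual critic architecture and threshold are not described in this summary.

```python
def filter_passages(question, passages, critic, threshold=0.5):
    """Keep only passages the critic judges relevant to the question."""
    scored = [(p, critic(question, p)) for p in passages]
    kept = [(p, s) for p, s in scored if s >= threshold]
    # Highest-scoring evidence first, for the downstream generator.
    return [p for p, _ in sorted(kept, key=lambda x: x[1], reverse=True)]
```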
arXiv Detail & Related papers (2025-11-27T19:01:02Z) - OIDA-QA: A Multimodal Benchmark for Analyzing the Opioid Industry Documents Archive [50.468138755368805]
The opioid crisis represents a significant moment in public health, reflected in the data and documents disclosed in the UCSF-JHU Opioid Industry Documents Archive (OIDA). In this paper, we tackle this challenge by organizing the original dataset according to document attributes.
arXiv Detail & Related papers (2025-11-13T03:27:32Z) - ChiMDQA: Towards Comprehensive Chinese Document QA with Fine-grained Evaluation [12.784082281917003]
ChiMDQA encompasses long-form documents from six distinct fields, consisting of 6,068 rigorously curated, high-quality question-answer pairs. The dataset guarantees both diversity and high quality, rendering it applicable to various NLP tasks such as document comprehension, knowledge extraction, and intelligent QA systems.
arXiv Detail & Related papers (2025-11-05T17:13:14Z) - Query Decomposition for RAG: Balancing Exploration-Exploitation [83.79639293409802]
RAG systems address complex user requests by decomposing them into subqueries, retrieving potentially relevant documents for each, and then aggregating them to generate an answer. We formulate query decomposition and document retrieval in an exploration-exploitation setting, where retrieving one document at a time builds a belief about the utility of a given sub-query. Our main finding is that estimating document relevance using rank information and human judgments yields a 35% gain in document-level precision, a 15% increase in alpha-nDCG, and better performance on the downstream task of long-form generation.
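The exploration-exploitation framing above resembles a multi-armed bandit over sub-queries: each retrieval is a pull that updates a belief about a sub-query's utility. A minimal UCB1-style sketch, with toy stand-ins for the retriever and the relevance estimate (the paper estimates relevance from rank information and human judgments instead):

```python
import math
import random

# Toy stand-ins; not the paper's retriever or relevance estimator.
def retrieve_next(subquery: str) -> str:
    return f"doc for '{subquery}' #{random.randint(0, 999)}"

def estimate_relevance(subquery: str, doc: str) -> float:
    return random.random()

def ucb_retrieval(subqueries: list[str], budget: int = 20, c: float = 1.4):
    """Retrieve one document at a time, balancing exploration of
    under-sampled sub-queries against exploitation of useful ones."""
    counts = {q: 0 for q in subqueries}
    means = {q: 0.0 for q in subqueries}
    collected = []
    for t in range(1, budget + 1):
        def ucb(q: str) -> float:
            if counts[q] == 0:               # sample each sub-query once
                return float("inf")
            return means[q] + c * math.sqrt(math.log(t) / counts[q])
        q = max(subqueries, key=ucb)
        doc = retrieve_next(q)
        r = estimate_relevance(q, doc)
        counts[q] += 1
        means[q] += (r - means[q]) / counts[q]  # incremental belief update
        collected.append(doc)
    return collected
```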
arXiv Detail & Related papers (2025-10-21T13:37:11Z) - PDF Retrieval Augmented Question Answering [14.617711623828248]
This paper presents an advancement in Question-Answering (QA) systems using a Retrieval Augmented Generation (RAG) framework. We seek to develop a comprehensive RAG-based QA system that will effectively address complex multimodal questions.
arXiv Detail & Related papers (2025-06-22T13:14:19Z) - Beyond Retrieval: Joint Supervision and Multimodal Document Ranking for Textbook Question Answering [3.6799953119508735]
We propose a novel approach to multimodal textbook question answering by introducing a mechanism for enhancing semantic representations. Our model, Joint Embedding Training With Ranking Supervision for Textbook Question Answering (JETRTQA), is a multimodal learning framework built on a retriever-generator architecture. We evaluate our method on the CK12-QA dataset and demonstrate that it significantly improves the discrimination between informative and irrelevant documents.
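The ranking supervision mentioned above can be illustrated with a margin-based objective that pushes scores of informative documents above irrelevant ones. JETRTQA's actual loss and encoder are not detailed in this summary, so the following is only a hedged sketch:

```python
import torch
import torch.nn.functional as F

def ranking_loss(pos_scores: torch.Tensor,
                 neg_scores: torch.Tensor,
                 margin: float = 0.2) -> torch.Tensor:
    # Hinge on (neg - pos + margin): zero once each informative document
    # outscores its irrelevant counterpart by at least `margin`.
    return F.relu(neg_scores - pos_scores + margin).mean()

# Illustrative usage: scores from a shared question-document embedding model.
pos = torch.tensor([0.9, 0.4])  # scores for informative documents
neg = torch.tensor([0.3, 0.5])  # scores for irrelevant documents
loss = ranking_loss(pos, neg)   # encourages pos > neg + margin
```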
arXiv Detail & Related papers (2025-05-17T13:23:54Z) - ELOQ: Resources for Enhancing LLM Detection of Out-of-Scope Questions [52.33835101586687]
We study out-of-scope questions, where the retrieved document appears semantically similar to the question but lacks the necessary information to answer it. We propose a guided hallucination-based approach, ELOQ, to automatically generate a diverse set of out-of-scope questions from post-cutoff documents.
arXiv Detail & Related papers (2024-10-18T16:11:29Z) - HiQA: A Hierarchical Contextual Augmentation RAG for Multi-Documents QA [13.000411428297813]
We present HiQA, an advanced multi-document question-answering (MDQA) framework that integrates cascading metadata into content and a multi-route retrieval mechanism.
We also release a benchmark called MasQA to evaluate and research in MDQA.
arXiv Detail & Related papers (2024-02-01T02:24:15Z) - PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these documents with rich structure.
We propose PDFTriage that enables models to retrieve the context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z) - Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
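The generate-then-read control flow is simple enough to show directly. A minimal sketch, assuming a chat-completion-style callable `llm(prompt) -> str`; GenRead's actual prompts and clustering details differ.

```python
def generate_then_read(question: str, llm, n_docs: int = 3) -> str:
    # Step 1: prompt the model to *generate* background documents
    # instead of retrieving them from a corpus.
    docs = [
        llm(f"Generate a background document that helps answer: {question}")
        for _ in range(n_docs)
    ]
    # Step 2: read the generated documents to produce the final answer.
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```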
arXiv Detail & Related papers (2022-09-21T01:30:59Z) - Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.