PDFTriage: Question Answering over Long, Structured Documents
- URL: http://arxiv.org/abs/2309.08872v2
- Date: Wed, 8 Nov 2023 05:09:28 GMT
- Title: PDFTriage: Question Answering over Long, Structured Documents
- Authors: Jon Saad-Falcon, Joe Barrow, Alexa Siu, Ani Nenkova, David Seunghyun Yoon, Ryan A. Rossi, Franck Dernoncourt
- Abstract summary: Representing structured documents as plain text is incongruous with the user's mental model of these documents with rich structure.
We propose PDFTriage, which enables models to retrieve context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
- Score: 60.96667912964659
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have issues with document question answering
(QA) in situations where the document is unable to fit in the small context
length of an LLM. To overcome this issue, most existing works focus on
retrieving the relevant context from the document, representing it as plain
text. However, documents such as PDFs, web pages, and presentations are
naturally structured with different pages, tables, sections, and so on.
Representing such structured documents as plain text is incongruous with the
user's mental model of these documents with rich structure. When a system has
to query the document for context, this incongruity is brought to the fore, and
seemingly trivial questions can trip up the QA system. To bridge this
fundamental gap in handling structured documents, we propose an approach called
PDFTriage that enables models to retrieve the context based on either structure
or content. Our experiments demonstrate the effectiveness of the proposed
PDFTriage-augmented models across several classes of questions where existing
retrieval-augmented LLMs fail. To facilitate further research on this
fundamental problem, we release our benchmark dataset consisting of 900+
human-generated questions over 80 structured documents from 10 different
categories of question types for document QA. Our code and datasets will be
released soon on GitHub.
Related papers
- M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding [63.33447665725129]
We introduce M3DocRAG, a novel multi-modal RAG framework that flexibly accommodates various document contexts.
M3DocRAG can efficiently handle single or many documents while preserving visual information.
We also present M3DocVQA, a new benchmark for evaluating open-domain DocVQA over 3,000+ PDF documents with 40,000+ pages.
arXiv Detail & Related papers (2024-11-07T18:29:38Z)
- PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering [13.625303311724757]
Document Question Answering (QA) presents a challenge in understanding visually-rich documents (VRDs).
We propose PDF-MVQA, which is tailored for research journal articles, encompassing multiple pages and multimodal information retrieval.
arXiv Detail & Related papers (2024-04-19T09:00:05Z)
- JDocQA: Japanese Document Question Answering Dataset for Generative Language Models [15.950718839723027]
We introduce Japanese Document Question Answering (JDocQA), a large-scale document-based QA dataset.
It comprises 5,504 documents in PDF format and 11,600 annotated question-and-answer instances in Japanese.
We incorporate multiple categories of questions and unanswerable questions from the document for realistic question-answering applications.
arXiv Detail & Related papers (2024-03-28T14:22:54Z)
- DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z)
- DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
- PDFVQA: A New Dataset for Real-World VQA on PDF Documents [2.105395241374678]
Document-based Visual Question Answering examines the understanding of document images given natural language questions.
Our PDF-VQA dataset extends document understanding from the current scale, which is limited to a single document page, to a new scale that asks questions over full documents of multiple pages.
arXiv Detail & Related papers (2023-04-13T12:28:14Z)
- Cross-Modal Entity Matching for Visually Rich Documents [4.8119678510491815]
Visually rich documents utilize visual cues to augment their semantics.
Existing works that enable structured querying on these documents do not take this into account.
We propose Juno -- a cross-modal entity matching framework to address this limitation.
arXiv Detail & Related papers (2023-03-01T18:26:14Z)
- Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account information from multiple modalities, including text, layout, and visual image, to address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
- DocBank: A Benchmark Dataset for Document Layout Analysis [114.81155155508083]
We present DocBank, a benchmark dataset that contains 500K document pages with fine-grained token-level annotations for document layout analysis.
Experiment results show that models trained on DocBank accurately recognize the layout information for a variety of documents.
arXiv Detail & Related papers (2020-06-01T16:04:30Z)