Towards Complex Document Understanding By Discrete Reasoning
- URL: http://arxiv.org/abs/2207.11871v3
- Date: Thu, 4 May 2023 14:30:01 GMT
- Title: Towards Complex Document Understanding By Discrete Reasoning
- Authors: Fengbin Zhu, Wenqiang Lei, Fuli Feng, Chao Wang, Haozhou Zhang,
Tat-Seng Chua
- Abstract summary: Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes multi-modal information into account, including text, layout and visual image, to intelligently address different types of questions.
- Score: 77.91722463958743
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document Visual Question Answering (VQA) aims to understand visually-rich
documents to answer questions in natural language, which is an emerging
research topic for both Natural Language Processing and Computer Vision. In
this work, we introduce a new Document VQA dataset, named TAT-DQA, which
consists of 3,067 document pages containing semi-structured tables and
unstructured text, together with 16,558 question-answer pairs, built by
extending the TAT-QA dataset. These documents are sampled from real-world
financial reports and contain many numbers, so discrete reasoning capability
is required to answer questions on this dataset. Based on TAT-DQA, we further
develop a novel model named MHST that takes multi-modal information into
account, including text, layout and visual image, to intelligently address
different types of questions with corresponding strategies, i.e.,
extraction or reasoning. Extensive experiments show that the MHST model
significantly outperforms the baseline methods, demonstrating its
effectiveness. However, the performance still lags far behind that of expert
humans. We expect that our new TAT-DQA dataset would facilitate the research on
deep understanding of visually-rich documents combining vision and language,
especially for scenarios that require discrete reasoning. Also, we hope the
proposed model would inspire researchers to design more advanced Document VQA
models in the future. Our dataset will be publicly available for non-commercial use
at https://nextplusplus.github.io/TAT-DQA/.
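The abstract distinguishes questions that can be answered by extracting a span from the document from those that demand discrete reasoning over numbers (e.g., computing a year-over-year change from a financial table). The following Python sketch is a toy, hand-written illustration of that extraction-vs.-reasoning routing over a made-up table-text snippet; it is an assumption-laden placeholder, not the MHST model or the TAT-DQA pipeline, and every identifier in it is hypothetical.

```python
# Toy illustration of the two answering strategies named in the abstract
# (span extraction vs. discrete/arithmetic reasoning) over a tiny table-text
# "document". This is NOT the MHST implementation; the data and the
# rule-based routing below are purely illustrative assumptions.

from typing import Dict, Union

# Hypothetical document: a semi-structured table plus unstructured text,
# loosely mimicking the table-text mix found in TAT-DQA financial pages.
TABLE: Dict[str, Dict[str, float]] = {
    "Revenue": {"2019": 1200.0, "2020": 1500.0},
    "Net income": {"2019": 150.0, "2020": 210.0},
}
TEXT = "Revenue grew primarily due to higher subscription sales in 2020."


def answer(question: str) -> Union[str, float]:
    """Route a question to an extraction or a discrete-reasoning strategy."""
    q = question.lower()
    if "change" in q or "difference" in q:
        # Discrete reasoning: extract two numbers from the table and compute.
        row = "Revenue" if "revenue" in q else "Net income"
        return TABLE[row]["2020"] - TABLE[row]["2019"]
    if "why" in q:
        # Extraction: return a supporting span from the unstructured text.
        return TEXT
    # Extraction from the table: return a single cell value.
    row = "Revenue" if "revenue" in q else "Net income"
    year = "2020" if "2020" in q else "2019"
    return TABLE[row][year]


if __name__ == "__main__":
    print(answer("What was the change in revenue from 2019 to 2020?"))  # 300.0
    print(answer("Why did revenue grow?"))                              # text span
    print(answer("What was net income in 2019?"))                       # 150.0
```

In TAT-DQA, the strategy choice and the arithmetic are of course learned from multi-modal input (text, layout, visual image) rather than hard-coded as above; the sketch only makes the extraction/reasoning distinction concrete.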
Related papers
- CT2C-QA: Multimodal Question Answering over Chinese Text, Table and Chart [26.54501344351476]
We present CT2C-QA, a pioneering Chinese reasoning-based QA dataset that includes an extensive collection of text, tables, and charts.
Our dataset simulates real webpages and serves as a strong test of a model's ability to analyze and reason over multimodal data.
arXiv Detail & Related papers (2024-10-28T18:13:14Z)
- SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers [43.18330795060871]
SPIQA is a dataset specifically designed to interpret complex figures and tables within the context of scientific research articles.
We employ automatic and manual curation to create the dataset.
SPIQA comprises 270K questions divided into training, validation, and three different evaluation splits.
arXiv Detail & Related papers (2024-07-12T16:37:59Z)
- DCQA: Document-Level Chart Question Answering towards Complex Reasoning and Common-Sense Understanding [19.713647367008143]
We introduce a novel task named document-level chart question answering (DCQA).
The newly developed benchmark dataset comprises 50,010 synthetic documents integrating charts in a wide range of styles.
We present the development of a potent question-answer generation engine that employs table data, a rich color set, and basic question templates.
arXiv Detail & Related papers (2023-10-29T11:38:08Z)
- ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese [1.6340299456362617]
We introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese.
We conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations.
We present PhoVIT, a comprehensive multimodal fusion model that identifies objects in images based on questions.
arXiv Detail & Related papers (2023-10-27T10:44:50Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- QTSumm: Query-Focused Summarization over Tabular Data [58.62152746690958]
People primarily consult tables to conduct data analysis or answer specific questions.
We define a new query-focused table summarization task, where text generation models have to perform human-like reasoning.
We introduce a new benchmark named QTSumm for this task, which contains 7,111 human-annotated query-summary pairs over 2,934 tables.
arXiv Detail & Related papers (2023-05-23T17:43:51Z)
- Doc2SoarGraph: Discrete Reasoning over Visually-Rich Table-Text Documents via Semantic-Oriented Hierarchical Graphs [79.0426838808629]
We address the TAT-DQA task, i.e., answering questions over a visually-rich table-text document.
Specifically, we propose a novel Doc2SoarGraph framework with enhanced discrete reasoning capability.
We conduct extensive experiments on TAT-DQA dataset, and the results show that our proposed framework outperforms the best baseline model by 17.73% and 16.91% in terms of Exact Match (EM) and F1 score respectively on the test set.
arXiv Detail & Related papers (2023-05-03T07:30:32Z)
- TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance [71.76018597965378]
We build a new large-scale Question Answering dataset containing both Tabular And Textual data, named TAT-QA.
We propose a novel QA model termed TAGOP, which is capable of reasoning over both tables and text.
arXiv Detail & Related papers (2021-05-17T06:12:06Z)
- A survey on VQA: Datasets and Approaches [0.0]
Visual question answering (VQA) is a task that combines the techniques of computer vision and natural language processing.
This paper will review and analyze existing datasets, metrics, and models proposed for the VQA task.
arXiv Detail & Related papers (2021-05-02T08:50:30Z)
- Open Question Answering over Tables and Text [55.8412170633547]
In open question answering (QA), the answer to a question is produced by retrieving and then analyzing documents that might contain answers to the question.
Most open QA systems have considered only retrieving information from unstructured text.
We present a new large-scale dataset Open Table-and-Text Question Answering (OTT-QA) to evaluate performance on this task.
arXiv Detail & Related papers (2020-10-20T16:48:14Z)