NOAHQA: Numerical Reasoning with Interpretable Graph Question Answering Dataset
- URL: http://arxiv.org/abs/2109.10604v1
- Date: Wed, 22 Sep 2021 09:17:09 GMT
- Title: NOAHQA: Numerical Reasoning with Interpretable Graph Question Answering Dataset
- Authors: Qiyuan Zhang, Lei Wang, Sicheng Yu, Shuohang Wang, Yang Wang, Jing
Jiang, Ee-Peng Lim
- Abstract summary: We introduce NOAHQA, a bilingual QA dataset with questions requiring numerical reasoning with compound mathematical expressions.
We evaluate state-of-the-art QA models trained on existing QA datasets on NOAHQA and show that the best of them achieves an exact match score of only 55.5.
We also present a new QA model for generating reasoning graphs, whose reasoning graph scores still fall well short of human performance.
- Score: 26.782937852417454
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While diverse question answering (QA) datasets have been proposed and
contributed significantly to the development of deep learning models for QA
tasks, the existing datasets fall short in two aspects. First, we lack QA
datasets covering complex questions that involve answers as well as the
reasoning processes to get the answers. As a result, the state-of-the-art QA
research on numerical reasoning still focuses on simple calculations and does
not provide the mathematical expressions or evidence justifying the answers.
Second, the QA community has contributed much effort to improving the
interpretability of QA models. However, these models fail to explicitly show
the reasoning process, such as the evidence order for reasoning and the
interactions between different pieces of evidence. To address the above
shortcomings, we introduce NOAHQA, a conversational and bilingual QA dataset
with questions requiring numerical reasoning with compound mathematical
expressions. With NOAHQA, we develop an interpretable reasoning graph as well
as the appropriate evaluation metric to measure the answer quality. We evaluate
the state-of-the-art QA models trained using existing QA datasets on NOAHQA and
show that the best among them achieves an exact match score of only 55.5, while
human performance is 89.7. We also present a new QA model for generating
reasoning graphs, whose scores on the reasoning graph metric still show a large
gap from human performance, e.g., 28 points.
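To make the abstract concrete, below is a minimal, hypothetical sketch in Python of what a NOAHQA-style example could look like: a question answered by a compound mathematical expression, a small reasoning graph that makes the evidence order explicit, and exact-match scoring. The passage, field names, and graph schema are illustrative assumptions, not the dataset's actual format.

```python
# Minimal, hypothetical sketch of a NOAHQA-style example (field names, graph
# schema, and the toy passage are illustrative assumptions, not the actual
# dataset format).

example = {
    "passage": "Tom bought 3 boxes of pens. Each box holds 12 pens. "
               "He then gave away 5 pens.",
    "question": "How many pens does Tom have left?",
    "expression": "3 * 12 - 5",   # compound expression, not a single operation
    "answer": "31",
}

# A reasoning graph as node -> parent nodes: leaves are quantities drawn from
# the evidence, internal nodes are intermediate results, which makes the
# evidence order and the interactions between pieces of evidence explicit.
reasoning_graph = {
    "n1: 3 (boxes)": [],
    "n2: 12 (pens per box)": [],
    "n3: 5 (pens given away)": [],
    "n4: 3 * 12 = 36 (pens bought)": ["n1: 3 (boxes)", "n2: 12 (pens per box)"],
    "n5: 36 - 5 = 31 (answer)": ["n4: 3 * 12 = 36 (pens bought)",
                                 "n3: 5 (pens given away)"],
}

def exact_match(prediction: str, gold: str) -> int:
    """Strict string equality after light normalization (the usual EM metric)."""
    return int(prediction.strip().lower() == gold.strip().lower())

# Evaluate the compound expression and score a prediction against the gold answer.
predicted = str(eval(example["expression"]))          # toy, trusted expression only
print(predicted)                                      # -> 31
print(exact_match(predicted, example["answer"]))      # -> 1

# The graph's topological order spells out the reasoning steps.
for node, parents in reasoning_graph.items():
    print(node, "<-", parents if parents else "evidence")
```

A reasoning-graph metric would compare a model's predicted graph against such a gold graph (nodes and edges), which is presumably where the 28-point gap cited above arises, as opposed to answer-level exact match.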
Related papers
- GoT-CQA: Graph-of-Thought Guided Compositional Reasoning for Chart Question Answering [12.485921065840294]
Chart Question Answering (CQA) aims at answering questions based on the visual chart content.
We propose a novel Graph-of-Thought (GoT) guided compositional reasoning model called GoT-CQA.
GoT-CQA achieves outstanding performance, especially in complex human-written and reasoning questions.
arXiv Detail & Related papers (2024-09-04T10:56:05Z)
- Generalizing Visual Question Answering from Synthetic to Human-Written Questions via a Chain of QA with a Large Language Model [4.41132900194195]
We propose a new method called chain of QA for human-written questions (CoQAH).
CoQAH utilizes a sequence of QA interactions between a large language model and a VQA model trained on synthetic data to reason and derive logical answers for human-written questions.
We tested the effectiveness of CoQAH on two types of human-written VQA datasets for 3D-rendered and chest X-ray images.
arXiv Detail & Related papers (2024-01-12T06:49:49Z)
- SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- An Empirical Comparison of LM-based Question and Answer Generation Methods [79.31199020420827]
Question and answer generation (QAG) consists of generating a set of question-answer pairs given a context.
In this paper, we establish baselines with three different QAG methodologies that leverage sequence-to-sequence language model (LM) fine-tuning.
Experiments show that an end-to-end QAG model, which is computationally light at both training and inference times, is generally robust and outperforms other more convoluted approaches.
arXiv Detail & Related papers (2023-05-26T14:59:53Z)
- Improving Unsupervised Question Answering via Summarization-Informed Question Generation [47.96911338198302]
Question Generation (QG) is the task of generating a plausible question for a <passage, answer> pair.
We make use of freely available news summary data, transforming declarative sentences into appropriate questions using dependency parsing, named entity recognition and semantic role labeling.
The resulting questions are then combined with the original news articles to train an end-to-end neural QG model.
arXiv Detail & Related papers (2021-09-16T13:08:43Z)
- QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering [122.84513233992422]
We propose a new model, QA-GNN, which addresses the problem of answering questions using knowledge from pre-trained language models (LMs) and knowledge graphs (KGs).
We show its improvement over existing LM and LM+KG models, as well as its capability to perform interpretable and structured reasoning.
arXiv Detail & Related papers (2021-04-13T17:32:51Z)
- What Gives the Answer Away? Question Answering Bias Analysis on Video QA Datasets [40.64071905569975]
Question answering biases in video QA datasets can mislead multimodal models into overfitting to QA artifacts.
Our study shows that biases can come from annotators and from the types of questions.
We also show empirically that using annotator-non-overlapping train-test splits can reduce QA biases for video QA datasets.
arXiv Detail & Related papers (2020-07-07T17:00:11Z)
- Harvesting and Refining Question-Answer Pairs for Unsupervised QA [95.9105154311491]
We introduce two approaches to improve unsupervised Question Answering (QA).
First, we harvest lexically and syntactically divergent questions from Wikipedia to automatically construct a corpus of question-answer pairs (named RefQA).
Second, we leverage the QA model to extract more appropriate answers, iteratively refining the data in RefQA.
arXiv Detail & Related papers (2020-05-06T15:56:06Z)
- Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering [98.48363619128108]
We propose an unsupervised approach to training QA models with generated pseudo-training data.
We show that generating questions for QA training by applying a simple template to a related, retrieved sentence, rather than to the original context sentence, improves downstream QA performance; a toy illustration of such template-based generation appears after this list.
arXiv Detail & Related papers (2020-04-24T17:57:45Z)
- SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions [66.86887670416193]
We show that state-of-the-art VQA models have comparable performance in answering perception and reasoning questions, but suffer from consistency problems.
To address this shortcoming, we propose an approach called Sub-Question-aware Network Tuning (SQuINT).
We show that SQuINT improves model consistency by 5% and marginally improves performance on the Reasoning questions in VQA, while also producing better attention maps.
arXiv Detail & Related papers (2020-01-20T01:02:36Z)
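As referenced in the Template-Based Question Generation entry above, here is a toy, hedged sketch of template-based question generation from a retrieved sentence. The corpus, retrieval heuristic, and wh-template below are invented for illustration and are not that paper's actual templates or retrieval pipeline.

```python
import re

# Toy illustration of template-based question generation from a retrieved
# sentence: pick an answer span, retrieve a related sentence containing it,
# and turn the sentence into a question with a simple "wh-" template.

corpus = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "The Eiffel Tower was completed in 1889.",
]

def retrieve(answer, sentences):
    """Return the first sentence mentioning the answer span (toy retrieval)."""
    return next((s for s in sentences if answer in s), None)

def generate_question(answer, sentence):
    """Cloze-to-wh template: blank out the answer span and prepend a wh-word."""
    cloze = sentence.replace(answer, "___").rstrip(".")
    wh = "When" if re.fullmatch(r"\d{4}", answer) else "What"
    return f"{wh}: {cloze}?"

answer = "1889"
sentence = retrieve(answer, corpus)
if sentence:
    print(generate_question(answer, sentence))
    # -> When: The Eiffel Tower was completed in ___?
```

Question-answer pairs produced this way can then serve as pseudo-training data for a QA model, which is the unsupervised setting that entry describes.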