MoreHopQA: More Than Multi-hop Reasoning
- URL: http://arxiv.org/abs/2406.13397v1
- Date: Wed, 19 Jun 2024 09:38:59 GMT
- Title: MoreHopQA: More Than Multi-hop Reasoning
- Authors: Julian Schnitzler, Xanh Ho, Jiahao Huang, Florian Boudin, Saku Sugawara, Akiko Aizawa
- Abstract summary: We propose a new multi-hop dataset, MoreHopQA, which shifts from extractive to generative answers.
Our dataset is created by utilizing three existing multi-hop datasets: HotpotQA, 2WikiMultihopQA, and MuSiQue.
Our results show that models perform well on initial multi-hop questions but struggle with our extended questions.
- Score: 32.94332511203639
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most existing multi-hop datasets are extractive answer datasets, where the answers to the questions can be extracted directly from the provided context. This often leads models to use heuristics or shortcuts instead of performing true multi-hop reasoning. In this paper, we propose a new multi-hop dataset, MoreHopQA, which shifts from extractive to generative answers. Our dataset is created by utilizing three existing multi-hop datasets: HotpotQA, 2WikiMultihopQA, and MuSiQue. Instead of relying solely on factual reasoning, we enhance the existing multi-hop questions by adding another layer of questioning that involves one, two, or all three of the following types of reasoning: commonsense, arithmetic, and symbolic. Our dataset is created through a semi-automated process, resulting in a dataset with 1,118 samples that have undergone human verification. We then use our dataset to evaluate five different large language models: Mistral 7B, Gemma 7B, Llama 3 (8B and 70B), and GPT-4. We also design various cases to analyze the reasoning steps in the question-answering process. Our results show that models perform well on initial multi-hop questions but struggle with our extended questions, indicating that our dataset is more challenging than previous ones. Our analysis of question decomposition reveals that although models can correctly answer questions, only a portion - 38.7% for GPT-4 and 33.4% for Llama3-70B - achieve perfect reasoning, where all corresponding sub-questions are answered correctly. Evaluation code and data are available at https://github.com/Alab-NII/morehopqa
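The abstract describes a "perfect reasoning" analysis in which a model must answer not only the final question but every sub-question of its decomposition correctly (38.7% for GPT-4, 33.4% for Llama3-70B). The sketch below shows one way such a rate could be computed; the field names (`id`, `answer`, `sub_questions`, `sub_answers`) are illustrative assumptions, not the actual schema of the released data, which is available in the linked GitHub repository.

```python
def normalize(text: str) -> str:
    """Crude exact-match normalization: lowercase and strip whitespace."""
    return text.strip().lower()

def perfect_reasoning_rate(samples, predictions) -> float:
    """Fraction of samples where the final answer AND every sub-question
    are answered correctly -- the 'perfect reasoning' notion from the abstract."""
    perfect = 0
    for sample in samples:
        pred = predictions[sample["id"]]
        final_ok = normalize(pred["answer"]) == normalize(sample["answer"])
        subs_ok = all(
            normalize(p) == normalize(sq["answer"])
            for p, sq in zip(pred["sub_answers"], sample["sub_questions"])
        )
        perfect += int(final_ok and subs_ok)
    return perfect / len(samples)

# Hypothetical toy example (not real MoreHopQA data):
samples = [{
    "id": "ex1",
    "answer": "7",
    "sub_questions": [{"answer": "Paris"}, {"answer": "7"}],
}]
predictions = {"ex1": {"answer": "7", "sub_answers": ["Paris", "7"]}}
print(perfect_reasoning_rate(samples, predictions))  # 1.0
```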
Related papers
- FanOutQA: A Multi-Hop, Multi-Document Question Answering Benchmark for Large Language Models [37.34801677290571]
FanOutQA is a high-quality dataset of fan-out question-answer pairs and human-annotated decompositions, with English Wikipedia as the knowledge base.
We formulate three benchmark settings across our dataset and benchmark 7 LLMs, including GPT-4, LLaMA 2, Claude-2.1, and Mixtral-8x7B.
arXiv Detail & Related papers (2024-02-21T20:30:45Z)
- How Well Do Multi-hop Reading Comprehension Models Understand Date Information? [31.243088887839257]
The ability of multi-hop models to perform step-by-step reasoning when finding an answer to a comparison question remains unclear.
It is also unclear how questions about the internal reasoning process are useful for training and evaluating question-answering (QA) systems.
arXiv Detail & Related papers (2022-10-11T07:24:07Z)
- Understanding and Improving Zero-shot Multi-hop Reasoning in Generative Question Answering [85.79940770146557]
We decompose multi-hop questions into multiple corresponding single-hop questions.
We find marked inconsistency in QA models' answers on these pairs of ostensibly identical question chains.
When trained only on single-hop questions, models generalize poorly to multi-hop questions.
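A minimal sketch of the kind of consistency check this paper describes: answer the multi-hop question directly, then follow its single-hop chain and compare the two final answers. `answer_fn` stands in for any QA model, and the hop templates and example questions are hypothetical, not the paper's code.

```python
from typing import Callable

def chain_consistent(answer_fn: Callable[[str], str],
                     multi_hop_q: str,
                     hop1_q: str,
                     hop2_template: str) -> bool:
    """True when answering hop-by-hop and answering the multi-hop question
    directly produce the same final answer."""
    direct = answer_fn(multi_hop_q)
    bridge = answer_fn(hop1_q)                          # first single-hop answer
    chained = answer_fn(hop2_template.format(bridge))   # second hop filled with the bridge entity
    return direct.strip().lower() == chained.strip().lower()

# Hypothetical usage with a toy lookup table in place of a real QA model:
toy = {
    "Where was the director of Inception born?": "London",
    "Who directed Inception?": "Christopher Nolan",
    "Where was Christopher Nolan born?": "London",
}
print(chain_consistent(toy.get,
                       "Where was the director of Inception born?",
                       "Who directed Inception?",
                       "Where was {} born?"))  # True
```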
arXiv Detail & Related papers (2022-10-09T11:48:07Z)
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that generating lectures and explanations as a chain of thought (CoT) improves the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
- Modeling Multi-hop Question Answering as Single Sequence Prediction [88.72621430714985]
We propose a simple generative approach (PathFid) that extends the task beyond just answer generation.
PathFid explicitly models the reasoning process to resolve the answer for multi-hop questions.
Our experiments demonstrate that PathFid leads to strong performance gains on two multi-hop QA datasets.
arXiv Detail & Related papers (2022-05-18T21:57:59Z)
- MuSiQue: Multi-hop Questions via Single-hop Question Composition [36.84063888323547]
Constructing multi-hop questions as compositions of single-hop questions allows us to exercise greater control over the quality of the resulting multi-hop questions.
We use this process to construct a new multi-hop QA dataset, MuSiQue-Ans, with 25K 2-4 hop questions built from seed questions drawn from five existing single-hop datasets.
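A rough illustration of the composition idea (not the MuSiQue pipeline itself, which adds filtering and human paraphrasing): two single-hop QA pairs are connected through a bridge entity, with the second question's mention of that entity replaced by the first question. The seed questions below are made up for illustration.

```python
def compose_two_hop(hop1: dict, hop2: dict):
    """Compose two single-hop QA pairs into a 2-hop question by substituting
    the first question for its answer (the bridge entity) in the second
    question. Returns None when the two hops share no bridge entity."""
    bridge = hop1["answer"]
    if bridge not in hop2["question"]:
        return None  # no bridge entity connecting the two hops
    composed = hop2["question"].replace(bridge, "[" + hop1["question"].rstrip("?") + "]")
    return {"question": composed,
            "answer": hop2["answer"],
            "decomposition": [hop1, hop2]}

# Hypothetical single-hop seeds (not from the actual seed datasets):
hop1 = {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"}
hop2 = {"question": "Where was William Shakespeare born?", "answer": "Stratford-upon-Avon"}
print(compose_two_hop(hop1, hop2)["question"])
# Where was [Who wrote Hamlet] born?  -> answer: Stratford-upon-Avon
```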
arXiv Detail & Related papers (2021-08-02T00:33:27Z)
- Question-Aware Memory Network for Multi-hop Question Answering in Human-Robot Interaction [5.49601869466872]
We propose a question-aware memory network for multi-hop question answering, named QA2MN, which dynamically updates the attention over the question during the reasoning process.
We evaluate QA2MN on PathQuestion and WorldCup2014, two representative datasets for complex multi-hop question answering.
arXiv Detail & Related papers (2021-04-27T13:32:41Z)
- Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps [31.472490306390977]
A multi-hop question answering dataset aims to test reasoning and inference skills by requiring a model to read multiple paragraphs to answer a given question.
Previous studies revealed that many examples in existing multi-hop datasets do not require multi-hop reasoning to answer a question.
We present a new multi-hop QA dataset, called 2WikiMultiHopQA, which uses structured and unstructured data.
arXiv Detail & Related papers (2020-11-02T15:42:40Z)
- Unsupervised Multi-hop Question Answering by Question Generation [108.61653629883753]
MQA-QG is an unsupervised framework that can generate human-like multi-hop training data.
Using only generated training data, we can train a competent multi-hop QA model that achieves 61% and 83% of the supervised learning performance.
arXiv Detail & Related papers (2020-10-23T19:13:47Z)
- Multi-hop Question Generation with Graph Convolutional Network [58.31752179830959]
Multi-hop Question Generation (QG) aims to generate answer-related questions by aggregating and reasoning over multiple scattered evidence from different paragraphs.
We propose the Multi-hop Encoding Fusion Network for Question Generation (MulQG), which does context encoding in multiple hops.
Our proposed model is able to generate fluent questions with high completeness and outperforms the strongest baseline by 20.8% in the multi-hop evaluation.
arXiv Detail & Related papers (2020-10-19T06:15:36Z)
- Is Multihop QA in DiRe Condition? Measuring and Reducing Disconnected Reasoning [50.114651561111245]
Models often exploit dataset artifacts to produce correct answers, without connecting information across multiple supporting facts.
We formalize such undesirable behavior as disconnected reasoning across subsets of supporting facts.
Experiments suggest that there hasn't been much progress in multi-hop QA in the reading comprehension setting.
arXiv Detail & Related papers (2020-05-02T11:01:07Z)
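A simplified way to probe for the behavior this paper formalizes (not the paper's exact DiRe probe, which scores answer and supporting-fact predictions jointly over partitions of the facts): if a model still answers correctly while one of the supporting facts is withheld, it never needed to connect all of them. `answer_fn` is a placeholder for any reading-comprehension model that takes a question and a list of context facts.

```python
def answers_without_all_facts(answer_fn, question: str,
                              supporting_facts: list[str], gold: str) -> bool:
    """Simplified disconnected-reasoning probe: return True when the model
    answers correctly even though one supporting fact has been removed,
    i.e. it did not have to connect information across all the facts."""
    norm = lambda s: s.strip().lower()
    for i in range(len(supporting_facts)):
        reduced = supporting_facts[:i] + supporting_facts[i + 1:]  # drop fact i
        if norm(answer_fn(question, reduced)) == norm(gold):
            return True
    return False
```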