Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit
Reasoning Strategies
- URL: http://arxiv.org/abs/2101.02235v1
- Date: Wed, 6 Jan 2021 19:14:23 GMT
- Title: Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit
Reasoning Strategies
- Authors: Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, Jonathan
Berant
- Abstract summary: StrategyQA is a benchmark where the required reasoning steps are implicit in the question, and should be inferred using a strategy.
We propose a data collection procedure that combines term-based priming to inspire annotators, careful control over the annotator population, and adversarial filtering for eliminating reasoning shortcuts.
Overall, StrategyQA includes 2,780 examples, each consisting of a strategy question, its decomposition, and evidence paragraphs.
- Score: 78.68534915690404
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: A key limitation in current datasets for multi-hop reasoning is that the
required steps for answering the question are mentioned in it explicitly. In
this work, we introduce StrategyQA, a question answering (QA) benchmark where
the required reasoning steps are implicit in the question, and should be
inferred using a strategy. A fundamental challenge in this setup is how to
elicit such creative questions from crowdsourcing workers, while covering a
broad range of potential strategies. We propose a data collection procedure
that combines term-based priming to inspire annotators, careful control over
the annotator population, and adversarial filtering for eliminating reasoning
shortcuts. Moreover, we annotate each question with (1) a decomposition into
reasoning steps for answering it, and (2) Wikipedia paragraphs that contain the
answers to each step. Overall, StrategyQA includes 2,780 examples, each
consisting of a strategy question, its decomposition, and evidence paragraphs.
Analysis shows that questions in StrategyQA are short, topic-diverse, and cover
a wide range of strategies. Empirically, we show that humans perform well (87%)
on this task, while our best baseline reaches an accuracy of $\sim$66%.
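To make the dataset format concrete, here is a minimal Python sketch of one example as the abstract describes it: a yes/no strategy question, its decomposition into reasoning steps, and Wikipedia evidence paragraphs for each step. The field names and the sample content are illustrative assumptions, not the official release schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class StrategyQAExample:
    """One StrategyQA-style example: a yes/no strategy question, its
    reasoning decomposition, and evidence paragraphs (illustrative schema)."""
    question: str                 # strategy question with implicit reasoning steps
    answer: bool                  # yes/no label
    decomposition: List[str] = field(default_factory=list)   # reasoning steps
    evidence: List[List[str]] = field(default_factory=list)  # paragraphs per step

# Hypothetical example mirroring the title question of the paper.
example = StrategyQAExample(
    question="Did Aristotle use a laptop?",
    answer=False,
    decomposition=[
        "When did Aristotle live?",
        "When was the laptop invented?",
        "Did #2 happen before or during #1?",
    ],
    evidence=[
        ["Aristotle (384-322 BC) was an Ancient Greek philosopher ..."],
        ["Laptop computers first became widely available in the 1980s ..."],
        ["(combination of the two facts above)"],
    ],
)
```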
Related papers
- Question-Analysis Prompting Improves LLM Performance in Reasoning Tasks [3.741953084205603]
We propose a novel prompting strategy called Question Analysis Prompting (QAP).
QAP is evaluated with GPT-3.5 Turbo and GPT-4 Turbo on the arithmetic datasets GSM8K, AQuA, and SAT, and on the commonsense dataset StrategyQA.
QAP consistently ranks among the top-2 prompts on 75% of the tests.
arXiv Detail & Related papers (2024-07-04T04:19:50Z)
- Paths to Equilibrium in Games [6.812247730094933]
We study sequences of strategies satisfying a pairwise constraint inspired by policy updating in reinforcement learning.
Our analysis reveals a counterintuitive insight: reward-deteriorating strategic updates are key to driving play to equilibrium along a satisficing path.
arXiv Detail & Related papers (2024-03-26T19:58:39Z)
- DCR: Divide-and-Conquer Reasoning for Multi-choice Question Answering with LLMs [9.561022942046279]
We propose Divide-and-Conquer Reasoning (DCR) to enhance the reasoning capability of large language models (LLMs).
In particular, we first categorize questions into two subsets based on a confidence score ($\mathcal{CS}$), which is estimated from the statistical frequency of generated answers.
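As a rough illustration of this idea, the sketch below estimates a frequency-based confidence score from several sampled answers and splits questions into two subsets by a threshold; the threshold value, function names, and data layout are assumptions made for illustration, not details taken from the DCR paper.

```python
from collections import Counter
from typing import Dict, List, Tuple

def confidence_score(sampled_answers: List[str]) -> float:
    """Confidence as the relative frequency of the most common answer
    among several sampled generations (one reading of the CS estimate)."""
    counts = Counter(sampled_answers)
    return counts.most_common(1)[0][1] / len(sampled_answers)

def split_by_confidence(answers_per_question: Dict[str, List[str]],
                        threshold: float = 0.7) -> Tuple[Dict, Dict]:
    """Partition questions into high- and low-confidence subsets.
    The 0.7 threshold is an arbitrary illustrative choice."""
    high, low = {}, {}
    for question, answers in answers_per_question.items():
        bucket = high if confidence_score(answers) >= threshold else low
        bucket[question] = answers
    return high, low
```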
arXiv Detail & Related papers (2024-01-10T14:38:46Z)
- StrategyLLM: Large Language Models as Strategy Generators, Executors, Optimizers, and Evaluators for Problem Solving [76.5322280307861]
StrategyLLM allows LLMs to perform inductive reasoning, deriving general strategies from specific task instances, and deductive reasoning, applying these general strategies to particular task examples, for constructing generalizable and consistent few-shot prompts.
Experimental results demonstrate that StrategyLLM outperforms the competitive baseline CoT-SC, which requires human-annotated solutions, on 13 datasets across 4 challenging tasks without human involvement, including math reasoning (34.2% $\rightarrow$ 38.8%), commonsense reasoning (70.3% $\rightarrow$ 72.5%), and algorithmic reasoning (73.7% $\rightarrow$ 85.0%).
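A minimal sketch of that inductive-then-deductive flow, assuming a generic llm(prompt) text-completion call; the prompt wording and helper names are guesses for illustration, not the paper's actual templates.

```python
from typing import List

def llm(prompt: str) -> str:
    """Placeholder for any text-completion model call (hypothetical)."""
    raise NotImplementedError

def induce_strategy(task_description: str, solved_examples: List[str]) -> str:
    # Inductive step: derive a general strategy from specific task instances.
    prompt = (
        f"Task: {task_description}\n"
        + "\n".join(solved_examples)
        + "\nWrite a general, step-by-step strategy for solving this task."
    )
    return llm(prompt)

def build_few_shot_prompt(strategy: str, examples: List[str], new_question: str) -> str:
    # Deductive step: apply the general strategy to particular examples,
    # producing a consistent few-shot prompt for a new question.
    demos = "\n\n".join(
        llm(f"Strategy:\n{strategy}\n\nSolve step by step:\n{ex}") for ex in examples
    )
    return f"Strategy:\n{strategy}\n\n{demos}\n\nQuestion: {new_question}\nAnswer:"
```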
arXiv Detail & Related papers (2023-11-15T09:18:09Z)
- On the Evaluation of Answer-Agnostic Paragraph-level Multi-Question Generation [57.630606799713526]
We study the task of predicting a set of salient questions from a given paragraph without any prior knowledge of the precise answer.
First, we propose a new method to evaluate a set of predicted questions against the set of references by using the Hungarian algorithm to assign predicted questions to references before scoring the assigned pairs.
Second, we compare different strategies to utilize a pre-trained seq2seq model to generate and select a set of questions related to a given paragraph.
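The Hungarian-assignment step in the first contribution can be sketched with SciPy's linear_sum_assignment; the token-overlap F1 used here is a stand-in similarity, not necessarily the scoring function the paper uses.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from typing import List

def token_f1(pred: str, ref: str) -> float:
    """Simple token-overlap F1; a stand-in similarity for illustration."""
    p, r = pred.lower().split(), ref.lower().split()
    common = len(set(p) & set(r))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)

def assigned_score(predicted: List[str], references: List[str]) -> float:
    """Assign each predicted question to a reference with the Hungarian
    algorithm (maximizing total similarity), then average the pair scores."""
    sim = np.array([[token_f1(p, r) for r in references] for p in predicted])
    rows, cols = linear_sum_assignment(-sim)  # negate to maximize similarity
    return float(sim[rows, cols].mean())
```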
arXiv Detail & Related papers (2022-03-09T00:55:54Z)
- AnswerSumm: A Manually-Curated Dataset and Pipeline for Answer Summarization [73.91543616777064]
Community Question Answering (CQA) fora such as Stack Overflow and Yahoo! Answers contain a rich resource of answers to a wide range of community-based questions.
One goal of answer summarization is to produce a summary that reflects the range of answer perspectives.
This work introduces a novel dataset of 4,631 CQA threads for answer summarization, curated by professional linguists.
arXiv Detail & Related papers (2021-11-11T21:48:02Z)
- Adaptive Information Seeking for Open-Domain Question Answering [61.39330982757494]
We propose a novel adaptive information-seeking strategy for open-domain question answering, namely AISO.
According to the learned policy, AISO could adaptively select a proper retrieval action to seek the missing evidence at each step.
AISO outperforms all baseline methods with predefined strategies in terms of both retrieval and answer evaluations.
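A schematic of that kind of adaptive action selection is sketched below; the action names, the policy and retrievers objects, and the step limit are hypothetical placeholders rather than AISO's actual interface.

```python
from typing import Callable, Dict, List

def adaptive_answering(question: str,
                       policy,
                       retrievers: Dict[str, Callable],
                       reader: Callable,
                       max_steps: int = 5) -> str:
    """At each step a learned policy picks a retrieval action (or decides to
    stop and answer) based on the question and the evidence gathered so far."""
    evidence: List[str] = []
    for _ in range(max_steps):
        action = policy.select_action(question, evidence)  # e.g. "sparse", "dense", "answer"
        if action == "answer":
            break
        evidence.extend(retrievers[action](question, evidence))
    return reader(question, evidence)
```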
arXiv Detail & Related papers (2021-09-14T15:08:13Z)
- Coarse-grained decomposition and fine-grained interaction for multi-hop question answering [5.88731657602706]
Many complex queries require multi-hop reasoning.
Bi-DAF generally captures only the surface semantics of words in complex questions.
We propose a new model architecture for multi-hop question answering.
arXiv Detail & Related papers (2021-01-15T06:56:34Z)
- Open Question Answering over Tables and Text [55.8412170633547]
In open question answering (QA), the answer to a question is produced by retrieving and then analyzing documents that might contain answers to the question.
Most open QA systems have considered only retrieving information from unstructured text.
We present a new large-scale dataset Open Table-and-Text Question Answering (OTT-QA) to evaluate performance on this task.
arXiv Detail & Related papers (2020-10-20T16:48:14Z)