WikiWhy: Answering and Explaining Cause-and-Effect Questions
- URL: http://arxiv.org/abs/2210.12152v1
- Date: Fri, 21 Oct 2022 17:59:03 GMT
- Title: WikiWhy: Answering and Explaining Cause-and-Effect Questions
- Authors: Matthew Ho, Aditya Sharma, Justin Chang, Michael Saxon, Sharon Levy,
Yujie Lu, William Yang Wang
- Abstract summary: We introduce WikiWhy, a QA dataset built around explaining why an answer is true in natural language.
WikiWhy contains over 9,000 "why" question-answer-rationale triples, grounded on Wikipedia facts across a diverse set of topics.
GPT-3 baselines achieve only 38.7% human-evaluated correctness in the end-to-end answer & explain condition.
- Score: 62.60993594814305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) grow larger and more sophisticated, assessing
their "reasoning" capabilities in natural language grows more challenging.
Recent question answering (QA) benchmarks that attempt to assess reasoning are
often limited by a narrow scope of covered situations and subject matters. We
introduce WikiWhy, a QA dataset built around a novel auxiliary task: explaining
why an answer is true in natural language. WikiWhy contains over 9,000 "why"
question-answer-rationale triples, grounded on Wikipedia facts across a diverse
set of topics. Each rationale is a set of supporting statements connecting the
question to the answer. WikiWhy serves as a benchmark for the reasoning
capabilities of LLMs because it demands rigorous explicit rationales for each
answer to demonstrate the acquisition of implicit commonsense knowledge, which
is unlikely to be easily memorized. GPT-3 baselines achieve only 38.7%
human-evaluated correctness in the end-to-end answer & explain condition,
leaving significant room for future improvements.
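To make the task shape concrete, here is a minimal sketch of a question-answer-rationale triple and of a prompt for the end-to-end answer & explain condition. The schema, example content, and prompt template are illustrative assumptions, not the dataset's actual release format.

```python
from dataclasses import dataclass

@dataclass
class WikiWhyExample:
    """One "why" question-answer-rationale triple (illustrative schema,
    not the official WikiWhy release format)."""
    question: str         # a "why" question about a cause-effect relation
    answer: str           # the answer (the cause or effect asked about)
    rationale: list[str]  # ordered supporting statements linking question to answer

# A hypothetical record in the spirit of the dataset (content invented here).
example = WikiWhyExample(
    question="Why did the ancient library's collection decay?",
    answer="Because it lost the patronage that funded its upkeep.",
    rationale=[
        "The library depended on rulers' patronage for maintenance.",
        "Later rulers withdrew that support.",
        "Without upkeep, manuscripts deteriorated and were not recopied.",
    ],
)

def end_to_end_prompt(question: str) -> str:
    """Prompt for the end-to-end answer & explain condition: the model
    must produce both the answer and the supporting rationale."""
    return (
        f"Question: {question}\n"
        "Give the answer, then explain step by step why it is true.\n"
        "Answer:"
    )

print(end_to_end_prompt(example.question))
```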
Related papers
- Right for Right Reasons: Large Language Models for Verifiable Commonsense Knowledge Graph Question Answering [18.48602809114524]
Knowledge Graph Question Answering (KGQA) methods seek to answer natural language questions using the relational information stored in Knowledge Graphs (KGs).
With the recent advancements of Large Language Models (LLMs) and their remarkable reasoning abilities, there is a growing trend to leverage them for KGQA.
We propose Right for Right Reasons (R3), a commonsense KGQA methodology that allows for a verifiable reasoning procedure.
arXiv Detail & Related papers (2024-03-03T04:22:13Z)
- Alexpaca: Learning Factual Clarification Question Generation Without Examples [19.663171923249283]
We present a new task focused on the ability to elicit missing information in multi-hop reasoning settings.
Humans outperform GPT-4 by a large margin, while Llama 3 8B Instruct does not even beat the dummy baseline in some metrics.
arXiv Detail & Related papers (2023-10-17T20:40:59Z)
- RECKONING: Reasoning through Dynamic Knowledge Encoding [51.076603338764706]
We show that language models can answer questions by reasoning over knowledge provided as part of the context.
In this setting, however, the model can fail to distinguish the knowledge that is necessary to answer the question from other information in the context.
We propose teaching the model to reason more robustly by folding the provided contextual knowledge into the model's parameters.
arXiv Detail & Related papers (2023-05-10T17:54:51Z)
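A generic illustration of the idea named above, folding provided knowledge into parameters rather than leaving it in the prompt: take a few gradient steps on the facts, then answer with the adapted weights. This is a minimal sketch assuming a Hugging Face-style causal LM and invented helper names; RECKONING's actual bi-level training procedure is considerably more involved.

```python
import copy
import torch

def fold_knowledge(model, tokenizer, facts, lr=1e-4, steps=3):
    """Return a copy of `model` briefly fine-tuned on the provided facts,
    i.e., the knowledge is folded into the parameters instead of the prompt."""
    adapted = copy.deepcopy(model)  # leave the base model untouched
    optimizer = torch.optim.SGD(adapted.parameters(), lr=lr)
    adapted.train()
    for _ in range(steps):
        for fact in facts:
            batch = tokenizer(fact, return_tensors="pt")
            # Standard causal-LM objective: predict each token of the fact.
            loss = adapted(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return adapted

# Usage sketch (assumes a Hugging Face causal LM; names illustrative):
#   adapted = fold_knowledge(model, tokenizer, ["Fact one.", "Fact two."])
#   output = adapted.generate(**tokenizer("Question: ...", return_tensors="pt"))
```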
- STREET: A Multi-Task Structured Reasoning and Explanation Benchmark [56.555662318619135]
We introduce a unified multi-task and multi-domain natural language reasoning and explanation benchmark.
We expect models to not only answer questions, but also produce step-by-step structured explanations describing how premises in the question are used to produce intermediate conclusions that can prove the correctness of a certain answer.
arXiv Detail & Related papers (2023-02-13T22:34:02Z)
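The step-by-step structured explanations described above can be pictured as a small graph of reasoning steps, each citing premises (input sentences or earlier conclusions) and deriving an intermediate conclusion. The dataclass below is an illustration of that shape only, not STREET's actual annotation format.

```python
from dataclasses import dataclass

@dataclass
class ReasoningStep:
    premises: list[str]  # ids of input sentences or earlier conclusions used
    conclusion: str      # the intermediate or final conclusion derived

# A hypothetical two-step structured explanation:
explanation = [
    ReasoningStep(premises=["sent1", "sent2"], conclusion="int1: the premises jointly imply X"),
    ReasoningStep(premises=["int1", "sent3"], conclusion="answer: therefore Y"),
]
```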
- Why Did the Chicken Cross the Road? Rephrasing and Analyzing Ambiguous Questions in VQA [33.11688014628816]
Resolving ambiguous questions is key to successfully answering them.
We create a dataset of ambiguous examples, grouping answers by the underlying question they address and rephrasing the question for each group to reduce ambiguity.
We then develop an English question-generation model that, in both automatic and human evaluation, produces less ambiguous questions.
arXiv Detail & Related papers (2022-11-14T16:45:42Z)
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (ScienceQA), a new benchmark that consists of 21k multimodal multiple-choice questions spanning a diverse set of science topics, with answers annotated with corresponding lectures and explanations.
We show that generating lectures and explanations as a chain of thought (CoT) improves question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA.
Our analysis further shows that language models, like humans, benefit from explanations to learn from less data, reaching the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
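The chain-of-thought setup behind those numbers can be sketched as a few-shot prompt in which each demonstration pairs a question with an explanation and an answer. The template below is a generic illustration, not the paper's exact prompt format.

```python
def cot_prompt(demos: list[dict], question: str) -> str:
    """Few-shot chain-of-thought prompt: each demonstration shows an
    explanation alongside the answer, encouraging the model to generate
    its reasoning as well."""
    parts = [
        f"Question: {d['question']}\n"
        f"Explanation: {d['explanation']}\n"
        f"Answer: {d['answer']}\n"
        for d in demos
    ]
    parts.append(f"Question: {question}\nExplanation:")
    return "\n".join(parts)
```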
- Reasoning over Logically Interacted Conditions for Question Answering [113.9231035680578]
We study a more challenging task where answers are constrained by a list of conditions that logically interact.
We propose a new model, TReasoner, for this challenging reasoning task.
TReasoner achieves state-of-the-art performance on two benchmark conditional QA datasets.
arXiv Detail & Related papers (2022-05-25T16:41:39Z)
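To see what "logically interacting conditions" above means in practice, a toy evaluator helps: an answer applies only when its condition tree, combining atomic conditions with and/or, is satisfied by the user's scenario. This is a rendering of the task setting only, not of TReasoner itself.

```python
def conditions_hold(node, scenario):
    """Evaluate a nested condition tree against a set of known facts.

    `node` is either a string (an atomic condition, satisfied when it
    appears in `scenario`) or a tuple ("and" | "or", [child_nodes]).
    """
    if isinstance(node, str):
        return node in scenario
    op, children = node
    results = [conditions_hold(child, scenario) for child in children]
    return all(results) if op == "and" else any(results)

# The answer applies if the user is a resident AND (employed OR a student).
rule = ("and", ["resident", ("or", ["employed", "student"])])
print(conditions_hold(rule, {"resident", "student"}))  # True
```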
- Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions [29.932543276414602]
We build a dataset of single arguments for both a correct and incorrect answer option in a debate-style set-up.
We use long contexts: humans familiar with the full context write convincing explanations for pre-selected correct and incorrect answers.
We test if those explanations allow humans who have not read the full context to more accurately determine the correct answer.
arXiv Detail & Related papers (2022-04-11T15:56:34Z)
- How Do We Answer Complex Questions: Discourse Structure of Long-form Answers [51.973363804064704]
We study the functional structure of long-form answers collected from three datasets.
Our main goal is to understand how humans organize information to craft complex answers.
Our work can inspire future research on discourse-level modeling and evaluation of long-form QA systems.
arXiv Detail & Related papers (2022-03-21T15:14:10Z)
- QED: A Framework and Dataset for Explanations in Question Answering [27.85923397716627]
We release an expert-annotated dataset of QED explanations built upon a subset of the Google Natural Questions dataset.
A promising result suggests that training on a relatively small amount of QED data can improve question answering.
arXiv Detail & Related papers (2020-09-08T23:34:18Z)
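The QED paper defines an explanation as a breakdown into discrete, linguistically informed steps: selecting the sentence that supports the answer, marking referential equalities between question phrases and passage phrases, and a final entailment step. The dictionary below loosely illustrates that decomposition with invented content; it is not the dataset's actual annotation format.

```python
# Illustrative (invented) QED-style explanation for one QA pair:
qed_explanation = {
    "question": "who founded the company that makes the iphone",
    "selected_sentence": "Apple was founded by Steve Jobs, Steve Wozniak, "
                         "and Ronald Wayne in 1976.",
    "referential_equalities": [
        # (phrase in the question, coreferent phrase in the passage)
        ("the company that makes the iphone", "Apple"),
    ],
    "answer": "Steve Jobs, Steve Wozniak, and Ronald Wayne",
}
```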
This list is automatically generated from the titles and abstracts of the papers on this site.