Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions
- URL: http://arxiv.org/abs/2204.05212v2
- Date: Wed, 13 Apr 2022 13:46:13 GMT
- Title: Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions
- Authors: Alicia Parrish and Harsh Trivedi and Ethan Perez and Angelica Chen and
Nikita Nangia and Jason Phang and Samuel R. Bowman
- Abstract summary: We build a dataset of single arguments for both a correct and incorrect answer option in a debate-style set-up.
We use long contexts -- humans familiar with the context write convincing explanations for pre-selected correct and incorrect answers.
We test if those explanations allow humans who have not read the full context to more accurately determine the correct answer.
- Score: 29.932543276414602
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current QA systems can generate reasonable-sounding yet false answers without
explanation or evidence for the generated answer, which is especially
problematic when humans cannot readily check the model's answers. This presents
a challenge for building trust in machine learning systems. We take inspiration
from real-world situations where difficult questions are answered by
considering opposing sides (see Irving et al., 2018). For multiple-choice QA
examples, we build a dataset of single arguments for both a correct and
incorrect answer option in a debate-style set-up as an initial step in training
models to produce explanations for two candidate answers. We use long contexts
-- humans familiar with the context write convincing explanations for
pre-selected correct and incorrect answers, and we test if those explanations
allow humans who have not read the full context to more accurately determine
the correct answer. We do not find that explanations in our set-up improve
human accuracy, but a baseline condition shows that providing human-selected
text snippets does improve accuracy. We use these findings to suggest ways of
improving the debate set-up for future data collection efforts.
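To make the judging condition concrete, here is a minimal, hypothetical sketch (not the authors' released code or data format) of how a single-turn debate example and the judge-accuracy comparison described above could be represented; all field and function names are illustrative assumptions.

```python
# Hypothetical sketch of the single-turn debate judging setup described in the
# abstract: each multiple-choice QA example carries one argument for the correct
# answer and one for a pre-selected incorrect answer, and a judge who has not
# read the full context picks an option. Field names are illustrative, not the
# paper's actual schema.
from dataclasses import dataclass
from typing import List


@dataclass
class DebateExample:
    question: str
    options: List[str]           # candidate answers shown to the judge
    correct_index: int           # index of the gold answer (hidden from the judge)
    argument_for_correct: str    # writer's case for the correct option
    argument_for_incorrect: str  # writer's case for the chosen distractor


def judge_accuracy(examples: List[DebateExample], judged_indices: List[int]) -> float:
    """Fraction of examples where the judge picked the gold answer."""
    assert len(examples) == len(judged_indices)
    if not examples:
        return 0.0
    correct = sum(
        1 for ex, pick in zip(examples, judged_indices) if pick == ex.correct_index
    )
    return correct / len(examples)


if __name__ == "__main__":
    toy = [
        DebateExample(
            question="Why did the narrator leave the city?",
            options=["To find work", "To escape a scandal"],
            correct_index=1,
            argument_for_correct="A letter alludes to gossip that made staying impossible.",
            argument_for_incorrect="The narrator repeatedly complains about being unemployed.",
        )
    ]
    print(judge_accuracy(toy, judged_indices=[1]))  # -> 1.0
```

Comparing this metric across conditions (arguments vs. human-selected text snippets vs. no evidence) mirrors the evaluation described in the abstract.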
Related papers
- Localizing and Mitigating Errors in Long-form Question Answering [79.63372684264921]
Long-form question answering (LFQA) aims to provide thorough and in-depth answers to complex questions, enhancing comprehension.
This work introduces HaluQuestQA, the first hallucination dataset with localized error annotations for human-written and model-generated LFQA answers.
arXiv Detail & Related papers (2024-07-16T17:23:16Z)
- Overinformative Question Answering by Humans and Machines [26.31070412632125]
We show that overinformativeness in human answering is driven by considerations of relevance to the questioner's goals.
We show that GPT-3 is highly sensitive to the form of the prompt and only exhibits human-like answer patterns when guided by an example and a cognitively-motivated explanation.
arXiv Detail & Related papers (2023-05-11T21:41:41Z)
- CREPE: Open-Domain Question Answering with False Presuppositions [92.20501870319765]
We introduce CREPE, a QA dataset containing a natural distribution of presupposition failures from online information-seeking forums.
We find that 25% of questions contain false presuppositions, and provide annotations for these presuppositions and their corrections.
We show that adaptations of existing open-domain QA models can find presuppositions moderately well, but struggle when predicting whether a presupposition is factually correct.
arXiv Detail & Related papers (2022-11-30T18:54:49Z)
- WikiWhy: Answering and Explaining Cause-and-Effect Questions [62.60993594814305]
We introduce WikiWhy, a QA dataset built around explaining why an answer is true in natural language.
WikiWhy contains over 9,000 "why" question-answer-rationale triples, grounded on Wikipedia facts across a diverse set of topics.
GPT-3 baselines achieve only 38.7% human-evaluated correctness in the end-to-end answer & explain condition.
arXiv Detail & Related papers (2022-10-21T17:59:03Z)
- Two-Turn Debate Doesn't Help Humans Answer Hard Reading Comprehension Questions [26.404441861051875]
We assess whether presenting humans with arguments for two competing answer options allows human judges to perform more accurately.
Previous research has shown that just a single turn of arguments in this format is not helpful to humans.
We find that, regardless of whether they have access to arguments or not, humans perform similarly on our task.
arXiv Detail & Related papers (2022-10-19T19:48:50Z)
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that prompting models to generate lectures and explanations as a chain of thought improves question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from less data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
- Ranking Facts for Explaining Answers to Elementary Science Questions [1.4091801425319965]
In elementary science exams, students select one answer from among typically four choices and can explain why they made that particular choice.
We consider the novel task of generating explanations for answers from human-authored facts.
Explanations are created from a human-annotated set of nearly 5,000 candidate facts in the WorldTree corpus.
arXiv Detail & Related papers (2021-10-18T06:15:11Z)
- Prompting Contrastive Explanations for Commonsense Reasoning Tasks [74.7346558082693]
Large pretrained language models (PLMs) can achieve near-human performance on commonsense reasoning tasks.
We show how to use these same models to generate human-interpretable evidence.
arXiv Detail & Related papers (2021-06-12T17:06:13Z)
- Generative Context Pair Selection for Multi-hop Question Answering [60.74354009152721]
We propose a generative context selection model for multi-hop question answering.
Our proposed generative passage selection model achieves better performance (4.9% higher than the baseline) on the adversarial held-out set.
arXiv Detail & Related papers (2021-04-18T07:00:48Z)
- Challenges in Information-Seeking QA: Unanswerable Questions and Paragraph Retrieval [46.3246135936476]
We analyze why answering information-seeking queries is more challenging and where the prevalent cases of unanswerability arise.
Our controlled experiments suggest two headrooms -- paragraph selection and answerability prediction.
We manually annotate 800 unanswerable examples across six languages on what makes them challenging to answer.
arXiv Detail & Related papers (2020-10-22T17:48:17Z)
- QED: A Framework and Dataset for Explanations in Question Answering [27.85923397716627]
We release an expert-annotated dataset of QED explanations built upon a subset of the Google Natural Questions dataset.
A promising result suggests that training on a relatively small amount of QED data can improve question answering.
arXiv Detail & Related papers (2020-09-08T23:34:18Z)