Teaching language models to support answers with verified quotes
- URL: http://arxiv.org/abs/2203.11147v1
- Date: Mon, 21 Mar 2022 17:26:29 GMT
- Title: Teaching language models to support answers with verified quotes
- Authors: Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis
Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham,
Geoffrey Irving, Nat McAleese
- Abstract summary: We train "open-book" QA models that generate answers whilst also citing specific evidence for their claims.
Our 280 billion parameter model, GopherCite, is able to produce answers with high quality supporting evidence and abstain from answering when unsure.
- Score: 12.296242080730831
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent large language models often answer factual questions correctly. But
users can't trust any given claim a model makes without fact-checking, because
language models can hallucinate convincing nonsense. In this work we use
reinforcement learning from human preferences (RLHP) to train "open-book" QA
models that generate answers whilst also citing specific evidence for their
claims, which aids in the appraisal of correctness. Supporting evidence is
drawn from multiple documents found via a search engine, or from a single
user-provided document. Our 280 billion parameter model, GopherCite, is able to
produce answers with high quality supporting evidence and abstain from
answering when unsure. We measure the performance of GopherCite by conducting
human evaluation of answers to questions in a subset of the NaturalQuestions
and ELI5 datasets. The model's response is found to be high-quality 80% of the
time on this Natural Questions subset, and 67% of the time on the ELI5 subset.
Abstaining from the third of questions for which it is most unsure improves
performance to 90% and 80% respectively, approaching human baselines.
However, analysis on the adversarial TruthfulQA dataset shows why citation is
only one part of an overall strategy for safety and trustworthiness: not all
claims supported by evidence are true.
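To make the abstention behaviour described above concrete, here is a minimal sketch of reranking sampled answer-and-quote candidates with a learned reward model and declining to answer when the best score falls below a threshold. The `sample_candidates` and `reward_model` callables and the `Candidate` fields are illustrative placeholders for this sketch, not GopherCite's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    answer: str  # free-form answer text
    quote: str   # verbatim evidence quoted from a retrieved or user-provided document


def answer_or_abstain(
    question: str,
    sample_candidates: Callable[[str, int], List[Candidate]],  # placeholder generator wrapper
    reward_model: Callable[[str, Candidate], float],           # placeholder preference scorer
    threshold: float,
    num_samples: int = 16,
) -> Optional[Candidate]:
    """Rerank sampled (answer, quote) pairs and decline to answer when confidence is low."""
    candidates = sample_candidates(question, num_samples)
    scored = [(reward_model(question, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    if best_score < threshold:
        return None  # abstain ("I don't know") rather than risk an unsupported claim
    return best
```

In such a setup the threshold would be tuned on held-out questions so that the model abstains on a chosen fraction of them, for example the least-confident third mentioned above.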
Related papers
- Localizing and Mitigating Errors in Long-form Question Answering [79.63372684264921]
Long-form question answering (LFQA) aims to provide thorough and in-depth answers to complex questions, enhancing comprehension.
This work introduces HaluQuestQA, the first hallucination dataset with localized error annotations for human-written and model-generated LFQA answers.
arXiv Detail & Related papers (2024-07-16T17:23:16Z)
- What if you said that differently?: How Explanation Formats Affect Human Feedback Efficacy and User Perception [53.4840989321394]
We analyze the effect of rationales generated by QA models to support their answers.
We present users with incorrect answers and corresponding rationales in various formats.
We measure the effectiveness of this feedback in patching these rationales through in-context learning.
arXiv Detail & Related papers (2023-11-16T04:26:32Z)
- Model Analysis & Evaluation for Ambiguous Question Answering [0.0]
Question Answering models are required to generate long-form answers that often combine conflicting pieces of information.
Recent advances in the field have shown strong capabilities in generating fluent responses, but certain research questions remain unanswered.
We aim to thoroughly investigate these aspects, and provide valuable insights into the limitations of the current approaches.
arXiv Detail & Related papers (2023-05-21T15:20:20Z)
- CREPE: Open-Domain Question Answering with False Presuppositions [92.20501870319765]
We introduce CREPE, a QA dataset containing a natural distribution of presupposition failures from online information-seeking forums.
We find that 25% of questions contain false presuppositions, and provide annotations for these presuppositions and their corrections.
We show that adaptations of existing open-domain QA models can find presuppositions moderately well, but struggle when predicting whether a presupposition is factually correct.
arXiv Detail & Related papers (2022-11-30T18:54:49Z)
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that generating the lecture and explanation as a chain of thought improves question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data (a prompt-format sketch appears after this list).
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
- Grow-and-Clip: Informative-yet-Concise Evidence Distillation for Answer Explanation [22.20733260041759]
We argue that the evidence for an answer is critical to enhancing the interpretability of QA models.
We are the first to explicitly define the concept of evidence as the supporting facts in a context which are informative, concise, and readable.
We propose the Grow-and-Clip Evidence Distillation (GCED) algorithm to extract evidence from contexts by trading off informativeness, conciseness, and readability.
arXiv Detail & Related papers (2022-01-13T17:18:17Z)
- A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers [66.11048565324468]
We present a dataset of 5,049 questions over 1,585 Natural Language Processing papers.
Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text.
We find that existing models that do well on other QA tasks do not perform well on answering these questions, underperforming humans by at least 27 F1 points when answering them from entire papers.
arXiv Detail & Related papers (2021-05-07T00:12:34Z)
- Mitigating False-Negative Contexts in Multi-document Question Answering with Retrieval Marginalization [29.797379277423143]
We develop a new parameterization of set-valued retrieval that properly handles unanswerable queries.
We show that marginalizing over this set during training allows a model to mitigate false negatives in annotated supporting evidence (a minimal loss sketch appears after this list).
On IIRC, we show that joint modeling with marginalization on alternative contexts improves model performance by 5.5 F1 points and achieves a new state-of-the-art performance of 50.6 F1.
arXiv Detail & Related papers (2021-03-22T23:44:35Z)
- Challenges in Information-Seeking QA: Unanswerable Questions and Paragraph Retrieval [46.3246135936476]
We analyze why answering information-seeking queries is more challenging and where the prevalent unanswerable cases arise.
Our controlled experiments suggest two sources of headroom: paragraph selection and answerability prediction.
We manually annotate 800 unanswerable examples across six languages on what makes them challenging to answer.
arXiv Detail & Related papers (2020-10-22T17:48:17Z)
- PRover: Proof Generation for Interpretable Reasoning over Rules [81.40404921232192]
We propose a transformer-based model that answers binary questions over rule-bases and generates the corresponding proofs.
Our model learns to predict nodes and edges corresponding to proof graphs in an efficient constrained training paradigm.
We conduct experiments on synthetic, hand-authored, and human-paraphrased rule-bases to show promising results for QA and proof generation.
arXiv Detail & Related papers (2020-10-06T15:47:53Z)
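For the Science Question Answering entry above, the reported gains from explanations can be pictured as few-shot prompting in which each exemplar's answer is accompanied by its explanation as a chain of thought. The sketch below is a text-only, hypothetical prompt format; the field names are illustrative and not the benchmark's actual schema.

```python
from typing import List


def build_cot_prompt(exemplars: List[dict], question: str, choices: List[str]) -> str:
    """Assemble a few-shot prompt in which each exemplar shows its reasoning before the answer."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Question: {ex['question']}\n"
            f"Options: {', '.join(ex['choices'])}\n"
            f"Explanation: {ex['explanation']}\n"
            f"Answer: {ex['answer']}\n"
        )
    # The model is prompted to produce its explanation first, then the answer.
    parts.append(
        f"Question: {question}\n"
        f"Options: {', '.join(choices)}\n"
        f"Explanation:"
    )
    return "\n".join(parts)
```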
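For the retrieval-marginalization entry above, the idea of marginalizing over a set of retrieved contexts can be written as a log-sum-exp over per-context likelihoods. The loss below is a generic sketch of that kind of objective, assuming the model exposes log-probabilities for each context and for the answer given each context; it is not the paper's exact set-valued parameterization.

```python
import torch


def marginal_nll(
    retrieval_log_probs: torch.Tensor,  # shape [num_contexts]: log p(context | question)
    answer_log_probs: torch.Tensor,     # shape [num_contexts]: log p(answer | question, context)
) -> torch.Tensor:
    """Negative log-likelihood of the answer, marginalized over retrieved contexts.

    Spreading probability mass over plausible contexts means training is not
    derailed when the annotated gold context set misses a valid alternative
    (a false-negative context).
    """
    # log sum_c p(c | q) * p(a | q, c), computed stably in log space
    joint = retrieval_log_probs + answer_log_probs
    return -torch.logsumexp(joint, dim=0)
```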