BaRDa: A Belief and Reasoning Dataset that Separates Factual Accuracy and Reasoning Ability
- URL: http://arxiv.org/abs/2312.07527v2
- Date: Sat, 23 Mar 2024 21:43:04 GMT
- Title: BaRDa: A Belief and Reasoning Dataset that Separates Factual Accuracy and Reasoning Ability
- Authors: Peter Clark, Bhavana Dalvi Mishra, Oyvind Tafjord
- Abstract summary: The BaRDa dataset contains 3000 entailments (1787 valid, 1213 invalid), built from 6681 true and 2319 false statements.
Testing four GPT-series models (GPT3-curie, GPT3-davinci, GPT-3.5, and GPT-4), we find factual accuracy (truth) scores of 74.1/80.6/82.6/87.1 and reasoning accuracy scores of 63.1/78.0/71.8/79.2.
This shows the clear progression of models towards improved factual accuracy and entailment reasoning.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While there are numerous benchmarks comparing the performance of modern language models (LMs), end-task evaluations often conflate notions of *factual accuracy* ("truth") and *reasoning ability* ("rationality", or "honesty" in the sense of correctly reporting implications of beliefs). Our goal is a dataset that clearly distinguishes these two notions. Our approach is to leverage and extend a collection of human-annotated *entailment trees*, engineered to express both good and bad chains of reasoning and to use a mixture of true and false facts, in particular including counterfactual examples, to avoid belief bias (also known as the "content effect"). The resulting dataset, called BaRDa, contains 3000 entailments (1787 valid, 1213 invalid), using 6681 true and 2319 false statements. Testing on four GPT-series models, GPT3(curie)/GPT3(davinci)/3.5/4, we find factual accuracy (truth) scores of 74.1/80.6/82.6/87.1 and reasoning accuracy scores of 63.1/78.0/71.8/79.2. This shows the clear progression of models towards improved factual accuracy and entailment reasoning, and the dataset provides a new benchmark that more cleanly separates and quantifies these two notions.
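To make the two metrics concrete, here is a minimal sketch (not the authors' released code) of how factual accuracy and reasoning accuracy could be scored separately over BaRDa-style examples. The JSONL layout and field names (`statements`, `is_true`, `premises`, `hypothesis`, `is_valid`) are assumptions for illustration; the released dataset's schema may differ.

```python
# Minimal sketch (not the authors' released code) of scoring a model on
# BaRDa-style data. The JSONL schema and field names below are assumptions
# for illustration; the released dataset may use different names.
import json
from typing import Callable, List, Tuple

def score_barda(path: str,
                believes_true: Callable[[str], bool],
                believes_entailed: Callable[[List[str], str], bool]) -> Tuple[float, float]:
    """Return (factual_accuracy, reasoning_accuracy) over a JSONL file."""
    truth_hits = truth_total = 0
    reason_hits = reason_total = 0
    with open(path) as f:
        for line in f:
            ex = json.loads(line)
            # Factual accuracy: does the model's truth judgment of each
            # statement match the gold is_true label?
            for stmt in ex["statements"]:  # hypothetical field
                truth_hits += believes_true(stmt["text"]) == stmt["is_true"]
                truth_total += 1
            # Reasoning accuracy: does the model's validity judgment of the
            # entailment match the gold label, regardless of whether the
            # statements themselves are actually true?
            reason_hits += believes_entailed(ex["premises"], ex["hypothesis"]) == ex["is_valid"]
            reason_total += 1
    return truth_hits / truth_total, reason_hits / reason_total
```

The two judgment functions can be backed by any LM (for example, via yes/no prompting); keeping them separate is what lets truth and rationality be scored independently, including on counterfactual examples where a valid entailment rests on false premises.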
Related papers
- Language Models Struggle to Achieve a Consistent Temporal Representation of Facts
We introduce TimeStress, a novel dataset comprising 521K statements covering 2,003 of the most popular temporal facts in Wikidata.
Each statement contextualizes a fact with correct and incorrect dates across three precisions (Day, Month, Year).
We evaluate LMs' ability to discern between correct and incorrect temporal statements based on their probability of being generated.
arXiv Detail & Related papers (2025-02-03T10:24:55Z)
- How Entangled is Factuality and Deception in German?
Research on deception detection and fact checking often conflates factual accuracy with the truthfulness of statements.
The belief-based deception framework disentangles these properties by defining texts as deceptive when there is a mismatch between what people say and what they truly believe.
We test the effectiveness of computational models in detecting deception using an established corpus of belief-based argumentation.
arXiv Detail & Related papers (2024-09-30T10:23:13Z)
- FactGenius: Combining Zero-Shot Prompting and Fuzzy Relation Mining to Improve Fact Verification with Knowledge Graphs
We present FactGenius, a novel method that enhances fact-checking by combining zero-shot prompting of large language models with fuzzy text matching on knowledge graphs.
Evaluation of FactGenius on FactKG, a benchmark dataset for fact verification, demonstrates that it significantly outperforms existing baselines.
arXiv Detail & Related papers (2024-06-03T13:24:37Z)
- Pre-training and Diagnosing Knowledge Base Completion Models
We introduce and analyze an approach to knowledge transfer from one collection of facts to another without the need for entity or relation matching.
The main contribution is a method that can make use of large-scale pre-training on facts collected from unstructured text.
To understand the obtained pre-trained models better, we then introduce a novel dataset for the analysis of pre-trained models for Open Knowledge Base Completion.
arXiv Detail & Related papers (2024-01-27T15:20:43Z)
- Zero-shot Faithful Factual Error Correction
Faithfully correcting factual errors is critical for maintaining the integrity of textual knowledge bases and preventing hallucinations in sequence-to-sequence models.
We present a zero-shot framework that formulates questions about input claims, looks for correct answers in the given evidence, and assesses the faithfulness of each correction based on its consistency with the evidence.
arXiv Detail & Related papers (2023-05-13T18:55:20Z)
- Faithful Chain-of-Thought Reasoning
Chain-of-Thought (CoT) prompting boosts the performance of language models (LMs) on a gamut of reasoning tasks.
We propose Faithful CoT, a reasoning framework involving two stages: Translation and Problem Solving.
This guarantees that the reasoning chain provides a faithful explanation of the final answer.
arXiv Detail & Related papers (2023-01-31T03:04:26Z)
- VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives
We show that feature importance (FI) supervision can meaningfully improve VQA model accuracy as well as performance on several Right-for-the-Right-Reason metrics.
Our best performing method, Visual Feature Importance Supervision (VisFIS), outperforms strong baselines on benchmark VQA datasets.
Predictions are more accurate when explanations are plausible and faithful, and not when they are plausible but not faithful.
arXiv Detail & Related papers (2022-06-22T17:02:01Z)
- AmbiFC: Fact-Checking Ambiguous Claims with Evidence
We present AmbiFC, a fact-checking dataset with 10k claims derived from real-world information needs.
We analyze disagreements arising from ambiguity when comparing claims against evidence in AmbiFC.
We develop models for predicting veracity that handle this ambiguity via soft labels.
arXiv Detail & Related papers (2021-04-01T17:40:08Z)