Vera: A General-Purpose Plausibility Estimation Model for Commonsense
Statements
- URL: http://arxiv.org/abs/2305.03695v3
- Date: Wed, 18 Oct 2023 14:48:51 GMT
- Title: Vera: A General-Purpose Plausibility Estimation Model for Commonsense
Statements
- Authors: Jiacheng Liu, Wenya Wang, Dianzhuo Wang, Noah A. Smith, Yejin Choi,
Hannaneh Hajishirzi
- Abstract summary: We introduce Vera, a general-purpose model that estimates the plausibility of declarative statements based on commonsense knowledge.
It is trained on ~7M commonsense statements created from 19 QA datasets and two large-scale knowledge bases.
We find that Vera excels at filtering LM-generated commonsense knowledge and is useful in detecting erroneous commonsense statements generated by models like ChatGPT in real-world settings.
- Score: 135.09277663808322
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the much discussed capabilities of today's language models, they are
still prone to silly and unexpected commonsense failures. We consider a
retrospective verification approach that reflects on the correctness of LM
outputs, and introduce Vera, a general-purpose model that estimates the
plausibility of declarative statements based on commonsense knowledge. Trained
on ~7M commonsense statements created from 19 QA datasets and two large-scale
knowledge bases, and with a combination of three training objectives, Vera is a
versatile model that effectively separates correct from incorrect statements
across diverse commonsense domains. When applied to solving commonsense
problems in the verification format, Vera substantially outperforms existing
models that can be repurposed for commonsense verification, and it further
exhibits generalization capabilities to unseen tasks and provides
well-calibrated outputs. We find that Vera excels at filtering LM-generated
commonsense knowledge and is useful in detecting erroneous commonsense
statements generated by models like ChatGPT in real-world settings.
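As a rough illustration of the verification format, the sketch below scores candidate declarative statements with a plausibility classifier and keeps only those above a threshold, mirroring the knowledge-filtering use case. The checkpoint name, the single-logit scoring head, and the 0.5 threshold are illustrative assumptions rather than the released Vera implementation.

```python
# A minimal sketch of commonsense verification as statement scoring.
# The checkpoint name, single-logit head, and threshold below are
# illustrative assumptions, not the released Vera model or its API.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "some-org/plausibility-scorer"  # hypothetical checkpoint name

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=1)
model.eval()

def plausibility(statements: list[str]) -> torch.Tensor:
    """Return a score in [0, 1] for each declarative statement."""
    batch = tokenizer(statements, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits.squeeze(-1)  # one logit per statement
    return torch.sigmoid(logits)

def filter_statements(statements: list[str], threshold: float = 0.5) -> list[str]:
    """Keep only statements judged plausible, e.g. to filter LM-generated knowledge."""
    scores = plausibility(statements)
    return [s for s, p in zip(statements, scores) if p.item() >= threshold]

if __name__ == "__main__":
    candidates = [
        "Lemons taste sour.",
        "The moon is made of marshmallows.",
    ]
    print(filter_statements(candidates))
```

To solve a multiple-choice commonsense problem in this format, each answer option would first be converted into a declarative statement and the highest-scoring statement selected.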
Related papers
- FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows" [74.7488607599921]
FaithEval is a benchmark to evaluate the faithfulness of large language models (LLMs) in contextual scenarios.
FaithEval comprises 4.9K high-quality problems in total, validated through a rigorous four-stage context construction and validation framework.
arXiv Detail & Related papers (2024-09-30T06:27:53Z)
- Evaluating Consistency and Reasoning Capabilities of Large Language Models [0.0]
Large Language Models (LLMs) are extensively used today across various sectors, including academia, research, business, and finance.
Despite their widespread adoption, these models often produce incorrect and misleading information, exhibiting a tendency to hallucinate.
This paper aims to evaluate and compare the consistency and reasoning capabilities of both public and proprietary LLMs.
arXiv Detail & Related papers (2024-04-25T10:03:14Z)
- Knowledge-Augmented Language Model Verification [68.6099592486075]
Recent Language Models (LMs) have shown impressive capabilities in generating text from the knowledge internalized in their parameters.
We propose to verify both the output and the knowledge of knowledge-augmented LMs with a separate verifier.
Our results show that the proposed verifier effectively identifies retrieval and generation errors, allowing LMs to provide more factually correct outputs.
arXiv Detail & Related papers (2023-10-19T15:40:00Z)
- Commonsense Knowledge Transfer for Pre-trained Language Models [83.01121484432801]
We introduce commonsense knowledge transfer, a framework to transfer the commonsense knowledge stored in a neural commonsense knowledge model to a general-purpose pre-trained language model.
It first exploits general texts to form queries for extracting commonsense knowledge from the neural commonsense knowledge model.
It then refines the language model with two self-supervised objectives: commonsense mask infilling and commonsense relation prediction; a rough sketch of the mask-infilling step appears after this list.
arXiv Detail & Related papers (2023-06-04T15:44:51Z)
- Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
Under the elaborated robustness metric, a model is judged robust only if its performance is consistently accurate across the examples of each clique.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
- FactKB: Generalizable Factuality Evaluation using Language Models Enhanced with Factual Knowledge [37.2179237007464]
We propose FactKB, a simple new approach to factuality evaluation that is generalizable across domains.
We introduce three types of complementary factuality pretraining objectives based on direct entity facts, facts grounded in auxiliary knowledge about entities, and facts constructed compositionally through knowledge base walks.
The resulting factuality evaluation model achieves state-of-the-art performance on two in-domain news summarization benchmarks and on three out-of-domain scientific literature datasets.
arXiv Detail & Related papers (2023-05-14T23:58:05Z)
- Exploring the Trade-off between Plausibility, Change Intensity and Adversarial Power in Counterfactual Explanations using Multi-objective Optimization [73.89239820192894]
We argue that automated counterfactual generation should take into account several aspects of the produced adversarial instances.
We present a novel framework for the generation of counterfactual examples.
arXiv Detail & Related papers (2022-05-20T15:02:53Z)
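For the Commonsense Knowledge Transfer entry above, the following sketch shows one way the first objective's training pairs could be constructed: verbalize a (head, relation, tail) commonsense triple and mask out its tail for a T5-style model to infill. The relation templates, field names, and sentinel token are assumptions for illustration, not the paper's exact recipe.

```python
# A minimal sketch of turning (head, relation, tail) commonsense triples into
# mask-infilling training pairs; the templates and the T5-style sentinel token
# are illustrative assumptions, not the paper's exact data construction.
from dataclasses import dataclass

# Hypothetical verbalization templates for a few ConceptNet-style relations.
TEMPLATES = {
    "CapableOf": "{head} is capable of {tail}",
    "UsedFor": "{head} is used for {tail}",
    "Causes": "{head} causes {tail}",
}

@dataclass
class InfillExample:
    source: str  # verbalized triple with the tail span masked out
    target: str  # the masked tail span the model should recover

def to_infill_example(head: str, relation: str, tail: str,
                      mask_token: str = "<extra_id_0>") -> InfillExample:
    """Verbalize a triple, then mask its tail to form a mask-infilling pair."""
    source = TEMPLATES[relation].format(head=head, tail=mask_token)
    return InfillExample(source=source, target=tail)

if __name__ == "__main__":
    ex = to_infill_example("a knife", "UsedFor", "cutting bread")
    print(ex.source)  # a knife is used for <extra_id_0>
    print(ex.target)  # cutting bread
```

The second objective, commonsense relation prediction, could be set up analogously by asking the model to predict the relation given the verbalized head and tail.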
This list is automatically generated from the titles and abstracts of the papers on this site.