Posthoc Verification and the Fallibility of the Ground Truth
- URL: http://arxiv.org/abs/2106.07353v1
- Date: Wed, 2 Jun 2021 17:57:09 GMT
- Title: Posthoc Verification and the Fallibility of the Ground Truth
- Authors: Yifan Ding, Nicholas Botzer, Tim Weninger
- Abstract summary: We conduct a systematic posthoc verification experiment on the entity linking (EL) task.
Compared to pre-annotation evaluation, state-of-the-art EL models performed extremely well according to the posthoc evaluation methodology.
Surprisingly, we find predictions from EL models had a similar or higher verification rate than the ground truth.
- Score: 10.427125361534966
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Classifiers commonly make use of pre-annotated datasets, wherein a model is
evaluated by pre-defined metrics on a held-out test set typically made of
human-annotated labels. Metrics used in these evaluations are tied to the
availability of well-defined ground truth labels, and these metrics typically
do not allow for inexact matches. These noisy ground truth labels and strict
evaluation metrics may compromise the validity and realism of evaluation
results. In the present work, we discuss these concerns and conduct a
systematic posthoc verification experiment on the entity linking (EL) task.
Unlike traditional methodologies, which ask annotators to provide free-form
annotations, we ask annotators to verify the correctness of annotations after
the fact (i.e., posthoc). Compared to pre-annotation evaluation,
state-of-the-art EL models performed extremely well according to the posthoc
evaluation methodology. Posthoc validation also permits the validation of the
ground truth dataset. Surprisingly, we find predictions from EL models had a
similar or higher verification rate than the ground truth. We conclude with a
discussion on these findings and recommendations for future evaluations.
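To make the contrast concrete, here is a minimal sketch that scores a handful of hypothetical entity-linking predictions in two ways: strict exact-match accuracy against pre-annotated gold labels, and a posthoc verification rate computed from yes/no verdicts collected after the fact. The mentions, entity identifiers, and verdicts below are invented for illustration; they are not data from the paper.

```python
# Minimal sketch (hypothetical data): strict exact-match accuracy vs. a
# posthoc verification rate for an entity-linking (EL) evaluation.

# Each record: mention text, gold label from the pre-annotated dataset,
# the model's predicted entity, and a human yes/no verdict collected
# posthoc ("is this prediction a correct link for the mention?").
records = [
    {"mention": "Paris",   "gold": "Paris",           "pred": "Paris",                       "verified": True},
    {"mention": "Lincoln", "gold": "Abraham_Lincoln", "pred": "Lincoln,_Nebraska",           "verified": False},
    {"mention": "Java",    "gold": "Java_(island)",   "pred": "Java_(programming_language)", "verified": True},
    {"mention": "Apple",   "gold": "Apple_Inc.",      "pred": "Apple_Inc.",                  "verified": True},
]

# Strict pre-annotation metric: a prediction counts only if it exactly
# matches the gold label, so defensible but inexact links score zero.
exact_match = sum(r["pred"] == r["gold"] for r in records) / len(records)

# Posthoc metric: a prediction counts if a human verifier accepts it,
# regardless of whether it matches the (possibly noisy) gold label.
verification_rate = sum(r["verified"] for r in records) / len(records)

print(f"exact-match accuracy: {exact_match:.2f}")        # 0.50
print(f"posthoc verification: {verification_rate:.2f}")  # 0.75
```

Because the same yes/no verdicts can also be collected for the gold labels themselves, the identical rate can be computed for the ground truth, which is what lets model predictions and the ground truth be compared on equal footing.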
Related papers
- FactLens: Benchmarking Fine-Grained Fact Verification [6.814173254027381]
We advocate for a shift toward fine-grained verification, where complex claims are broken down into smaller sub-claims for individual verification.
We introduce FactLens, a benchmark for evaluating fine-grained fact verification, with metrics and automated evaluators of sub-claim quality.
Our results show alignment between automated FactLens evaluators and human judgments, and we discuss the impact of sub-claim characteristics on the overall verification performance.
arXiv Detail & Related papers (2024-11-08T21:26:57Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Weak Supervision Performance Evaluation via Partial Identification [46.73061437177238]
Programmatic Weak Supervision (PWS) enables supervised model training without direct access to ground truth labels.
We present a novel method to address this challenge by framing model evaluation as a partial identification problem.
Our approach derives reliable bounds on key metrics without requiring labeled data, overcoming core limitations in current weak supervision evaluation techniques.
arXiv Detail & Related papers (2023-12-07T07:15:11Z)
- Evaluating AI systems under uncertain ground truth: a case study in dermatology [44.80772162289557]
We propose a metric for measuring annotation uncertainty and provide uncertainty-adjusted metrics for performance evaluation.
We present a case study applying our framework to skin condition classification from images where annotations are provided in the form of differential diagnoses.
arXiv Detail & Related papers (2023-07-05T10:33:45Z)
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- Interpretable Automatic Fine-grained Inconsistency Detection in Text Summarization [56.94741578760294]
We propose the task of fine-grained inconsistency detection, the goal of which is to predict the fine-grained types of factual errors in a summary.
Motivated by how humans inspect factual inconsistency in summaries, we propose an interpretable fine-grained inconsistency detection model, FineGrainFact.
arXiv Detail & Related papers (2023-05-23T22:11:47Z)
- Exploring validation metrics for offline model-based optimisation with diffusion models [50.404829846182764]
In model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of reward with respect to a black box function called the (ground truth) oracle.
While an approximation to the ground truth oracle can be trained and used in place of it during model validation to measure the mean reward over generated candidates, the evaluation is approximate and vulnerable to adversarial examples.
This is encapsulated under our proposed evaluation framework which is also designed to measure extrapolation.
arXiv Detail & Related papers (2022-11-19T16:57:37Z)
- Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries [59.27273928454995]
Current pre-trained models applied to summarization are prone to factual inconsistencies which misrepresent the source text or introduce extraneous information.
We create a crowdsourcing evaluation framework for factual consistency using the rating-based Likert scale and ranking-based Best-Worst Scaling protocols.
We find that ranking-based protocols offer a more reliable measure of summary quality across datasets, while the reliability of Likert ratings depends on the target dataset and the evaluation design.
arXiv Detail & Related papers (2021-09-19T19:05:00Z)
- Performance metrics for intervention-triggering prediction models do not reflect an expected reduction in outcomes from using the model [71.9860741092209]
Clinical researchers often select among and evaluate risk prediction models.
Standard metrics calculated from retrospective data are only related to model utility under certain assumptions.
When predictions are delivered repeatedly throughout time, the relationship between standard metrics and utility is further complicated.
arXiv Detail & Related papers (2020-06-02T16:26:49Z)
- GPM: A Generic Probabilistic Model to Recover Annotator's Behavior and Ground Truth Labeling [34.48095564497967]
We propose a probabilistic graphical annotation model to infer the underlying ground truth and annotator's behavior.
The proposed model is able to identify whether an annotator has worked diligently towards the task during the labeling procedure.
arXiv Detail & Related papers (2020-03-01T12:14:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.