Tell Me Why: Explainable Public Health Fact-Checking with Large Language Models
- URL: http://arxiv.org/abs/2405.09454v1
- Date: Wed, 15 May 2024 15:49:06 GMT
- Title: Tell Me Why: Explainable Public Health Fact-Checking with Large Language Models
- Authors: Majid Zarharan, Pascal Wullschleger, Babak Behkam Kia, Mohammad Taher Pilehvar, Jennifer Foster
- Abstract summary: This paper focuses on the ability of large language models to verify public health claims.
We examine the effectiveness of zero/few-shot prompting and parameter-efficient fine-tuning across various open and closed-source models.
- Score: 21.280725490520798
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a comprehensive analysis of explainable fact-checking through a series of experiments, focusing on the ability of large language models to verify public health claims and provide explanations or justifications for their veracity assessments. We examine the effectiveness of zero/few-shot prompting and parameter-efficient fine-tuning across various open and closed-source models, assessing their performance on both isolated and joint tasks of veracity prediction and explanation generation. Importantly, we employ a dual evaluation approach comprising previously established automatic metrics and a novel set of criteria through human evaluation. Our automatic evaluation indicates that, within the zero-shot scenario, GPT-4 emerges as the standout performer, but in few-shot and parameter-efficient fine-tuning contexts, open-source models demonstrate their capacity not only to bridge the performance gap but, in some instances, to surpass GPT-4. Human evaluation reveals further nuance and indicates potential problems with the gold explanations.
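To make the prompting setup concrete, below is a minimal sketch of zero-shot veracity prediction with explanation generation, assuming an open-source instruction-tuned model served through the Hugging Face transformers pipeline. The model checkpoint, prompt wording, and label set are illustrative assumptions, not the paper's exact template.

```python
# Minimal sketch of zero-shot veracity prediction plus explanation generation.
# Assumptions (not from the paper): the model checkpoint, the prompt template,
# and the label set (true / false / mixture / unproven) are illustrative only.
from transformers import pipeline

generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

def verify_claim(claim: str, evidence: str) -> str:
    """Ask the model for a verdict and a short justification in one pass."""
    prompt = (
        "You are a fact-checker for public health claims.\n"
        f"Claim: {claim}\n"
        f"Evidence: {evidence}\n"
        "Label the claim as true, false, mixture, or unproven, then explain "
        "your verdict in two or three sentences.\nAnswer:"
    )
    out = generator(prompt, max_new_tokens=200, do_sample=False)
    # The pipeline echoes the prompt, so strip it from the generated text.
    return out[0]["generated_text"][len(prompt):].strip()

print(verify_claim(
    "Vitamin C cures the common cold.",
    "Clinical trials show vitamin C may slightly shorten cold duration but does not cure it.",
))
```

In the few-shot variant, labelled claim-evidence-explanation examples would simply be prepended to the prompt, and the parameter-efficient fine-tuning setting would instead adapt an open-source model with an adapter-style method such as LoRA; both are sketched here only as plausible instantiations of the settings the abstract describes.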
Related papers
- Self-Rationalization in the Wild: A Large Scale Out-of-Distribution Evaluation on NLI-related tasks [59.47851630504264]
Free-text explanations are expressive and easy to understand, but many datasets lack annotated explanation data.
We fine-tune T5-Large and OLMo-7B models and assess the impact of fine-tuning data quality, the number of fine-tuning samples, and few-shot selection methods.
The models are evaluated on 19 diverse OOD datasets across three tasks: natural language inference (NLI), fact-checking, and hallucination detection in abstractive summarization.
arXiv Detail & Related papers (2025-02-07T10:01:32Z)
- Benchmark on Peer Review Toxic Detection: A Challenging Task with a New Dataset [6.106100820330045]
This work explores an important but underexplored area: detecting toxicity in peer reviews.
We first define toxicity in peer reviews across four distinct categories and curate a dataset of peer reviews from the OpenReview platform.
We benchmark a variety of models, including a dedicated toxicity detection model and a sentiment analysis model.
arXiv Detail & Related papers (2025-02-01T23:01:39Z)
- Comparative Insights from 12 Machine Learning Models in Extracting Economic Ideology from Political Text [0.0]
This study conducts a systematic assessment of the capabilities of 12 machine learning models and model variations in detecting economic ideology.
The analysis assesses the performance of several generative, fine-tuned, and zero-shot models at the granular and aggregate levels.
arXiv Detail & Related papers (2025-01-16T18:06:22Z)
- Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models [51.067146460271466]
Evaluation of visual generative models can be time-consuming and computationally expensive.
We propose the Evaluation Agent framework, which employs human-like strategies for efficient, dynamic, multi-round evaluations.
It offers four key advantages: 1) efficiency, 2) promptable evaluation tailored to diverse user needs, 3) explainability beyond single numerical scores, and 4) scalability across various models and tools.
arXiv Detail & Related papers (2024-12-10T18:52:39Z)
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
- Zero-Shot Multi-task Hallucination Detection [8.539639901976594]
Hallucination is an emergent condition in which a model's generated text lacks faithfulness to the source.
We formally define hallucination and propose a framework for its quantitative detection in a zero-shot setting.
In detecting hallucinations, our solution achieves an accuracy of 0.78 in a model-aware setting and 0.61 in a model-agnostic setting.
arXiv Detail & Related papers (2024-03-18T20:50:26Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks [9.801767683867125]
We provide a preliminary and hybrid evaluation on three NLP benchmarks using both automatic and human evaluation.
We find that ChatGPT consistently outperforms many other popular models according to human reviewers on the majority of metrics.
We also find that human reviewers rate the gold reference as much worse than the best models' outputs, indicating the poor quality of many popular benchmarks.
arXiv Detail & Related papers (2023-10-20T20:17:09Z)
- SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation [78.23119125463964]
We develop SocREval, a novel approach for prompt design in reference-free reasoning evaluation.
SocREval significantly improves GPT-4's performance, surpassing existing reference-free and reference-based reasoning evaluation metrics.
arXiv Detail & Related papers (2023-09-29T18:25:46Z)
- OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics [53.779709191191685]
We propose OpenMEVA, a benchmark for evaluating open-ended story generation metrics.
OpenMEVA provides a comprehensive test suite to assess the capabilities of metrics.
We observe that existing metrics have poor correlation with human judgments, fail to recognize discourse-level incoherence, and lack inferential knowledge.
arXiv Detail & Related papers (2021-05-19T04:45:07Z)