Tell Me Why: Explainable Public Health Fact-Checking with Large Language Models
- URL: http://arxiv.org/abs/2405.09454v1
- Date: Wed, 15 May 2024 15:49:06 GMT
- Title: Tell Me Why: Explainable Public Health Fact-Checking with Large Language Models
- Authors: Majid Zarharan, Pascal Wullschleger, Babak Behkam Kia, Mohammad Taher Pilehvar, Jennifer Foster
- Abstract summary: This paper focuses on the ability of large language models to verify public health claims.
We examine the effectiveness of zero/few-shot prompting and parameter-efficient fine-tuning across various open and closed-source models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a comprehensive analysis of explainable fact-checking through a series of experiments, focusing on the ability of large language models to verify public health claims and provide explanations or justifications for their veracity assessments. We examine the effectiveness of zero/few-shot prompting and parameter-efficient fine-tuning across various open and closed-source models, examining their performance in both isolated and joint tasks of veracity prediction and explanation generation. Importantly, we employ a dual evaluation approach comprising previously established automatic metrics and a novel set of criteria through human evaluation. Our automatic evaluation indicates that, within the zero-shot scenario, GPT-4 emerges as the standout performer, but in few-shot and parameter-efficient fine-tuning contexts, open-source models demonstrate their capacity to not only bridge the performance gap but, in some instances, surpass GPT-4. Human evaluation reveals yet more nuance as well as indicating potential problems with the gold explanations.
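The zero-shot setup described above can be illustrated with a minimal prompt-construction sketch. The label set (true / false / mixture / unproven) follows common public health fact-checking datasets, and the exact prompt wording here is an illustrative assumption, not the paper's actual prompt:

```python
# Sketch of a zero-shot prompt for the joint task: veracity prediction
# plus explanation generation. Label set and wording are assumptions.
LABELS = ["true", "false", "mixture", "unproven"]

def build_zero_shot_prompt(claim: str, evidence: str) -> str:
    """Compose a single prompt asking the model for a verdict and a justification."""
    return (
        "You are a public health fact-checker.\n"
        f"Claim: {claim}\n"
        f"Evidence: {evidence}\n"
        f"Classify the claim as one of: {', '.join(LABELS)}, "
        "then justify your verdict in 2-3 sentences.\n"
        "Verdict:"
    )

prompt = build_zero_shot_prompt(
    "Vitamin C cures the common cold.",
    "Clinical trials show vitamin C may slightly shorten colds but does not cure them.",
)
```

In the few-shot variant, labeled claim-evidence-verdict examples would simply be prepended to the same prompt before the target claim.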
Related papers
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
- Fine-tuning Large Language Models for Automated Diagnostic Screening Summaries [0.024105148723769353]
We evaluate several state-of-the-art Large Language Models (LLMs) for generating concise summaries from mental state examinations.
We rigorously evaluate four different models for summary generation using established ROUGE metrics and input from human evaluators.
Our top-performing fine-tuned model outperforms existing models, achieving ROUGE-1 and ROUGE-L values of 0.810 and 0.764, respectively.
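The ROUGE-1 and ROUGE-L F1 scores reported here can be reproduced with standard packages such as `rouge-score`; a minimal pure-Python sketch is below. Whitespace tokenization is a simplifying assumption; real implementations also normalize case and apply stemming:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: F-measure over overlapping unigrams (counts clipped)."""
    ref, cand = reference.split(), candidate.split()
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1: F-measure based on the longest common subsequence (LCS)."""
    ref, cand = reference.split(), candidate.split()
    # Dynamic-programming table for LCS length.
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref):
        for j, c in enumerate(cand):
            dp[i + 1][j + 1] = dp[i][j] + 1 if r == c else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

ROUGE-1 rewards any word overlap regardless of order, while ROUGE-L requires the overlapping words to appear in the same relative order, which is why both are typically reported together.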
arXiv Detail & Related papers (2024-03-29T12:25:37Z)
- Zero-Shot Multi-task Hallucination Detection [8.539639901976594]
Hallucination is an emergent condition in which a model's generated text lacks faithfulness to the source.
We formally define hallucination and propose a framework for its quantitative detection in a zero-shot setting.
In detecting hallucinations, our solution achieves an accuracy of 0.78 in a model-aware setting and 0.61 in a model-agnostic setting.
arXiv Detail & Related papers (2024-03-18T20:50:26Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks [9.801767683867125]
We provide a preliminary and hybrid evaluation on three NLP benchmarks using both automatic and human evaluation.
We find that ChatGPT consistently outperforms many other popular models according to human reviewers on the majority of metrics.
We also find that human reviewers rate the gold reference as much worse than the best models' outputs, indicating the poor quality of many popular benchmarks.
arXiv Detail & Related papers (2023-10-20T20:17:09Z)
- SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation [78.23119125463964]
We develop SocREval, a novel approach for prompt design in reference-free reasoning evaluation.
SocREval significantly improves GPT-4's performance, surpassing existing reference-free and reference-based reasoning evaluation metrics.
arXiv Detail & Related papers (2023-09-29T18:25:46Z)
- OPT-R: Exploring the Role of Explanations in Finetuning and Prompting for Reasoning Skills of Large Language Models [48.412284346337344]
We conduct a thorough investigation into the reasoning capabilities of Large Language Models (LLMs).
Our study entails finetuning three different sizes of Open Pretrained Transformers (OPT)
We then evaluate all models on 57 out-of-domain tasks drawn from the SUPER-NATURALINSTRUCTIONS benchmark.
arXiv Detail & Related papers (2023-05-19T20:58:22Z)
- OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics [53.779709191191685]
We propose OpenMEVA, a benchmark for evaluating open-ended story generation metrics.
OpenMEVA provides a comprehensive test suite to assess the capabilities of metrics.
We observe that existing metrics have poor correlation with human judgments, fail to recognize discourse-level incoherence, and lack inferential knowledge.
arXiv Detail & Related papers (2021-05-19T04:45:07Z)
- Generating Fact Checking Explanations [52.879658637466605]
A crucial missing piece of the puzzle is understanding how to automate the most elaborate part of the fact-checking process: generating explanations.
This paper provides the first study of how these explanations can be generated automatically based on available claim context.
Our results indicate that optimising both objectives at the same time, rather than training them separately, improves the performance of a fact checking system.
arXiv Detail & Related papers (2020-04-13T05:23:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.