Facts are Harder Than Opinions -- A Multilingual, Comparative Analysis of LLM-Based Fact-Checking Reliability
- URL: http://arxiv.org/abs/2506.03655v1
- Date: Wed, 04 Jun 2025 07:47:21 GMT
- Title: Facts are Harder Than Opinions -- A Multilingual, Comparative Analysis of LLM-Based Fact-Checking Reliability
- Authors: Lorraine Saju, Arnim Bleier, Jana Lasser, Claudia Wagner
- Abstract summary: This paper introduces a novel, dynamically extensible data set that includes 61,514 claims in multiple languages and topics, extending existing datasets up to 2024. We evaluate five prominent Large Language Models (LLMs), including GPT-4o, GPT-3.5 Turbo, LLaMA 3.1, and Mixtral 8x7B. Across all models, factual-sounding claims are misclassified more often than opinions, revealing a key vulnerability.
- Score: 1.1135113962297134
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The proliferation of misinformation necessitates scalable, automated fact-checking solutions. Yet, current benchmarks often overlook multilingual and topical diversity. This paper introduces a novel, dynamically extensible data set that includes 61,514 claims in multiple languages and topics, extending existing datasets up to 2024. Through a comprehensive evaluation of five prominent Large Language Models (LLMs), including GPT-4o, GPT-3.5 Turbo, LLaMA 3.1, and Mixtral 8x7B, we identify significant performance gaps between different languages and topics. While overall GPT-4o achieves the highest accuracy, it declines to classify 43% of claims. Across all models, factual-sounding claims are misclassified more often than opinions, revealing a key vulnerability. These findings underscore the need for caution and highlight challenges in deploying LLM-based fact-checking systems at scale.
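The evaluation setup described in the abstract, prompting an LLM to label the veracity of a claim while allowing it to decline, can be illustrated with a minimal sketch. The label set, prompt wording, and `query_llm` helper below are illustrative assumptions, not the authors' actual protocol.

```python
# Minimal sketch of prompt-based claim veracity classification with an
# explicit abstain option. The label set, prompt wording, and query_llm
# callable are illustrative assumptions, not the paper's exact protocol.

LABELS = {"TRUE", "FALSE", "ABSTAIN"}

PROMPT_TEMPLATE = (
    "You are a fact-checker. Classify the following claim as TRUE or FALSE. "
    "If you cannot verify it, answer ABSTAIN.\n\nClaim: {claim}\nAnswer:"
)

def classify_claim(claim: str, query_llm) -> str:
    """Return TRUE, FALSE, or ABSTAIN for a single claim.

    query_llm is any callable that sends a prompt to an LLM and returns its
    raw text completion (e.g. a thin wrapper around an API client).
    """
    raw = query_llm(PROMPT_TEMPLATE.format(claim=claim)).strip().upper()
    # Anything outside the expected label set is treated as a refusal,
    # which is one way a "declines to classify" rate could be tallied.
    return raw if raw in LABELS else "ABSTAIN"

def abstention_rate(claims, query_llm) -> float:
    """Fraction of claims the model declines to classify."""
    verdicts = [classify_claim(c, query_llm) for c in claims]
    return verdicts.count("ABSTAIN") / len(verdicts)
```

Counting ABSTAIN verdicts this way is one plausible reading of how a declines-to-classify rate, such as the 43% reported for GPT-4o above, could be measured.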
Related papers
- When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification [14.187153195380668]
Large language models have remarkable capabilities across many NLP tasks, but their effectiveness for multilingual claim verification with nuanced classification schemes remains understudied. We evaluate five state-of-the-art language models on the X-Fact dataset, which spans 25 languages with seven distinct veracity categories. Surprisingly, we find that XLM-R substantially outperforms all tested LLMs, achieving 57.7% macro-F1 compared to the best LLM performance of 16.9% (the macro-F1 metric is sketched after this entry).
arXiv Detail & Related papers (2025-07-28T10:49:04Z)
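For context on the figures quoted in the entry above: macro-F1 is the unweighted mean of per-class F1 scores, so all seven veracity categories count equally regardless of how many claims each contains. A minimal, self-contained computation is sketched below; it matches scikit-learn's `f1_score(..., average="macro")`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every veracity category counts equally."""
    classes = set(y_true) | set(y_pred)
    per_class = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        per_class.append(f1)
    return sum(per_class) / len(per_class)

# Toy example with two classes: a model that always predicts "true" scores
# 0.8 on the "true" class and 0.0 on "false", so macro-F1 is about 0.4.
print(macro_f1(["true", "false", "true"], ["true", "true", "true"]))
```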
- Out of Style: RAG's Fragility to Linguistic Variation [29.59506089890902]
User queries exhibit greater linguistic variation than typical benchmark queries and can trigger cascading errors across interdependent RAG components. We analyze how varying four linguistic dimensions (formality, readability, politeness, and grammatical correctness) impacts RAG performance.
arXiv Detail & Related papers (2025-04-11T03:30:26Z)
- Fact-checking with Generative AI: A Systematic Cross-Topic Examination of LLMs Capacity to Detect Veracity of Political Information [0.0]
The purpose of this study is to assess how large language models (LLMs) can be used for fact-checking. We use an AI auditing methodology that systematically evaluates the performance of five LLMs. The results indicate that models are better at identifying false statements, especially on sensitive topics.
arXiv Detail & Related papers (2025-03-11T13:06:40Z)
- Calling a Spade a Heart: Gaslighting Multimodal Large Language Models via Negation [65.92001420372007]
This paper systematically evaluates state-of-the-art MLLMs across diverse benchmarks. We introduce GaslightingBench, the first benchmark specifically designed to evaluate the vulnerability of MLLMs to negation arguments.
arXiv Detail & Related papers (2025-01-31T10:37:48Z)
- Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following [51.18383180774354]
We introduce Multi-IF, a new benchmark designed to assess Large Language Models' proficiency in following multi-turn and multilingual instructions.
Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks.
Languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models' multilingual capabilities.
arXiv Detail & Related papers (2024-10-21T00:59:47Z)
- Surprising Efficacy of Fine-Tuned Transformers for Fact-Checking over Larger Language Models [1.985242455423935]
We show that fine-tuned Transformer models provide superior fact-checking performance compared to large language models.
We demonstrate the efficacy of fine-tuned models for fact-checking in a multilingual setting and on complex claims that include numerical quantities.
arXiv Detail & Related papers (2024-02-19T14:00:35Z)
- MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models [70.92847554971065]
We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities.
By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up.
Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
arXiv Detail & Related papers (2024-01-30T04:50:28Z)
- "Knowing When You Don't Know": A Multilingual Relevance Assessment Dataset for Robust Retrieval-Augmented Generation [90.09260023184932]
Retrieval-Augmented Generation (RAG) grounds Large Language Model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations.
NoMIRACL is a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages.
We measure relevance assessment using: (i) hallucination rate, the model's tendency to hallucinate an answer when no answer is present in the passages of the non-relevant subset, and (ii) error rate, the model's inaccuracy in recognizing relevant passages in the relevant subset; both metrics are sketched after this entry.
arXiv Detail & Related papers (2023-12-18T17:18:04Z)
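A minimal sketch of how the two metrics defined in the NoMIRACL entry above could be computed. Each record is assumed to carry two booleans (whether the gold annotation marks its retrieved passages as relevant, and whether the model produced an answer rather than declining); the field names are hypothetical, not the dataset's schema.

```python
# Sketch of the two NoMIRACL-style metrics described above. Each record is
# assumed to carry two booleans: has_relevant_passage (gold label for the
# retrieved passages) and model_answered (False if the model said it cannot
# answer). The field names are illustrative, not the dataset's schema.

def hallucination_rate(records) -> float:
    """Share of non-relevant-subset queries where the model still answered."""
    non_relevant = [r for r in records if not r["has_relevant_passage"]]
    hallucinated = [r for r in non_relevant if r["model_answered"]]
    return len(hallucinated) / len(non_relevant) if non_relevant else 0.0

def error_rate(records) -> float:
    """Share of relevant-subset queries where the model failed to answer."""
    relevant = [r for r in records if r["has_relevant_passage"]]
    missed = [r for r in relevant if not r["model_answered"]]
    return len(missed) / len(relevant) if relevant else 0.0

records = [
    {"has_relevant_passage": False, "model_answered": True},   # hallucination
    {"has_relevant_passage": False, "model_answered": False},  # correct refusal
    {"has_relevant_passage": True,  "model_answered": True},   # correct answer
    {"has_relevant_passage": True,  "model_answered": False},  # missed relevance
]
print(hallucination_rate(records), error_rate(records))  # 0.5 0.5
```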
- Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers [121.53749383203792]
We present a holistic end-to-end solution for annotating the factuality of large language model (LLM)-generated responses.
We construct an open-domain, document-level factuality benchmark with three levels of granularity: claim, sentence, and document.
Preliminary experiments show that FacTool, FactScore, and Perplexity struggle to identify false claims.
arXiv Detail & Related papers (2023-11-15T14:41:57Z)
- The Perils & Promises of Fact-checking with Large Language Models [55.869584426820715]
Large Language Models (LLMs) are increasingly trusted to write academic papers, lawsuits, and news articles.
We evaluate the use of LLM agents in fact-checking by having them phrase queries, retrieve contextual data, and make decisions.
Our results show that LLMs perform markedly better when equipped with contextual information.
While LLMs show promise in fact-checking, caution is essential due to inconsistent accuracy.
arXiv Detail & Related papers (2023-10-20T14:49:47Z)
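The agent workflow described in the entry above (phrase a search query, retrieve contextual evidence, then decide) can be sketched roughly as follows. The `query_llm` and `search` helpers are placeholder callables standing in for an LLM API wrapper and a retrieval backend; this is not the paper's implementation.

```python
# Rough sketch of a retrieve-then-verify fact-checking agent, as described in
# the entry above: the model phrases a search query, evidence is retrieved,
# and a verdict is produced. query_llm and search are placeholder callables,
# not the paper's implementation.

def fact_check(claim: str, query_llm, search, k: int = 3) -> str:
    # Step 1: let the model phrase a search query for the claim.
    query = query_llm(f"Write a short web search query to verify: {claim}")

    # Step 2: retrieve contextual evidence (top-k passages) for that query.
    passages = search(query, k=k)
    context = "\n".join(f"- {p}" for p in passages)

    # Step 3: decide, grounded in the retrieved context.
    verdict = query_llm(
        "Using only the evidence below, label the claim SUPPORTED, REFUTED, "
        f"or NOT ENOUGH INFO.\n\nEvidence:\n{context}\n\nClaim: {claim}\nLabel:"
    )
    return verdict.strip()
```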
- Large Language Models on Wikipedia-Style Survey Generation: an Evaluation in NLP Concepts [21.150221839202878]
Large Language Models (LLMs) have achieved significant success across various general tasks.
In this work, we examine the proficiency of LLMs in generating succinct survey articles specific to the niche field of NLP in computer science.
We compare both human and GPT-based evaluation scores and provide in-depth analysis.
arXiv Detail & Related papers (2023-08-21T01:32:45Z)
- Improving accuracy of GPT-3/4 results on biomedical data using a retrieval-augmented language model [0.0]
Large language models (LLMs) have made significant advancements in natural language processing (NLP).
Training LLMs on focused corpora poses computational challenges.
An alternative approach is to use a retrieval-augmentation (RetA) method tested in a specific domain.
OpenAI's GPT-3, GPT-4, Bing's Prometheus, and a custom RetA model were compared using 19 questions on diffuse large B-cell lymphoma (DLBCL) disease.
arXiv Detail & Related papers (2023-05-26T17:33:05Z)
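A rough sketch of the retrieval-augmentation (RetA) idea from the entry above: rather than training the model on a focused corpus, embed the corpus once and prepend the most similar passages to each question. The `embed` and `query_llm` helpers are placeholders, not the study's actual system.

```python
# Minimal sketch of retrieval-augmentation (RetA) over a domain corpus, as
# described in the entry above. embed() and query_llm() are placeholder
# callables (an embedding model and an LLM API wrapper), not the study's system.

def build_index(corpus, embed):
    """Embed each domain passage once, up front (no model retraining needed)."""
    return [(doc, embed(doc)) for doc in corpus]

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5 or 1.0
    norm_b = sum(x * x for x in b) ** 0.5 or 1.0
    return dot / (norm_a * norm_b)

def top_k(question, index, embed, k=3):
    """Return the k passages whose embeddings are most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: _cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def answer_with_reta(question, index, embed, query_llm):
    """Prepend retrieved domain context to the question and ask the LLM."""
    context = "\n".join(top_k(question, index, embed))
    return query_llm(
        f"Answer using only this domain context:\n{context}\n\nQuestion: {question}"
    )
```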
- Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)