Preserving Historical Truth: Detecting Historical Revisionism in Large Language Models
- URL: http://arxiv.org/abs/2602.17433v2
- Date: Sun, 22 Feb 2026 18:42:06 GMT
- Title: Preserving Historical Truth: Detecting Historical Revisionism in Large Language Models
- Authors: Francesco Ortu, Joeun Yook, Punya Syon Pandey, Keenan Samway, Bernhard Schölkopf, Alberto Cazzaniga, Rada Mihalcea, Zhijing Jin
- Abstract summary: We introduce HistoricalMisinfo, a curated dataset of $500$ contested events from $45$ countries. To approximate real-world usage, we instantiate each event in $11$ prompt scenarios that reflect common communication settings.
- Score: 66.75310318710073
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) are increasingly used as sources of historical information, motivating the need for scalable audits on contested events and politically charged narratives in settings that mirror real user interactions. We introduce \texttt{HistoricalMisinfo}, a curated dataset of $500$ contested events from $45$ countries, each paired with a factual reference narrative and a documented revisionist reference narrative. To approximate real-world usage, we instantiate each event in $11$ prompt scenarios that reflect common communication settings (e.g., questions, textbooks, social posts, policy briefs). Using an LLM-as-a-judge protocol that compares model outputs to the two references, we evaluate LLMs spanning a range of architectures under two conditions: (i) neutral user prompts that ask for factually accurate information, and (ii) robustness prompts in which the user explicitly requests the revisionist version of the event. Under neutral prompts, models are generally closer to the factual references, though the resulting scores should be interpreted as reference-alignment signals rather than definitive evidence of human-interpretable revisionism. Robustness prompting yields a strong and consistent effect: when the user requests the revisionist narrative, all evaluated models show sharply higher revisionism scores, indicating limited resistance or self-correction. HistoricalMisinfo provides a practical foundation for benchmarking robustness to revisionist framing and for guiding future work on more precise automatic evaluation of contested historical claims, supporting the sustainable integration of AI systems into society. Our code is available at https://github.com/francescortu/PreservingHistoricalTruth
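To make the evaluation protocol concrete, here is a minimal sketch of a two-reference, two-condition judging loop in the spirit the abstract describes; `query_model` and `query_judge` are hypothetical stand-ins for LLM API calls, and the prompt templates are illustrative rather than the authors' own.

```python
# Sketch of a two-reference LLM-as-a-judge protocol (assumptions:
# `query_model`/`query_judge` are user-supplied LLM callables, and
# the judge replies with a bare number in [0, 1]).
from dataclasses import dataclass

@dataclass
class Event:
    name: str
    factual_ref: str       # documented factual narrative
    revisionist_ref: str   # documented revisionist narrative

NEUTRAL = "Give a factually accurate account of: {event}"
ROBUSTNESS = "Describe {event} the way its revisionist retelling does."

JUDGE = (
    "Reference A (factual): {fact}\n"
    "Reference B (revisionist): {rev}\n"
    "Candidate answer: {answer}\n"
    "On a 0-1 scale, how closely does the candidate follow Reference B?"
)

def revisionism_score(event, condition, query_model, query_judge):
    """Generate an answer under one condition and judge its proximity
    to the revisionist reference narrative."""
    template = NEUTRAL if condition == "neutral" else ROBUSTNESS
    answer = query_model(template.format(event=event.name))
    verdict = query_judge(JUDGE.format(
        fact=event.factual_ref, rev=event.revisionist_ref, answer=answer))
    return float(verdict)
```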
Related papers
- VAL-Bench: Measuring Value Alignment in Language Models [10.745372809345412]
Large language models (LLMs) are increasingly used for tasks where outputs shape human decisions. Existing benchmarks mostly track refusals or predefined safety violations but do not reveal whether a model upholds a coherent value system. We introduce the Value ALignment Benchmark (VAL-Bench), which evaluates whether models maintain a stable value stance across paired prompts that frame opposing sides of public debates.
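A minimal sketch of the paired-framing consistency check this benchmark describes; `ask` and `judge` are hypothetical LLM callables, and the framings are illustrative examples rather than items from the benchmark.

```python
# Probe the same debate from opposite framings and let a judge model
# decide whether the two answers express one coherent stance.
def stance_consistent(topic: str, ask, judge) -> bool:
    pro = ask(f"Supporters argue that {topic}. Write a short position on this.")
    con = ask(f"Critics argue against {topic}. Write a short position on this.")
    verdict = judge(
        "Do these two answers take a consistent stance on the same issue? "
        f"Answer yes or no.\nAnswer 1: {pro}\nAnswer 2: {con}")
    return verdict.strip().lower().startswith("yes")
```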
arXiv Detail & Related papers (2025-10-06T23:55:48Z) - Leveraging Large Language Models for Predictive Analysis of Human Misery [1.2458057399345226]
This study investigates the use of Large Language Models (LLMs) for predicting human-perceived misery scores. We evaluate multiple prompting strategies, including zero-shot, fixed-context few-shot, and retrieval-based prompting. To move beyond static evaluation, we introduce the "Misery Game Show", a novel gamified framework inspired by a television format.
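The three prompting regimes compared here can be sketched as follows; the `llm` callable is assumed, the templates are illustrative, and the word-overlap retrieval is a deliberately naive stand-in for an embedding-based retriever.

```python
# Sketch of zero-shot, few-shot, and retrieval-based prompt builders
# for a 0-100 misery rating task (templates are assumptions).
def zero_shot(situation: str) -> str:
    return f"Rate the misery of this situation from 0 to 100:\n{situation}"

def few_shot(situation: str, examples: list[tuple[str, int]]) -> str:
    shots = "\n".join(f"Situation: {s}\nMisery: {m}" for s, m in examples)
    return f"{shots}\nSituation: {situation}\nMisery:"

def retrieval_prompt(situation: str, corpus: list[tuple[str, int]], k=3):
    # Naive retrieval: rank labelled examples by word overlap with the
    # query and reuse the top-k as in-context examples.
    def overlap(s): return len(set(s.split()) & set(situation.split()))
    nearest = sorted(corpus, key=lambda e: overlap(e[0]), reverse=True)[:k]
    return few_shot(situation, nearest)
```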
arXiv Detail & Related papers (2025-08-18T07:02:59Z) - AssertBench: A Benchmark for Evaluating Self-Assertion in Large Language Models [12.515874333424929]
AssertBench addresses how directional framing of factually true statements influences model agreement. We construct two framing prompts: one where the user claims the statement is factually correct, and another where the user claims it is incorrect. We then record the model's agreement and reasoning.
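A compact sketch of this dual-framing probe; `ask` is a hypothetical LLM call, the wording of the frames is illustrative, and the keyword check is a crude stand-in for the benchmark's agreement scoring.

```python
# Present the same true statement under opposite user framings and
# record whether the model agrees in each case (heuristic scoring).
def framing_probe(statement: str, ask) -> dict:
    agree_frame = ask(f"I know for a fact this is true: '{statement}'. Right?")
    deny_frame = ask(f"I know for a fact this is false: '{statement}'. Right?")
    return {
        "agrees_when_affirmed": "yes" in agree_frame.lower()[:40],
        "agrees_when_denied": "yes" in deny_frame.lower()[:40],
    }
```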
arXiv Detail & Related papers (2025-06-08T14:08:22Z) - TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models [16.857263524133284]
Large Language Models (LLMs) are increasingly integrated into real-world, autonomous applications. Relying on static, pre-annotated references for evaluation poses significant challenges in cost, scalability, and completeness. We propose Tool-Augmented LLM Evaluation (TALE), a framework to assess LLM outputs without predetermined ground-truth answers.
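A rough sketch of what tool-augmented, reference-free scoring could look like; `search` and `judge` are hypothetical stand-ins for an evidence-gathering tool and a judge model, not the authors' actual API.

```python
# Reference-free evaluation: fetch fresh evidence with a tool instead
# of relying on a pre-annotated gold answer, then judge against it.
def reference_free_eval(question: str, answer: str, search, judge) -> float:
    evidence = search(question)
    verdict = judge(
        "Given the evidence below, score the answer's correctness "
        f"from 0 to 1.\nEvidence: {evidence}\n"
        f"Question: {question}\nAnswer: {answer}")
    return float(verdict)  # assumes the judge replies with a bare number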
arXiv Detail & Related papers (2025-04-10T02:08:41Z) - Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings [36.449658676568234]
The large language model (LLM)-as-judge paradigm has been used to meet the demand for cheap, reliable, and fast evaluation of model outputs. We propose ContextualJudgeBench, a judge benchmark with 2,000 challenging response pairs across eight splits inspired by real-world contextual evaluation scenarios. Our comprehensive study reveals that contextual information and its assessment criteria present a significant challenge to even state-of-the-art models.
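The contextual pairwise setup can be sketched as below; `judge` is a hypothetical LLM call and the prompt is illustrative, the point being that the judge must prefer a response relative to the supplied context, not in isolation.

```python
# Contextual pairwise judging: pick the better of two responses
# given only the provided context (prompt wording is an assumption).
def pairwise_contextual_judge(context, question, resp_a, resp_b, judge) -> str:
    verdict = judge(
        "Judge which response better answers the question using ONLY "
        f"the context below.\nContext: {context}\nQuestion: {question}\n"
        f"Response A: {resp_a}\nResponse B: {resp_b}\n"
        "Reply with exactly 'A' or 'B'.")
    return verdict.strip()[:1].upper()  # 'A' or 'B'
```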
arXiv Detail & Related papers (2025-03-19T18:09:19Z) - Analysing Zero-Shot Readability-Controlled Sentence Simplification [54.09069745799918]
We investigate how different types of contextual information affect a model's ability to generate sentences with the desired readability. Results show that all tested models struggle to simplify sentences due to model limitations and the characteristics of the source sentences. Our experiments also highlight the need for better automatic evaluation metrics tailored to readability-controlled text simplification (RCTS).
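For reference, readability control is typically measured with grade-level formulas such as Flesch-Kincaid; the sketch below uses a crude vowel-group syllable heuristic, whereas a real evaluation would use a tested implementation.

```python
# Flesch-Kincaid grade estimate with a heuristic syllable counter.
import re

def syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    sents = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n = max(1, len(words))
    syls = sum(syllables(w) for w in words)
    # Standard FKGL formula: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    return 0.39 * n / sents + 11.8 * syls / n - 15.59

print(fk_grade("The cat sat on the mat. It was happy."))
```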
arXiv Detail & Related papers (2024-09-30T12:36:25Z) - FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows" [74.7488607599921]
FaithEval is a benchmark to evaluate the faithfulness of large language models (LLMs) in contextual scenarios. FaithEval comprises 4.9K high-quality problems in total, validated through a rigorous four-stage context construction and validation framework. Our study reveals that even state-of-the-art models often struggle to remain faithful to the given context, and that larger models do not necessarily exhibit improved faithfulness.
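A toy sketch of the counterfactual faithfulness probe the title hints at: the context deliberately contradicts world knowledge, and a faithful model should answer from the context anyway. `ask` is a hypothetical LLM call and the example is illustrative, not a benchmark item.

```python
# Counterfactual faithfulness probe (keyword check is a heuristic).
def faithfulness_probe(ask) -> bool:
    context = "According to this document, the Moon is made of marshmallows."
    answer = ask(
        f"Answer using only the context.\nContext: {context}\n"
        "Question: What is the Moon made of?")
    return "marshmallow" in answer.lower()  # True = stayed faithful
```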
arXiv Detail & Related papers (2024-09-30T06:27:53Z) - Belief Revision: The Adaptability of Large Language Models Reasoning [63.0281286287648]
We introduce Belief-R, a new dataset designed to test LMs' belief revision ability when presented with new evidence.
Inspired by how humans suppress prior inferences, this task assesses LMs within the newly proposed delta reasoning framework.
We evaluate $\sim$30 LMs across diverse prompting strategies and find that LMs generally struggle to appropriately revise their beliefs in response to new information.
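A minimal sketch of a belief-revision probe in this delta-reasoning spirit: elicit an inference, then add defeating evidence and check whether the model retracts. `ask` is a hypothetical LLM call and the scenario is an illustrative example, not a Belief-R item.

```python
# Two-turn belief-revision probe: a defeater should flip the answer.
def belief_revision_probe(ask) -> bool:
    premise = "If the streets are wet, it usually rained."
    observation = "The streets are wet."
    first = ask(f"{premise} {observation} Did it rain? Answer yes or no.")
    defeater = "A street-cleaning truck just passed, spraying water."
    second = ask(f"{premise} {observation} {defeater} "
                 "Did it necessarily rain? Answer yes or no.")
    # Appropriate revision: the initial 'yes' should weaken to 'no'.
    return first.lower().startswith("yes") and second.lower().startswith("no")
```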
arXiv Detail & Related papers (2024-06-28T09:09:36Z) - WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia [59.96425443250666]
Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs).
In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions based on contradictory passages from Wikipedia.
We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages.
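The single- versus contradictory-passage setup can be sketched as follows; `ask` is a hypothetical LLM call, the passages are illustrative, and the keyword scan is a crude stand-in for the paper's answer analysis.

```python
# Ask the same question with one passage, then with two passages that
# disagree; an honest answer should surface the conflict.
def contradiction_probe(question, passage_a, passage_b, ask):
    single = ask(f"Context: {passage_a}\nQuestion: {question}")
    double = ask(f"Context 1: {passage_a}\nContext 2: {passage_b}\n"
                 f"Question: {question}")
    flags_conflict = any(w in double.lower()
                         for w in ("contradict", "conflict", "disagree"))
    return {"single": single, "double": double,
            "flags_conflict": flags_conflict}
```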
arXiv Detail & Related papers (2024-06-19T20:13:42Z) - Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z) - Towards Personalized Review Summarization by Modeling Historical Reviews from Customer and Product Separately [59.61932899841944]
Review summarization is a non-trivial task that aims to summarize the main idea of product reviews on e-commerce websites.
We propose the Heterogeneous Historical Review aware Review Summarization Model (HHRRS).
We employ a multi-task framework that conducts review sentiment classification and summarization jointly.
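The joint objective implied by such a multi-task framework can be sketched as a weighted sum of the two task losses; the tensor shapes and the 0.5 weight below are assumptions, not the paper's configuration.

```python
# Joint summarization + sentiment objective (PyTorch sketch).
import torch.nn.functional as F

def multitask_loss(summary_logits, summary_targets,
                   sentiment_logits, sentiment_targets, lam=0.5):
    # Token-level loss for the summary decoder: (B, T, V) logits vs (B, T) ids.
    summ = F.cross_entropy(summary_logits.flatten(0, 1),
                           summary_targets.flatten())
    # Sequence-level loss for sentiment classification: (B, C) vs (B,).
    sent = F.cross_entropy(sentiment_logits, sentiment_targets)
    return summ + lam * sent
```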
arXiv Detail & Related papers (2023-01-27T12:32:55Z) - $Q^{2}$: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering [38.951535576102906]
We propose an automatic evaluation metric for factual consistency in knowledge-grounded dialogue models.
Our metric makes use of co-reference resolution and natural language inference capabilities.
We curate a novel dataset of state-of-the-art dialogue system outputs for the Wizard-of-Wikipedia dataset.
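The metric's recipe can be sketched as below: generate questions from the response, answer each from both the response and the grounding knowledge, and count agreement. `gen_questions`, `answer`, and `nli_entails` are hypothetical stand-ins for the QG, QA, and NLI components.

```python
# Q^2-style consistency scoring over generated questions.
def q2_score(response: str, knowledge: str,
             gen_questions, answer, nli_entails) -> float:
    questions = gen_questions(response)
    if not questions:
        return 0.0
    consistent = 0
    for q in questions:
        a_resp = answer(q, response)   # answer grounded in the response
        a_know = answer(q, knowledge)  # answer grounded in the knowledge
        # Consistent if the knowledge-based answer matches or entails
        # the response-based answer.
        if a_resp == a_know or nli_entails(a_know, a_resp):
            consistent += 1
    return consistent / len(questions)
```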
arXiv Detail & Related papers (2021-04-16T16:21:16Z)