When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation
- URL: http://arxiv.org/abs/2510.07238v1
- Date: Wed, 08 Oct 2025 17:06:07 GMT
- Title: When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation
- Authors: Xunyi Jiang, Dingyi Chang, Julian McAuley, Xin Xu
- Abstract summary: The rapid evolution of large language models (LLMs) and the real world has outpaced the static nature of widely used evaluation benchmarks. We present a systematic investigation of this issue by examining five popular factuality benchmarks and eight LLMs released across different years.
- Score: 22.392925812111354
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The rapid evolution of large language models (LLMs) and the real world has outpaced the static nature of widely used evaluation benchmarks, raising concerns about their reliability for evaluating LLM factuality. While substantial work continues to rely on these popular but aging benchmarks, their temporal misalignment with real-world facts and modern LLMs, and its effects on LLM factuality evaluation, remain underexplored. Therefore, in this work, we present a systematic investigation of this issue by examining five popular factuality benchmarks and eight LLMs released across different years. An up-to-date fact retrieval pipeline and three metrics are tailored to quantify benchmark aging and its impact on LLM factuality evaluation. Experimental results and analysis illustrate that a considerable portion of samples in the widely used factuality benchmarks are outdated, leading to unreliable assessments of LLM factuality. We hope our work can provide a testbed for assessing the reliability of a benchmark for LLM factuality evaluation and inspire more research on the benchmark aging issue. Code is available at https://github.com/JiangXunyi/BenchAge.
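The abstract names an up-to-date fact retrieval pipeline and three aging metrics without spelling them out, so the following is only a minimal illustrative sketch of the underlying idea: compare each benchmark gold answer against a freshly retrieved answer and report the fraction that no longer match. The `Sample` class, `fetch_current_answer`, and the normalization are hypothetical placeholders, not the paper's actual BenchAge metrics.

```python
# Minimal sketch (not the paper's method): estimate how "aged" a factuality
# benchmark is by checking whether each recorded gold answer still agrees with
# an up-to-date answer. `fetch_current_answer` is a hypothetical stand-in for a
# retrieval pipeline (search engine, knowledge base, etc.).
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Sample:
    question: str
    gold_answer: str  # answer recorded when the benchmark was built


def normalize(text: str) -> str:
    """Light normalization so formatting differences don't count as drift."""
    return " ".join(text.lower().split())


def outdated_rate(samples: Iterable[Sample],
                  fetch_current_answer: Callable[[str], str]) -> float:
    """Fraction of samples whose gold answer disagrees with an up-to-date answer."""
    samples = list(samples)
    if not samples:
        return 0.0
    outdated = sum(
        normalize(s.gold_answer) != normalize(fetch_current_answer(s.question))
        for s in samples
    )
    return outdated / len(samples)


if __name__ == "__main__":
    # Toy example: one gold answer has drifted since the benchmark was released.
    bench = [
        Sample("Who is the CEO of ExampleCorp?", "Alice Smith"),
        Sample("What is the capital of France?", "Paris"),
    ]
    current = {"Who is the CEO of ExampleCorp?": "Bob Jones",
               "What is the capital of France?": "Paris"}
    print(outdated_rate(bench, lambda q: current[q]))  # 0.5
```

With a real retriever plugged in for `fetch_current_answer`, a high outdated rate would flag a benchmark whose gold answers have drifted away from current facts.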
Related papers
- On Robustness and Reliability of Benchmark-Based Evaluation of LLMs [6.121856629864516]
The effectiveness of Large Language Models (LLMs) is usually evaluated by means of benchmarks such as MMLU, ARC-C, or HellaSwag. Real-world applications involve linguistic variability, requiring models to maintain their effectiveness across diverse rewordings of the same question or query. We systematically assess the robustness of LLMs to paraphrased benchmark questions and investigate whether benchmark-based evaluations provide a reliable measure of model capabilities. (A minimal illustrative sketch of such a paraphrase-consistency check appears after this list.)
arXiv Detail & Related papers (2025-09-04T08:43:27Z)
- How Much Do Large Language Models Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework [8.76693832650115]
Overestimation in evaluating large language models (LLMs) has become an increasing concern. We propose ArxivRoll, a dynamic evaluation framework inspired by one-time pad encryption in cryptography.
arXiv Detail & Related papers (2025-07-25T12:39:03Z)
- The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance? [1.3810901729134184]
Large Language Models (LLMs) excel at standardized tests while failing to demonstrate genuine language understanding and adaptability. Our systematic analysis of NLP evaluation frameworks reveals pervasive vulnerabilities across the evaluation spectrum. We lay the groundwork for new evaluation methods that resist manipulation, minimize data contamination, and assess domain-specific tasks.
arXiv Detail & Related papers (2024-12-02T20:49:21Z)
- Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis [78.07225438556203]
We introduce LLM-Oasis, the largest resource for training end-to-end factuality evaluators. It is constructed by extracting claims from Wikipedia, falsifying a subset of these claims, and generating pairs of factual and unfactual texts. We then rely on human annotators both to validate the quality of our dataset and to create a gold standard test set for factuality evaluation systems.
arXiv Detail & Related papers (2024-11-29T12:21:15Z)
- MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures [57.886592207948844]
We propose MixEval, a new paradigm for establishing efficient, gold-standard evaluation by strategically mixing off-the-shelf benchmarks.
It bridges (1) comprehensive and well-distributed real-world user queries and (2) efficient and fairly-graded ground-truth-based benchmarks, by matching queries mined from the web with similar queries from existing benchmarks.
arXiv Detail & Related papers (2024-06-03T05:47:05Z)
- DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators. How reliable these evaluators are has emerged as a crucial research question. We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- FactCHD: Benchmarking Fact-Conflicting Hallucination Detection [64.4610684475899]
FactCHD is a benchmark designed for the detection of fact-conflicting hallucinations from LLMs.
FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation.
We introduce Truth-Triangulator, which synthesizes reflective considerations from tool-enhanced ChatGPT and a LoRA-tuned Llama2.
arXiv Detail & Related papers (2023-10-18T16:27:49Z)
- Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
- FELM: Benchmarking Factuality Evaluation of Large Language Models [40.78878196872095]
We introduce a benchmark for Factuality Evaluation of large Language Models, referred to as felm.
We collect responses generated from large language models and annotate factuality labels in a fine-grained manner.
Our findings reveal that while retrieval aids factuality evaluation, current LLMs remain far from satisfactory at faithfully detecting factual errors.
arXiv Detail & Related papers (2023-10-01T17:37:31Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
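As referenced in the robustness entry above, a paraphrase-robustness check can be sketched in a few lines: score a model on each original question and on several rewordings, then report accuracy on both and the share of questions answered consistently across all rewordings. This is a hypothetical illustration, not the cited paper's code; `model_answer` and the exact-match scoring are assumptions.

```python
# Minimal sketch of a paraphrase-robustness check, assuming an accuracy-style
# benchmark: evaluate each question and its rewordings, then report (1) accuracy
# on originals, (2) accuracy on paraphrases, and (3) the share of questions whose
# answer stays identical across all rewordings. `model_answer` is a hypothetical
# stand-in for an actual LLM call; matching is deliberately simple exact match.
from typing import Callable, Dict, List


def paraphrase_robustness(
    items: List[Dict],                      # each: {"question", "paraphrases", "gold"}
    model_answer: Callable[[str], str],
) -> Dict[str, float]:
    orig_correct = para_correct = para_total = consistent = 0
    for item in items:
        gold = item["gold"].strip().lower()
        answers = [model_answer(q).strip().lower()
                   for q in [item["question"], *item["paraphrases"]]]
        orig_correct += answers[0] == gold
        para_correct += sum(a == gold for a in answers[1:])
        para_total += len(answers) - 1
        consistent += len(set(answers)) == 1   # same answer on every rewording
    n = max(len(items), 1)
    return {
        "original_accuracy": orig_correct / n,
        "paraphrase_accuracy": para_correct / max(para_total, 1),
        "consistency_rate": consistent / n,
    }
```

A large gap between original and paraphrase accuracy, or a low consistency rate, would suggest that the headline benchmark score overstates robust capability.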