When Facts Change: Probing LLMs on Evolving Knowledge with evolveQA
- URL: http://arxiv.org/abs/2510.19172v1
- Date: Wed, 22 Oct 2025 02:12:32 GMT
- Title: When Facts Change: Probing LLMs on Evolving Knowledge with evolveQA
- Authors: Nishanth Sridhar Nakshatri, Shamik Roy, Manoj Ghuhan Arivazhagan, Hanhan Zhou, Vinayshekhar Bannihatti Kumar, Rashmi Gangadharaiah
- Abstract summary: We introduce evolveQA, a benchmark specifically designed to evaluate LLMs on temporally evolving knowledge. Our framework identifies naturally occurring knowledge evolution and generates questions with gold answers tailored to different LLM knowledge cut-off dates. We demonstrate significant performance drops of up to 31% on evolveQA compared to static knowledge questions.
- Score: 11.701030951844222
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: LLMs often fail to handle temporal knowledge conflicts--contradictions arising when facts evolve over time within their training data. Existing studies evaluate this phenomenon through benchmarks built on structured knowledge bases like Wikidata, but they focus on widely-covered, easily-memorized popular entities and lack the dynamic structure needed to fairly evaluate LLMs with different knowledge cut-off dates. We introduce evolveQA, a benchmark specifically designed to evaluate LLMs on temporally evolving knowledge, constructed from 3 real-world, time-stamped corpora: AWS updates, Azure changes, and WHO disease outbreak reports. Our framework identifies naturally occurring knowledge evolution and generates questions with gold answers tailored to different LLM knowledge cut-off dates. Through extensive evaluation of 12 open- and closed-source LLMs across 3 knowledge probing formats, we demonstrate significant performance drops of up to 31% on evolveQA compared to static knowledge questions.
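The cut-off-tailoring step can be pictured with a short sketch. The snippet below is a minimal illustration of selecting a gold answer for an evolving fact given a model's knowledge cut-off date; the data model, field names, and quota values are our own assumptions, not the authors' released pipeline:

```python
# Minimal sketch: pick the gold answer for a temporally evolving fact as of
# a model's knowledge cut-off date. Illustrative only; evolveQA's actual
# construction pipeline is more involved (and the values below are invented).
from dataclasses import dataclass
from datetime import date

@dataclass
class FactVersion:
    value: str       # the fact as stated in a corpus document
    effective: date  # timestamp of the document asserting it

def gold_answer(versions: list[FactVersion], cutoff: date) -> str | None:
    """Return the latest fact version effective on or before the cut-off."""
    known = [v for v in versions if v.effective <= cutoff]
    return max(known, key=lambda v: v.effective).value if known else None

# E.g., a service limit that changed between two documentation updates.
versions = [FactVersion("100 requests/s", date(2022, 3, 1)),
            FactVersion("500 requests/s", date(2024, 6, 15))]
print(gold_answer(versions, date(2023, 10, 1)))  # -> 100 requests/s
print(gold_answer(versions, date(2025, 1, 1)))   # -> 500 requests/s
```

A model whose training data ends in 2023 is graded against the first value, while a 2025-cut-off model is graded against the second; this is what lets the benchmark stay fair across models with different cut-offs.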
Related papers
- SPARQL Query Generation with LLMs: Measuring the Impact of Training Data Memorization and Knowledge Injection [81.78173888579941]
Large Language Models (LLMs) are considered well suited to improving the quality of question-answering functionality. LLMs are trained on web data, where researchers have no control over whether the benchmark or the knowledge graph was already included in the training data. This paper introduces a novel method that evaluates the quality of LLMs by generating a SPARQL query from a natural-language question.
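The core loop of such an evaluation can be sketched briefly. Everything here is our own illustration: the prompt wording, the `generate` stub standing in for a real LLM call, and the scoring rule are not taken from the paper:

```python
# Hedged sketch: ask an LLM to translate a question into SPARQL, then score
# the generated query by comparing its result set to the gold query's.
PROMPT = "Translate the question into a SPARQL query over Wikidata.\nQuestion: {q}\nSPARQL:"

def generate(prompt: str) -> str:
    # Stand-in for a real LLM call; a fixed reply keeps the sketch runnable.
    return "SELECT ?capital WHERE { wd:Q183 wdt:P36 ?capital . }"

def exact_set_match(predicted_rows: list, gold_rows: list) -> bool:
    """Execution-based scoring: the queries agree if their results agree."""
    return set(predicted_rows) == set(gold_rows)

query = generate(PROMPT.format(q="What is the capital of Germany?"))
print(query)                                    # the generated SPARQL
print(exact_set_match([("Q64",)], [("Q64",)]))  # -> True (Q64 = Berlin)
```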
arXiv Detail & Related papers (2025-07-18T12:28:08Z)
- Question Answering under Temporal Conflict: Evaluating and Organizing Evolving Knowledge with LLMs [0.0]
Large language models (LLMs) exhibit remarkable capabilities in question answering and reasoning, but updating their parametric knowledge typically requires costly and brittle re-training. We propose a lightweight, agentic framework that incrementally builds a structured, external memory from source documents without requiring re-training.
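One plausible shape for such a memory, sketched under our own assumptions (the class name, keying scheme, and conflict rule are ours, not the authors' design):

```python
# Sketch of a structured external memory: (entity, attribute) keys map to
# timestamped values gathered incrementally from documents; at query time
# the most recently observed value wins, resolving temporal conflicts.
from datetime import date

class TemporalMemory:
    def __init__(self) -> None:
        self._store: dict[tuple[str, str], list[tuple[date, str]]] = {}

    def update(self, entity: str, attribute: str, value: str, seen: date) -> None:
        """Record a fact observed in a source document dated `seen`."""
        self._store.setdefault((entity, attribute), []).append((seen, value))

    def query(self, entity: str, attribute: str) -> str | None:
        """Return the newest observed value, or None if nothing is stored."""
        history = self._store.get((entity, attribute))
        return max(history)[1] if history else None

mem = TemporalMemory()
mem.update("ServiceX", "max_instances", "20", date(2023, 1, 5))
mem.update("ServiceX", "max_instances", "50", date(2024, 9, 2))
print(mem.query("ServiceX", "max_instances"))  # -> 50
```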
arXiv Detail & Related papers (2025-06-08T20:13:33Z)
- EvoWiki: Evaluating LLMs on Evolving Knowledge [72.92365627254063]
EvoWiki is an evolving dataset designed to reflect knowledge evolution by categorizing information into stable, evolved, and uncharted states. Our results indicate that current models often struggle with evolved knowledge, frequently providing outdated or incorrect responses. EvoWiki provides a robust benchmark for advancing future research on the knowledge evolution capabilities of large language models.
arXiv Detail & Related papers (2024-12-18T08:04:57Z)
- ChroKnowledge: Unveiling Chronological Knowledge of Language Models in Multiple Domains [19.428141279030527]
ChroKnowBench is a benchmark dataset designed to evaluate chronologically accumulated knowledge. ChroKnowledge is a novel sampling-based framework for evaluating LLMs' non-parametric chronological knowledge. ChroKnowPrompt is an in-depth prompting technique that elicits chronological knowledge by traversing step by step through the surrounding time spans.
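Our reading of the traversal idea, as a hedged sketch (the prompt template and span width are illustrative, not ChroKnowPrompt's actual procedure):

```python
# Sketch: probe the years surrounding a target year one by one, so answers
# confirmed for nearby years can help elicit the target-year fact.
def year_probe_prompts(subject: str, relation: str, target_year: int, span: int = 2):
    """Yield one completion-style prompt per year in the surrounding span."""
    for year in range(target_year - span, target_year + span + 1):
        yield f"In {year}, the {relation} of {subject} was:"

for prompt in year_probe_prompts("Brazil", "head of state", 2019):
    print(prompt)
```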
arXiv Detail & Related papers (2024-10-13T15:08:49Z)
- Prompting Large Language Models with Knowledge Graphs for Question Answering Involving Long-tail Facts [50.06633829833144]
Large Language Models (LLMs) are effective in performing various NLP tasks, but struggle to handle tasks that require extensive, real-world knowledge.
We propose a benchmark whose questions require knowledge of long-tail facts to answer.
Our experiments show that LLMs alone struggle with answering these questions, especially when the long-tail level is high or rich knowledge is required.
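A minimal sketch of KG-augmented prompting for such long-tail questions; the triples, serialization format, and prompt wording are our illustrations:

```python
# Sketch: serialize retrieved knowledge-graph triples into the prompt so the
# model can answer from provided facts rather than memorized ones.
def kg_prompt(triples: list[tuple[str, str, str]], question: str) -> str:
    facts = "\n".join(f"- {s} {p} {o}." for s, p, o in triples)
    return f"Known facts:\n{facts}\n\nQuestion: {question}\nAnswer:"

# A long-tail entity few models will have memorized reliably.
triples = [("Aldabra rail", "endemic to", "Aldabra Atoll"),
           ("Aldabra rail", "flight ability", "flightless")]
print(kg_prompt(triples, "Can the Aldabra rail fly?"))
```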
arXiv Detail & Related papers (2024-05-10T15:10:20Z)
- DyKnow: Dynamically Verifying Time-Sensitive Factual Knowledge in LLMs [1.7764955091415962]
We present an approach to dynamically evaluate the knowledge in LLMs and its time-sensitivity against Wikidata.
We evaluate the time-sensitive knowledge in twenty-four private and open-source LLMs, as well as the effectiveness of four editing methods in updating the outdated facts.
Our results show that 1) outdatedness is a critical problem across state-of-the-art LLMs; 2) LLMs output inconsistent answers when prompted with slight variations of the question prompt; and 3) the performance of the state-of-the-art knowledge editing algorithms is very limited.
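The verification loop can be sketched against Wikidata's public SPARQL endpoint. This is a hedged illustration, not DyKnow's actual protocol: the query shape, the string-containment check, and the example model answer are our assumptions:

```python
# Sketch: fetch the current value of a time-sensitive Wikidata fact and flag
# an LLM answer that no longer matches it. Requires network access.
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def current_value(entity_qid: str, property_pid: str) -> str:
    """Return the English label of the entity's current value for a property."""
    query = f"""
    SELECT ?valueLabel WHERE {{
      wd:{entity_qid} wdt:{property_pid} ?value .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }} LIMIT 1"""
    resp = requests.get(WIKIDATA_SPARQL,
                        params={"query": query, "format": "json"},
                        headers={"User-Agent": "dyknow-sketch/0.1"})
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    return rows[0]["valueLabel"]["value"] if rows else ""

# Q145 = United Kingdom, P6 = head of government; an outdated model answer
# will not contain the knowledge base's current value.
kb_answer = current_value("Q145", "P6")
llm_answer = "Boris Johnson"  # stand-in for a model response
print("outdated" if kb_answer.lower() not in llm_answer.lower() else "current")
```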
arXiv Detail & Related papers (2024-04-10T18:08:59Z)
- KnowTuning: Knowledge-aware Fine-tuning for Large Language Models [83.5849717262019]
We propose a knowledge-aware fine-tuning (KnowTuning) method to improve fine-grained and coarse-grained knowledge awareness of LLMs.
KnowTuning generates more facts with a lower factual error rate under fine-grained fact evaluation.
arXiv Detail & Related papers (2024-02-17T02:54:32Z)
- Don't Hallucinate, Abstain: Identifying LLM Knowledge Gaps via Multi-LLM Collaboration [39.603649838876294]
We study approaches to identify LLM knowledge gaps and abstain from answering questions when knowledge gaps are present.
Motivated by the failures of self-reflection and the over-reliance of existing methods on held-out sets, we propose two novel approaches.
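One plausible instantiation of collaboration-based abstention, sketched under our own assumptions (the majority-vote rule and threshold are ours; the paper's two approaches may differ):

```python
# Sketch: treat disagreement among several LLMs' answers as a knowledge-gap
# signal and abstain; otherwise return the majority answer.
from collections import Counter

def answer_or_abstain(answers: list[str], min_agreement: float = 0.5) -> str:
    """Return the majority answer if agreement clears the threshold."""
    normalized = [a.strip().lower() for a in answers]
    best, count = Counter(normalized).most_common(1)[0]
    return best if count / len(normalized) >= min_agreement else "ABSTAIN"

print(answer_or_abstain(["Paris", "paris", "Lyon"]))  # -> paris
print(answer_or_abstain(["Paris", "Lyon", "Nice"]))   # -> ABSTAIN
```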
arXiv Detail & Related papers (2024-02-01T06:11:49Z)
- A Comprehensive Study of Knowledge Editing for Large Language Models [82.65729336401027]
Large Language Models (LLMs) have shown extraordinary capabilities in understanding and generating text that closely mirrors human communication.
This paper defines the knowledge editing problem and provides a comprehensive review of cutting-edge approaches.
We introduce a new benchmark, KnowEdit, for a comprehensive empirical evaluation of representative knowledge editing approaches.
arXiv Detail & Related papers (2024-01-02T16:54:58Z)
- RECALL: A Benchmark for LLMs Robustness against External Counterfactual Knowledge [69.79676144482792]
This study aims to evaluate the ability of LLMs to distinguish reliable information from external knowledge.
Our benchmark consists of two tasks, Question Answering and Text Generation, and for each task, we provide models with a context containing counterfactual information.
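The setup is easy to picture with a sketch; the prompt wording and the Eiffel Tower example are our illustrations, not items from the benchmark:

```python
# Sketch: pose a question with a context containing a counterfactual edit,
# to test whether the model follows the context or its parametric knowledge.
def build_counterfactual_prompt(context: str, question: str) -> str:
    return ("Answer the question using only the context below.\n\n"
            f"Context: {context}\n"
            f"Question: {question}\n"
            "Answer:")

# The context contradicts widely memorized training data (true year: 1889).
ctx = "The Eiffel Tower was completed in 1920 for the Paris Exposition."
print(build_counterfactual_prompt(ctx, "When was the Eiffel Tower completed?"))
# Scoring then checks whether the model answers 1920 (context-faithful)
# or 1889 (parametric).
```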
arXiv Detail & Related papers (2023-11-14T13:24:19Z)
- DocTER: Evaluating Document-based Knowledge Editing [53.14000724633775]
We explore knowledge editing using easily accessible documents instead of manually labeled factual triples. A comprehensive four-perspective evaluation is introduced: Edit Success, Locality, Reasoning, and Cross-lingual Transfer. Experiments on popular knowledge editing methods demonstrate that editing with documents presents significantly greater challenges than using triples.
arXiv Detail & Related papers (2023-08-19T09:17:19Z)