Language Models Struggle to Achieve a Consistent Temporal Representation of Facts
- URL: http://arxiv.org/abs/2502.01220v2
- Date: Mon, 17 Feb 2025 13:20:37 GMT
- Title: Language Models Struggle to Achieve a Consistent Temporal Representation of Facts
- Authors: Hichem Ammar Khodja, Frédéric Béchet, Quentin Brabant, Alexis Nasr, Gwénolé Lecorvé
- Abstract summary: We introduce TimeStress, a novel dataset comprising 521K statements on 2003 of the most popular temporal facts in Wikidata. Each statement contextualizes a fact with correct and incorrect dates across three precisions (Day, Month, Year). We evaluate LMs' ability to discern between correct and incorrect temporal statements based on their probability of being generated.
- Score: 3.6921454547718784
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language Models (LMs) have shown substantial improvements in handling factual knowledge, yet their capability to consistently represent temporal facts, which are valid only within specific timeframes, remains underexplored. To investigate this, we introduce TimeStress, a novel dataset comprising 521K statements on 2003 of the most popular temporal facts in Wikidata. Each statement contextualizes a fact with correct and incorrect dates across three precisions (Day, Month, Year). This setup allows us to evaluate LMs' ability to discern between correct and incorrect temporal statements based on their probability of being generated. We assess 18 LMs across various architectures using two metrics: the win rate, indicating how often correct dates outperform incorrect ones, and robustness, reflecting consistent performance across all dates. Our findings reveal that while some LMs achieve a win rate exceeding 80%, robustness remains low, with the best model achieving only 6%. Furthermore, robust knowledge at one date precision does not reliably transfer to others, highlighting a significant generalization gap. These results underscore the struggle of LMs to maintain a consistent temporal representation, supporting their limitations as reliable sources of temporal knowledge. We provide all data and code for further research.
Related papers
- A Study into Investigating Temporal Robustness of LLMs [19.067901534284395]
Large Language Models (LLMs) encapsulate a surprising amount of factual world knowledge.
We aim to measure precisely how robust LLMs are for question answering based on their ability to process temporal information.
We design eight time-sensitive robustness tests and show how a selection of them can be used to automatically judge a model's temporal robustness for user questions on the fly.
arXiv Detail & Related papers (2025-03-21T11:56:17Z) - Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis [78.07225438556203]
We introduce LLM-Oasis, the largest resource for training end-to-end factuality evaluators.
It is constructed by extracting claims from Wikipedia, falsifying a subset of these claims, and generating pairs of factual and unfactual texts.
We then rely on human annotators to both validate the quality of our dataset and to create a gold standard test set for factuality evaluation systems.
arXiv Detail & Related papers (2024-11-29T12:21:15Z) - Learning and Unlearning of Fabricated Knowledge in Language Models [16.971082623826263]
We show that facts that conflict with common knowledge are remembered for tens of thousands of training steps.
We show that the impacts of knowledge-conflicting facts in LMs, though long-lasting, can be largely erased by a novel application of multi-step sparse updates.
arXiv Detail & Related papers (2024-10-29T05:33:14Z) - ChroKnowledge: Unveiling Chronological Knowledge of Language Models in Multiple Domains [19.428141279030527]
Large language models (LLMs) have brought significant changes to many aspects of our lives. Existing approaches fall short in addressing the temporal adaptability of knowledge. We present ChroKnowledge, a novel sampling-based framework for evaluating LLMs' non-parametric chronological knowledge.
arXiv Detail & Related papers (2024-10-13T15:08:49Z) - STBench: Assessing the Ability of Large Language Models in Spatio-Temporal Analysis [12.582867572800488]
The rapid evolution of large language models (LLMs) holds promise for reforming the methodology of spatio-temporal data mining.
This paper builds the benchmark dataset STBench, containing 13 distinct computation tasks and over 60,000 QA pairs.
Experimental results reveal that existing LLMs show remarkable performance on knowledge comprehension and spatio-temporal reasoning tasks.
arXiv Detail & Related papers (2024-06-27T10:34:02Z) - Enhanced Language Model Truthfulness with Learnable Intervention and Uncertainty Expression [19.69104070561701]
Large language models (LLMs) can generate long-form and coherent text, yet they often hallucinate facts.
We propose LITO, a Learnable Intervention method for Truthfulness Optimization.
Experiments on multiple LLMs and question-answering datasets demonstrate that LITO improves truthfulness while preserving task accuracy.
arXiv Detail & Related papers (2024-05-01T03:50:09Z) - LLMs' Reading Comprehension Is Affected by Parametric Knowledge and Struggles with Hypothetical Statements [59.71218039095155]
The task of reading comprehension (RC) provides a primary means to assess language models' natural language understanding (NLU) capabilities.
If the context aligns with the models' internal knowledge, it is hard to discern whether the models' answers stem from context comprehension or from internal information.
To address this issue, we suggest using RC on imaginary data, based on fictitious facts and entities.
arXiv Detail & Related papers (2024-04-09T13:08:56Z) - MuLan: A Study of Fact Mutability in Language Models [50.626787909759976]
Trustworthy language models ideally identify mutable facts as such and process them accordingly.
We create MuLan, a benchmark for evaluating the ability of English language models to anticipate the time-contingency of facts.
arXiv Detail & Related papers (2024-04-03T19:47:33Z) - Uncertainty Quantification for In-Context Learning of Large Language Models [52.891205009620364]
In-context learning has emerged as a groundbreaking ability of Large Language Models (LLMs).
We propose a novel formulation and corresponding estimation method to quantify both types of uncertainties.
The proposed method offers an unsupervised way to understand the prediction of in-context learning in a plug-and-play fashion.
arXiv Detail & Related papers (2024-02-15T18:46:24Z) - A Glitch in the Matrix? Locating and Detecting Language Model Grounding with Fakepedia [57.31074448586854]
Large language models (LLMs) have an impressive ability to draw on novel information supplied in their context.
Yet the mechanisms underlying this contextual grounding remain unknown.
We present a novel method to study grounding abilities using Fakepedia.
arXiv Detail & Related papers (2023-12-04T17:35:42Z) - Do Large Language Models Know about Facts? [60.501902866946]
Large language models (LLMs) have recently driven striking performance improvements across a range of natural language processing tasks.
We aim to evaluate the extent and scope of factual knowledge within LLMs by designing the benchmark Pinocchio.
Pinocchio contains 20K diverse factual questions that span different sources, timelines, domains, regions, and languages.
arXiv Detail & Related papers (2023-10-08T14:26:55Z) - Mitigating Temporal Misalignment by Discarding Outdated Facts [58.620269228776294]
Large language models are often used under temporal misalignment, tasked with answering questions about the present.
We propose fact duration prediction: the task of predicting how long a given fact will remain true.
Our data and code are released publicly at https://github.com/mikejqzhang/mitigating_misalignment.
arXiv Detail & Related papers (2023-05-24T07:30:08Z) - Can LMs Learn New Entities from Descriptions? Challenges in Propagating Injected Knowledge [72.63368052592004]
We study LMs' abilities to make inferences based on injected facts (or propagate those facts).
We find that existing methods for updating knowledge show little propagation of injected knowledge.
Yet, prepending entity definitions in an LM's context improves performance across all settings.
arXiv Detail & Related papers (2023-05-02T17:59:46Z) - The KITMUS Test: Evaluating Knowledge Integration from Multiple Sources in Natural Language Understanding Systems [87.3207729953778]
We evaluate state-of-the-art coreference resolution models on our dataset.
Several models struggle to reason on-the-fly over knowledge observed both at pretrain time and at inference time.
Still, even the best performing models seem to have difficulties with reliably integrating knowledge presented only at inference time.
arXiv Detail & Related papers (2022-12-15T23:26:54Z) - Factuality Enhanced Language Models for Open-Ended Text Generation [60.27166549575472]
We design the FactualityPrompts test set and metrics to measure the factuality of LM generations.
We find that larger LMs are more factual than smaller ones, although a previous study suggests that larger LMs can be less truthful in terms of misconceptions.
We propose a factuality-enhanced training method that uses TopicPrefix for better awareness of facts, and sentence completion as the training objective.
arXiv Detail & Related papers (2022-06-09T17:16:43Z) - The Language Model Understood the Prompt was Ambiguous: Probing Syntactic Uncertainty Through Generation [23.711953448400514]
We inspect the extent to which neural language models (LMs) exhibit uncertainty over such syntactic analyses.
We find that LMs can track multiple analyses simultaneously.
As a response to disambiguating cues, the LMs often select the correct interpretation, but occasional errors point to potential areas of improvement.
arXiv Detail & Related papers (2021-09-16T10:27:05Z) - Time-Aware Language Models as Temporal Knowledge Bases [39.00042720454899]
Language models (LMs) are trained on snapshots of data collected at a specific moment in time.
We introduce a diagnostic dataset aimed at probing LMs for factual knowledge that changes over time.
We propose a simple technique for jointly modeling text with its timestamp (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2021-06-29T06:18:57Z) - Probing Across Time: What Does RoBERTa Know and When? [70.20775905353794]
We show that linguistic knowledge is acquired fast, stably, and robustly across domains, while factual and commonsense knowledge is acquired more slowly and is more domain-sensitive.
We believe that probing-across-time analyses can help researchers understand the complex, intermingled learning that these models undergo and guide us toward more efficient approaches that accomplish necessary learning faster.
arXiv Detail & Related papers (2021-04-16T04:26:39Z)