Language Models Struggle to Achieve a Consistent Temporal Representation of Facts
- URL: http://arxiv.org/abs/2502.01220v2
- Date: Mon, 17 Feb 2025 13:20:37 GMT
- Title: Language Models Struggle to Achieve a Consistent Temporal Representation of Facts
- Authors: Hichem Ammar Khodja, Frédéric Béchet, Quentin Brabant, Alexis Nasr, Gwénolé Lecorvé,
- Abstract summary: We introduce TimeStress, a novel dataset comprising 521K statements on 2003 of the most popular temporal facts in Wikidata.
Each statement contextualizes a fact with correct and incorrect dates across three precisions (Day, Month, Year)
We evaluate LMs' ability to discern between correct and incorrect temporal statements based on their probability of being generated.
- Score: 3.6921454547718784
- License:
- Abstract: Language Models (LMs) have shown substantial improvements in handling factual knowledge, yet their capability to consistently represent temporal facts, which are valid only within specific timeframes, remains underexplored. To investigate this, we introduce TimeStress, a novel dataset comprising 521K statements on 2003 of the most popular temporal facts in Wikidata. Each statement contextualizes a fact with correct and incorrect dates across three precisions (Day, Month, Year). This setup allows us to evaluate LMs' ability to discern between correct and incorrect temporal statements based on their probability of being generated. We assess 18 LMs across various architectures using two metrics: the win rate, indicating how often correct dates outperform incorrect ones, and robustness, reflecting consistent performance across all dates. Our findings reveal that while some LMs achieve a win rate exceeding 80\%, robustness remains low, with the best model achieving only 6\%. Furthermore, robust knowledge at one date precision does not reliably transfer to others, highlighting a significant generalization gap. These results underscore the struggle of LMs to maintain a consistent temporal representation, supporting their limitations as reliable sources of temporal knowledge. We provide all data and code for further research.
Related papers
- LLMs as Repositories of Factual Knowledge: Limitations and Solutions [1.7764955091415962]
We study the appropriateness of Large Language Models (LLMs) as repositories of factual knowledge.
We evaluate their reliability in responding to time-sensitive factual questions.
We propose "ENtity-Aware Fine-tuning" (ENAF) to improve the model's performance.
arXiv Detail & Related papers (2025-01-22T10:16:53Z) - ChronoSense: Exploring Temporal Understanding in Large Language Models with Time Intervals of Events [0.20132569095596248]
We present ChronoSense, a new benchmark for evaluating Large Language Models' temporal understanding.
We assess the performance of seven recent LLMs using this benchmark and the results indicate that models handle Allen relations, even symmetrical ones, quite differently.
Overall, the models' low performance highlights the need for improved temporal understanding in LLMs.
arXiv Detail & Related papers (2025-01-06T14:27:41Z) - Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.
We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.
We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z) - FactAlign: Long-form Factuality Alignment of Large Language Models [35.067998820937284]
Large language models have demonstrated significant potential as the next-generation information access engines.
We propose FactAlign, a novel alignment framework designed to enhance the factuality of long-form responses.
Our experiments on open-domain prompts and information-seeking questions demonstrate that FactAlign significantly improves the factual accuracy of LLM responses.
arXiv Detail & Related papers (2024-10-02T16:03:13Z) - MuLan: A Study of Fact Mutability in Language Models [50.626787909759976]
Trustworthy language models ideally identify mutable facts as such and process them accordingly.
We create MuLan, a benchmark for evaluating the ability of English language models to anticipate time-contingency.
arXiv Detail & Related papers (2024-04-03T19:47:33Z) - BaRDa: A Belief and Reasoning Dataset that Separates Factual Accuracy and Reasoning Ability [35.743903178120895]
BaRDa dataset contains 3000 entailments (1787 valid, 1213 invalid)
We find factual accuracy (truth) scores of 74.1/80.6/82.6/87.1 and reasoning accuracy scores of 63.1/78.0/71.8/79.2.
This shows the clear progression of models towards improved factual accuracy and entailment reasoning.
arXiv Detail & Related papers (2023-12-12T18:55:43Z) - Mitigating Temporal Misalignment by Discarding Outdated Facts [58.620269228776294]
Large language models are often used under temporal misalignment, tasked with answering questions about the present.
We propose fact duration prediction: the task of predicting how long a given fact will remain true.
Our data and code are released publicly at https://github.com/mikejqzhang/mitigating_misalignment.
arXiv Detail & Related papers (2023-05-24T07:30:08Z) - Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [49.15931834209624]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
By further elaborating the robustness metric, a model is judged to be robust if its performance is consistently accurate on the overall cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z) - Zero-shot Faithful Factual Error Correction [53.121642212060536]
Faithfully correcting factual errors is critical for maintaining the integrity of textual knowledge bases and preventing hallucinations in sequence-to-sequence models.
We present a zero-shot framework that formulates questions about input claims, looks for correct answers in the given evidence, and assesses the faithfulness of each correction based on its consistency with the evidence.
arXiv Detail & Related papers (2023-05-13T18:55:20Z) - Dynamic Benchmarking of Masked Language Models on Temporal Concept Drift
with Multiple Views [24.470873436741073]
We benchmark $11$ pretrained language models (MLMs) on a series of tests designed to evaluate effect of temporal concept drift.
Specifically, we provide a holistic framework that dynamically creates temporal test sets of any time.
arXiv Detail & Related papers (2023-02-23T19:24:55Z) - A Dataset for Answering Time-Sensitive Questions [88.95075983560331]
Time is an important dimension in our physical world. Lots of facts can evolve with respect to time.
It is important to consider the time dimension and empower the existing QA models to reason over time.
The existing QA datasets contain rather few time-sensitive questions, hence not suitable for diagnosing or benchmarking the model's temporal reasoning capability.
arXiv Detail & Related papers (2021-08-13T16:42:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.