Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs
- URL: http://arxiv.org/abs/2510.02340v2
- Date: Wed, 15 Oct 2025 00:27:58 GMT
- Title: Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs
- Authors: Xin Gao, Ruiyi Zhang, Daniel Du, Saurabh Mahindre, Sai Ashish Somayajula, Pengtao Xie
- Abstract summary: Large Language Models (LLMs) are widely used for temporal prediction, but their reliance on pretraining data raises contamination concerns. We investigate the capability of prompting to simulate an earlier knowledge cutoff in LLMs. Results demonstrate that while prompt-based simulated knowledge cutoffs are effective when the model is directly queried about information after the cutoff date, they struggle to induce forgetting when the forgotten content is not directly asked about but is causally related to the query.
- Score: 31.64130018833542
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are widely used for temporal prediction, but their reliance on pretraining data raises contamination concerns, as accurate predictions on pre-cutoff test data may reflect memorization rather than reasoning, leading to an overestimation of their generalization capability. With the recent emergence of prompting-based unlearning techniques, a natural question arises: Can LLMs be prompted to simulate an earlier knowledge cutoff? In this work, we investigate the capability of prompting to simulate an earlier knowledge cutoff in LLMs. We construct three evaluation datasets to assess the extent to which LLMs can forget (1) direct factual knowledge, (2) semantic shifts, and (3) causally related knowledge. Results demonstrate that while prompt-based simulated knowledge cutoffs are effective when the model is directly queried about information after the cutoff date, they struggle to induce forgetting when the forgotten content is not directly asked about but is causally related to the query. These findings highlight the need for more rigorous evaluation settings when applying LLMs for temporal prediction tasks. The full dataset and evaluation code are available at https://github.com/gxx27/time_unlearn.
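To make the setup concrete, here is a minimal sketch of what a prompted knowledge cutoff looks like in practice. It assumes an OpenAI-compatible chat client (openai>=1.0 with OPENAI_API_KEY set); the system-prompt wording, the model name, and the `query_with_cutoff` helper are illustrative choices, not the paper's exact template.

```python
# Minimal sketch of a prompted knowledge cutoff. The system-prompt wording,
# the model name, and this helper are illustrative, not the paper's exact
# setup. Assumes the openai>=1.0 client with OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def query_with_cutoff(question: str, cutoff: str, model: str = "gpt-4o") -> str:
    """Ask `question` while instructing the model to simulate a knowledge cutoff."""
    system = (
        f"Assume your training data ends on {cutoff}. "
        f"Answer using only knowledge available before {cutoff}; "
        "if the answer depends on later events, say you do not know."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

# A direct post-cutoff query (where the paper finds prompted cutoffs work):
print(query_with_cutoff("Who won the 2022 FIFA World Cup?", "2020-01-01"))
```

In the paper's terms, the direct query above is the easy case; the harder case is a question that never mentions a post-cutoff fact but whose answer causally depends on one, where prompted forgetting tends to fail.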
Related papers
- Parametric Knowledge is Not All You Need: Toward Honest Large Language Models via Retrieval of Pretraining Data [33.6173339938215]
Large language models (LLMs) are highly capable of answering questions, but they are often unaware of their own knowledge boundary. Rather than hallucinating, a language model should be more honest and respond with "I don't know" when it does not have enough knowledge about a topic.
arXiv Detail & Related papers (2026-01-29T03:32:09Z) - LLMLagBench: Identifying Temporal Training Boundaries in Large Language Models [0.0]
Large Language Models (LLMs) are pretrained on textual data up to a specific temporal cutoff. LLMs may inadvertently blend outdated time-sensitive information with general knowledge during reasoning tasks.
arXiv Detail & Related papers (2025-11-15T09:08:10Z) - Realizing LLMs' Causal Potential Requires Science-Grounded, Novel Benchmarks [20.409472830397455]
Recent claims of strong performance by Large Language Models (LLMs) on causal discovery are undermined by a key flaw: many evaluations rely on benchmarks likely included in pretraining corpora. We challenge this narrative by asking: Do LLMs truly reason about causal structure, and how can we measure it without memorization concerns? We argue that realizing LLMs' potential for causal analysis requires two shifts: (P.1) developing robust evaluation protocols based on recent scientific studies to guard against dataset leakage, and (P.2) designing hybrid methods that combine LLM-derived knowledge with data-driven statistics.
arXiv Detail & Related papers (2025-10-18T14:58:04Z) - ExAnte: A Benchmark for Ex-Ante Inference in Large Language Models [12.948099229475265]
Large language models (LLMs) face significant challenges in ex-ante reasoning. Even with explicit prompts enforcing temporal cutoffs, LLMs often generate outputs influenced by internalized knowledge of events beyond the specified cutoff. This paper introduces a novel task and benchmark designed to evaluate the ability of LLMs to reason while adhering to such temporal constraints.
arXiv Detail & Related papers (2025-05-26T05:39:57Z) - Are the Hidden States Hiding Something? Testing the Limits of Factuality-Encoding Capabilities in LLMs [48.202202256201815]
Factual hallucinations are a major challenge for Large Language Models (LLMs). They undermine reliability and user trust when models generate inaccurate or fabricated content. Recent studies suggest that when generating false statements, the internal states of LLMs encode information about truthfulness.
arXiv Detail & Related papers (2025-05-22T11:00:53Z) - LLM-based Query Expansion Fails for Unfamiliar and Ambiguous Queries [5.561044064438963]
Large language models (LLMs) offer an effective alternative to traditional rule-based and statistical methods for query expansion.
arXiv Detail & Related papers (2025-05-19T04:33:09Z) - Inside-Out: Hidden Factual Knowledge in LLMs [50.79758420289131]
This work presents a framework for assessing whether large language models (LLMs) encode more factual knowledge in their parameters than what they express in their outputs. We first propose a formal definition of knowledge, quantifying it for a given question as the fraction of correct-incorrect answer pairs where the correct one is ranked higher (a minimal sketch of this pairwise definition appears after this list). We then present a case study, applying this framework to three popular open-weights LLMs in a closed-book QA setup.
arXiv Detail & Related papers (2025-03-19T15:21:48Z) - Are LLMs Really Not Knowledgable? Mining the Submerged Knowledge in LLMs' Memory [15.986679553468989]
Large language models (LLMs) have shown promise as potential knowledge bases. LLMs often struggle with question-answering tasks and are prone to hallucinations. We develop SkipUnsure, a method to improve answer accuracy by leveraging detected but unexpressed knowledge.
arXiv Detail & Related papers (2024-12-30T10:29:18Z) - Dynamic Uncertainty Ranking: Enhancing Retrieval-Augmented In-Context Learning for Long-Tail Knowledge in LLMs [50.29035873837]
Large language models (LLMs) can learn vast amounts of knowledge from diverse domains during pre-training. Long-tail knowledge from specialized domains is often scarce and underrepresented, rarely appearing in the models' memorized knowledge. We propose a reinforcement learning-based dynamic uncertainty ranking method for retrieval-augmented in-context learning (ICL) that accounts for the varying impact of each retrieved sample on LLM predictions.
arXiv Detail & Related papers (2024-10-31T03:42:17Z) - Utilize the Flow before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning [68.57166425493283]
Refusal-Aware Instruction Tuning (RAIT) enables Large Language Models (LLMs) to refuse to answer unknown questions. However, this crude approach can cause LLMs to excessively refuse questions they could have answered correctly. We introduce Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning (CRaFT) to address this issue.
arXiv Detail & Related papers (2024-10-09T14:12:51Z) - Prompting Large Language Models with Knowledge Graphs for Question Answering Involving Long-tail Facts [50.06633829833144]
Large Language Models (LLMs) are effective in performing various NLP tasks, but struggle to handle tasks that require extensive, real-world knowledge.
We propose a benchmark whose questions require knowledge of long-tail facts to answer.
Our experiments show that LLMs alone struggle with answering these questions, especially when the long-tail level is high or rich knowledge is required.
arXiv Detail & Related papers (2024-05-10T15:10:20Z) - Understanding Privacy Risks of Embeddings Induced by Large Language Models [75.96257812857554]
Large language models show early signs of artificial general intelligence but struggle with hallucinations.
One promising solution is to store external knowledge as embeddings, aiding LLMs in retrieval-augmented generation.
Recent studies experimentally showed that the original text can be partially reconstructed from text embeddings by pre-trained language models.
arXiv Detail & Related papers (2024-04-25T13:10:48Z) - Dated Data: Tracing Knowledge Cutoffs in Large Language Models [47.987664966633865]
We propose a simple approach to estimate effective cutoffs on the resource-level temporal alignment of an LLM.
We find that effective cutoffs often differ from reported cutoffs.
Our analysis reveals two reasons for these inconsistencies: (1) temporal biases of CommonCrawl data due to non-trivial amounts of old data in new dumps and (2) complications in LLM deduplication schemes involving semantic duplicates and lexical near-duplicates.
arXiv Detail & Related papers (2024-03-19T17:57:58Z) - DocTER: Evaluating Document-based Knowledge Editing [53.14000724633775]
We explore knowledge editing using easily accessible documents instead of manually labeled factual triples. A comprehensive four-perspective evaluation is introduced: Edit Success, Locality, Reasoning, and Cross-lingual Transfer. Experiments on popular knowledge editing methods demonstrate that editing with documents presents significantly greater challenges than using triples.
arXiv Detail & Related papers (2023-08-19T09:17:19Z)
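As referenced in the Inside-Out entry above, its definition of knowledge is a pairwise ranking fraction, which is simple to state in code. A minimal sketch follows; the `score` callable (standing in for something like a model's log-likelihood of each candidate answer) and the example values are hypothetical, not the paper's exact procedure.

```python
# Minimal sketch of the pairwise knowledge definition from "Inside-Out:
# Hidden Factual Knowledge in LLMs": the fraction of (correct, incorrect)
# answer pairs in which the correct answer is ranked higher. The `score`
# callable (e.g., a model's log-likelihood for each candidate answer) is a
# hypothetical stand-in, not the paper's exact scoring procedure.
from itertools import product
from typing import Callable, Sequence

def knowledge_score(
    correct: Sequence[str],
    incorrect: Sequence[str],
    score: Callable[[str], float],
) -> float:
    """Fraction of correct-incorrect pairs where the correct answer scores higher."""
    pairs = list(product(correct, incorrect))
    if not pairs:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1 for c, i in pairs if score(c) > score(i))
    return wins / len(pairs)

# Example with made-up log-likelihood-style scores for the question
# "What is the capital of France?":
scores = {"Paris": -1.2, "Lyon": -3.5, "Berlin": -2.0}
print(knowledge_score(["Paris"], ["Lyon", "Berlin"], scores.__getitem__))  # 1.0
```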
This list is automatically generated from the titles and abstracts of the papers on this site.