Harnessing Temporal Databases for Systematic Evaluation of Factual Time-Sensitive Question-Answering in Large Language Models
- URL: http://arxiv.org/abs/2508.02045v1
- Date: Mon, 04 Aug 2025 04:27:06 GMT
- Title: Harnessing Temporal Databases for Systematic Evaluation of Factual Time-Sensitive Question-Answering in Large Language Models
- Authors: Soyeon Kim, Jindong Wang, Xing Xie, Steven Euijong Whang
- Abstract summary: TDBench is a new benchmark that systematically constructs Time-Sensitive Question-Answering (TSQA) pairs. A fine-grained evaluation metric called time accuracy assesses the validity of time references in model explanations. Experiments on contemporary Large Language Models show how TDBench enables scalable and comprehensive TSQA evaluation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Facts evolve over time, making it essential for Large Language Models (LLMs) to handle time-sensitive factual knowledge accurately and reliably. While factual Time-Sensitive Question-Answering (TSQA) tasks have been widely studied, existing benchmarks often rely on manual curation or a small, fixed set of predefined templates, which restricts scalable and comprehensive TSQA evaluation. To address these challenges, we propose TDBench, a new benchmark that systematically constructs TSQA pairs by harnessing temporal databases and database techniques such as temporal SQL and functional dependencies. We also introduce a fine-grained evaluation metric called time accuracy, which assesses the validity of time references in model explanations alongside traditional answer accuracy to enable a more reliable TSQA evaluation. Extensive experiments on contemporary LLMs show how TDBench enables scalable and comprehensive TSQA evaluation while reducing the reliance on human labor, complementing existing Wikipedia/Wikidata-based TSQA evaluation approaches by enabling LLM evaluation on application-specific data and seamless multi-hop question generation. Code and data are publicly available at: https://github.com/ssoy0701/tdbench.git.
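To make the core idea concrete, here is a minimal sketch of how a temporal database can mechanically yield TSQA pairs: facts are stored with validity intervals and an "as of" query over those intervals produces a time-anchored question and its gold answer. This is an illustration only, not TDBench's actual pipeline; the schema, data, and `tsqa_pair` helper are invented for this example.

```python
import sqlite3

# Illustrative temporal fact table: each row carries a validity interval.
# Dates are ISO strings, so lexicographic comparison matches chronological order.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE position (person TEXT, role TEXT, valid_from TEXT, valid_to TEXT)"
)
conn.executemany(
    "INSERT INTO position VALUES (?, ?, ?, ?)",
    [
        ("A. Smith", "CEO of ExampleCorp", "2015-01-01", "2019-06-30"),
        ("B. Jones", "CEO of ExampleCorp", "2019-07-01", "9999-12-31"),
    ],
)

def tsqa_pair(as_of: str):
    """Derive one time-sensitive QA pair valid at `as_of` (YYYY-MM-DD)."""
    person, role = conn.execute(
        "SELECT person, role FROM position "
        "WHERE valid_from <= ? AND ? <= valid_to",
        (as_of, as_of),
    ).fetchone()
    return (f"Who was the {role} on {as_of}?", person)

q, a = tsqa_pair("2018-03-15")
print(q, "->", a)  # Who was the CEO of ExampleCorp on 2018-03-15? -> A. Smith
```

Because the question, answer, and time reference all come from the same interval-stamped row, answer correctness and time validity can both be checked programmatically, which is the property the paper's time-accuracy metric builds on.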
Related papers
- The benefits of query-based KGQA systems for complex and temporal questions in LLM era [55.20230501807337]
Large language models excel in question-answering (QA) yet still struggle with multi-hop reasoning and temporal questions. Query-based knowledge graph QA (KGQA) offers a modular alternative by generating executable queries instead of direct answers. We explore a multi-stage query-based framework for Wikidata QA that enhances performance on challenging multi-hop and temporal benchmarks.
arXiv Detail & Related papers (2025-07-16T06:41:03Z)
- Evaluating List Construction and Temporal Understanding capabilities of Large Language Models [54.39278049092508]
Large Language Models (LLMs) are susceptible to hallucinations and errors, particularly on temporal understanding tasks. We propose the Time-referenced List-based Question Answering (TLQA) benchmark, which requires structured answers in list format aligned with corresponding time periods. We investigate the temporal understanding and list construction capabilities of state-of-the-art generative models on TLQA in closed-book and open-domain settings.
arXiv Detail & Related papers (2025-06-26T21:40:58Z)
- Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time [0.0]
In real-world scenarios, the correctness of answers is frequently tied to temporal context. We present a novel framework and dataset spanning over 8,000 events from 2018 to 2024. Our work provides a significant step toward advancing time-aware language models.
arXiv Detail & Related papers (2024-09-20T08:57:20Z)
- UnSeenTimeQA: Time-Sensitive Question-Answering Beyond LLMs' Memorization [34.257914212541394]
This paper introduces UnSeenTimeQA, a novel, data-contamination-free, time-sensitive question-answering benchmark. It differs from existing TSQA benchmarks by avoiding web-searchable queries grounded in the real world. It requires large language models (LLMs) to engage in genuine temporal reasoning without relying on factual knowledge acquired during pre-training.
arXiv Detail & Related papers (2024-07-03T22:02:07Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is preferred by human annotators over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Self-Improvement Programming for Temporal Knowledge Graph Question Answering [31.33908040172437]
Temporal Knowledge Graph Question Answering (TKGQA) aims to answer questions with temporal intent over Temporal Knowledge Graphs (TKGs).
Existing end-to-end methods implicitly model the time constraints by learning time-aware embeddings of questions and candidate answers.
We introduce a novel self-improvement programming method for TKGQA (Prog-TQA).
arXiv Detail & Related papers (2024-04-02T08:14:27Z)
- Towards Robust Temporal Reasoning of Large Language Models via a Multi-Hop QA Dataset and Pseudo-Instruction Tuning [73.51314109184197]
It is crucial for large language models (LLMs) to understand the concept of temporal knowledge.
We propose a complex temporal question-answering dataset Complex-TR that focuses on multi-answer and multi-hop temporal reasoning.
arXiv Detail & Related papers (2023-11-16T11:49:29Z)
- A Benchmark for Generalizable and Interpretable Temporal Question Answering over Knowledge Bases [67.33560134350427]
TempQA-WD is a benchmark dataset for temporal reasoning.
It is based on Wikidata, one of the most frequently curated, openly available knowledge bases.
arXiv Detail & Related papers (2022-01-15T08:49:09Z)