A Question Answering Dataset for Temporal-Sensitive Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2508.12282v1
- Date: Sun, 17 Aug 2025 08:12:59 GMT
- Title: A Question Answering Dataset for Temporal-Sensitive Retrieval-Augmented Generation
- Authors: Ziyang Chen, Erxue Min, Xiang Zhao, Yunxin Li, Xin Jia, Jinzhi Liao, Jichao Li, Shuaiqiang Wang, Baotian Hu, Dawei Yin
- Abstract summary: ChronoQA is a large-scale benchmark dataset for Chinese question answering. It contains 5,176 high-quality questions covering absolute, aggregate, and relative temporal types with both explicit and implicit time expressions.
- Score: 40.00268164578221
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce ChronoQA, a large-scale benchmark dataset for Chinese question answering, specifically designed to evaluate temporal reasoning in Retrieval-Augmented Generation (RAG) systems. ChronoQA is constructed from over 300,000 news articles published between 2019 and 2024, and contains 5,176 high-quality questions covering absolute, aggregate, and relative temporal types with both explicit and implicit time expressions. The dataset supports both single- and multi-document scenarios, reflecting the real-world requirements for temporal alignment and logical consistency. ChronoQA features comprehensive structural annotations and has undergone multi-stage validation, including rule-based, LLM-based, and human evaluation, to ensure data quality. By providing a dynamic, reliable, and scalable resource, ChronoQA enables structured evaluation across a wide range of temporal tasks, and serves as a robust benchmark for advancing time-sensitive retrieval-augmented question answering systems.
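To make the abstract's taxonomy concrete, here is a minimal sketch of what ChronoQA-style records might look like. The field names and example questions are assumptions for illustration only, not the dataset's actual schema; the taxonomy itself (absolute/aggregate/relative temporal types, explicit/implicit time expressions, single- vs. multi-document scenarios) comes from the abstract.

```python
# Hypothetical ChronoQA-style records (field names and questions are
# illustrative assumptions, NOT the dataset's real schema).
EXAMPLES = [
    {
        "question": "Which company won the cloud contract in March 2021?",
        "temporal_type": "absolute",      # anchored to a fixed date
        "time_expression": "explicit",    # the date appears in the question
        "num_evidence_docs": 1,           # single-document scenario
    },
    {
        "question": "How many rate cuts were announced during the pandemic?",
        "temporal_type": "aggregate",     # counts events over an interval
        "time_expression": "implicit",    # "the pandemic" must be resolved
        "num_evidence_docs": 4,           # multi-document scenario
    },
    {
        "question": "What happened to the index the week after the merger?",
        "temporal_type": "relative",      # ordered w.r.t. another event
        "time_expression": "implicit",
        "num_evidence_docs": 2,
    },
]

def filter_questions(records, temporal_type=None, multi_doc=None):
    """Select records by temporal type and single-/multi-document scenario."""
    out = []
    for r in records:
        if temporal_type is not None and r["temporal_type"] != temporal_type:
            continue
        if multi_doc is not None and (r["num_evidence_docs"] > 1) != multi_doc:
            continue
        out.append(r)
    return out

print(len(filter_questions(EXAMPLES, temporal_type="aggregate")))  # 1
print(len(filter_questions(EXAMPLES, multi_doc=True)))             # 2
```

A filter like this is how "structured evaluation across a wide range of temporal tasks" is typically operationalized: scores are reported per temporal type and per retrieval scenario rather than as a single aggregate.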
Related papers
- It's TIME: Towards the Next Generation of Time Series Forecasting Benchmarks [87.7937890373758]
Time series foundation models (TSFMs) are revolutionizing the forecasting landscape, shifting from specific dataset modeling to generalizable task evaluation. We introduce TIME, a next-generation task-centric benchmark comprising 50 fresh datasets and 98 forecasting tasks. We propose a novel pattern-level evaluation perspective that moves beyond traditional dataset-level evaluations based on static meta labels.
arXiv Detail & Related papers (2026-02-12T16:31:01Z) - GISA: A Benchmark for General Information-Seeking Assistant [102.30831921333755]
GISA is a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries. It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization. Experiments on mainstream LLMs and commercial search products reveal that even the best-performing model achieves only a 19.30% exact match score.
arXiv Detail & Related papers (2026-02-09T11:44:15Z) - Harnessing Temporal Databases for Systematic Evaluation of Factual Time-Sensitive Question-Answering in Large Language Models [38.12930048471948]
TDBench is a new benchmark that systematically constructs time-sensitive question-answering pairs. A fine-grained evaluation metric called time accuracy assesses the validity of time references in model explanations. Experiments on contemporary Large Language Models show how TDBench enables scalable and comprehensive TSQA evaluation.
arXiv Detail & Related papers (2025-08-04T04:27:06Z) - The benefits of query-based KGQA systems for complex and temporal questions in LLM era [55.20230501807337]
Large language models excel at question answering (QA) yet still struggle with multi-hop reasoning and temporal questions. Query-based knowledge graph QA (KGQA) offers a modular alternative by generating executable queries instead of direct answers. We propose a multi-stage query-based framework for WikiData QA that enhances performance on challenging multi-hop and temporal benchmarks.
arXiv Detail & Related papers (2025-07-16T06:41:03Z) - Respecting Temporal-Causal Consistency: Entity-Event Knowledge Graphs for Retrieval-Augmented Generation [69.45495166424642]
We develop a robust and discriminative QA benchmark to measure temporal, causal, and character consistency understanding in narrative documents. We then introduce Entity-Event RAG (E2RAG), a dual-graph framework that keeps separate entity and event subgraphs linked by a bipartite mapping. Across ChronoQA, our approach outperforms state-of-the-art unstructured and KG-based RAG baselines, with notable gains on causal and character consistency queries.
arXiv Detail & Related papers (2025-06-06T10:07:21Z) - It's High Time: A Survey of Temporal Question Answering [17.07150094603319]
Temporal Question Answering (TQA) focuses on answering questions involving temporal constraints or context. This survey covers recent advances in TQA enabled by neural models and Large Language Models (LLMs), along with benchmark datasets and evaluation strategies designed to test temporal robustness, recency awareness, and generalization.
arXiv Detail & Related papers (2025-05-26T17:21:26Z) - TempRetriever: Fusion-based Temporal Dense Passage Retrieval for Time-Sensitive Questions [18.87473448633352]
We propose TempRetriever, which explicitly incorporates temporal information by embedding both the query date and the document timestamp into the retrieval process. TempRetriever achieves a 6.63% improvement in Top-1 retrieval accuracy and a 3.79% improvement in NDCG@10 over standard DPR on ArchivalQA. We also propose a novel time-based negative sampling strategy that further enhances retrieval performance by addressing temporal misalignment during training.
arXiv Detail & Related papers (2025-02-28T13:06:25Z) - Time-MQA: Time Series Multi-Task Question Answering with Context Enhancement [55.2439260314328]
Time Series Multi-Task Question Answering (Time-MQA) is a unified framework that enables natural language queries across multiple time series tasks. Central to Time-MQA is the TSQA dataset, a large-scale dataset containing approximately 200k question-answer pairs.
arXiv Detail & Related papers (2025-02-26T13:47:13Z) - TimeLogic: A Temporal Logic Benchmark for Video QA [64.32208175236323]
We introduce the TimeLogic QA (TLQA) framework to automatically generate temporal logic questions. We leverage four datasets, STAR, Breakfast, AGQA, and CrossTask, and generate 2k and 10k QA pairs for each category. We assess VideoQA models' temporal reasoning performance on 16 categories of temporal logic with varying temporal complexity.
arXiv Detail & Related papers (2025-01-13T11:12:59Z) - ComplexTempQA: A Large-Scale Dataset for Complex Temporal Question Answering [24.046966640011124]
ComplexTempQA is a large-scale dataset consisting of over 100 million question-answer pairs.
The dataset covers questions spanning over two decades and offers an unmatched breadth of topics.
arXiv Detail & Related papers (2024-06-07T12:01:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.