A Question Answering Dataset for Temporal-Sensitive Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2508.12282v1
- Date: Sun, 17 Aug 2025 08:12:59 GMT
- Title: A Question Answering Dataset for Temporal-Sensitive Retrieval-Augmented Generation
- Authors: Ziyang Chen, Erxue Min, Xiang Zhao, Yunxin Li, Xin Jia, Jinzhi Liao, Jichao Li, Shuaiqiang Wang, Baotian Hu, Dawei Yin
- Abstract summary: ChronoQA is a large-scale benchmark dataset for Chinese question answering. It contains 5,176 high-quality questions covering absolute, aggregate, and relative temporal types with both explicit and implicit time expressions.
- Score: 40.00268164578221
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce ChronoQA, a large-scale benchmark dataset for Chinese question answering, specifically designed to evaluate temporal reasoning in Retrieval-Augmented Generation (RAG) systems. ChronoQA is constructed from over 300,000 news articles published between 2019 and 2024, and contains 5,176 high-quality questions covering absolute, aggregate, and relative temporal types with both explicit and implicit time expressions. The dataset supports both single- and multi-document scenarios, reflecting the real-world requirements for temporal alignment and logical consistency. ChronoQA features comprehensive structural annotations and has undergone multi-stage validation, including rule-based, LLM-based, and human evaluation, to ensure data quality. By providing a dynamic, reliable, and scalable resource, ChronoQA enables structured evaluation across a wide range of temporal tasks, and serves as a robust benchmark for advancing time-sensitive retrieval-augmented question answering systems.
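To make the abstract's taxonomy concrete, here is a minimal sketch of what ChronoQA-style records might look like. The field names and example questions are assumptions for illustration only, not the dataset's actual schema; the taxonomy itself (absolute/aggregate/relative temporal types, explicit/implicit time expressions, single- vs. multi-document scenarios) comes from the abstract.

```python
# Hypothetical ChronoQA-style records (field names and questions are
# illustrative assumptions, NOT the dataset's real schema).
EXAMPLES = [
    {
        "question": "Which company won the cloud contract in March 2021?",
        "temporal_type": "absolute",      # anchored to a fixed date
        "time_expression": "explicit",    # the date appears in the question
        "num_evidence_docs": 1,           # single-document scenario
    },
    {
        "question": "How many rate cuts were announced during the pandemic?",
        "temporal_type": "aggregate",     # counts events over an interval
        "time_expression": "implicit",    # "the pandemic" must be resolved
        "num_evidence_docs": 4,           # multi-document scenario
    },
    {
        "question": "What happened to the index the week after the merger?",
        "temporal_type": "relative",      # ordered w.r.t. another event
        "time_expression": "implicit",
        "num_evidence_docs": 2,
    },
]

def filter_questions(records, temporal_type=None, multi_doc=None):
    """Select records by temporal type and single-/multi-document scenario."""
    out = []
    for r in records:
        if temporal_type is not None and r["temporal_type"] != temporal_type:
            continue
        if multi_doc is not None and (r["num_evidence_docs"] > 1) != multi_doc:
            continue
        out.append(r)
    return out

print(len(filter_questions(EXAMPLES, temporal_type="aggregate")))  # 1
print(len(filter_questions(EXAMPLES, multi_doc=True)))             # 2
```

A filter like this is how "structured evaluation across a wide range of temporal tasks" is typically operationalized: scores are reported per temporal type and per retrieval scenario rather than as a single aggregate.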
Related papers
- It's TIME: Towards the Next Generation of Time Series Forecasting Benchmarks [87.7937890373758]
Time series foundation models (TSFMs) are revolutionizing the forecasting landscape, shifting from specific dataset modeling to generalizable task evaluation. We introduce TIME, a next-generation task-centric benchmark comprising 50 fresh datasets and 98 forecasting tasks. We propose a novel pattern-level evaluation perspective that moves beyond traditional dataset-level evaluations based on static meta labels.
arXiv Detail & Related papers (2026-02-12T16:31:01Z) - GISA: A Benchmark for General Information-Seeking Assistant [102.30831921333755]
GISA is a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries. It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization. Experiments on mainstream LLMs and commercial search products reveal that even the best-performing model achieves only a 19.30% exact match score.
arXiv Detail & Related papers (2026-02-09T11:44:15Z) - Harnessing Temporal Databases for Systematic Evaluation of Factual Time-Sensitive Question-Answering in Large Language Models [38.12930048471948]
TDBench is a new benchmark that systematically constructs time-sensitive question-answering pairs. A fine-grained evaluation metric called time accuracy assesses the validity of time references in model explanations. Experiments on contemporary Large Language Models show how TDBench enables scalable and comprehensive TSQA evaluation.
arXiv Detail & Related papers (2025-08-04T04:27:06Z) - The benefits of query-based KGQA systems for complex and temporal questions in LLM era [55.20230501807337]
Large language models excel at question answering (QA) yet still struggle with multi-hop reasoning and temporal questions. Query-based knowledge graph QA (KGQA) offers a modular alternative by generating executable queries instead of direct answers. We propose a multi-stage query-based framework for WikiData QA that enhances performance on challenging multi-hop and temporal benchmarks.
arXiv Detail & Related papers (2025-07-16T06:41:03Z) - Respecting Temporal-Causal Consistency: Entity-Event Knowledge Graphs for Retrieval-Augmented Generation [69.45495166424642]
We develop a robust and discriminative QA benchmark to measure temporal, causal, and character consistency understanding in narrative documents. We then introduce Entity-Event RAG (E2RAG), a dual-graph framework that keeps separate entity and event subgraphs linked by a bipartite mapping. Across ChronoQA, our approach outperforms state-of-the-art unstructured and KG-based RAG baselines, with notable gains on causal and character consistency queries.
arXiv Detail & Related papers (2025-06-06T10:07:21Z) - It's High Time: A Survey of Temporal Question Answering [17.07150094603319]
Temporal Question Answering (TQA) focuses on answering questions involving temporal constraints or context. This survey covers recent advances in TQA enabled by neural models and Large Language Models (LLMs), along with benchmark datasets and evaluation strategies designed to test temporal robustness, recency awareness, and generalization.
arXiv Detail & Related papers (2025-05-26T17:21:26Z) - TempRetriever: Fusion-based Temporal Dense Passage Retrieval for Time-Sensitive Questions [18.87473448633352]
We propose TempRetriever, which explicitly incorporates temporal information by embedding both the query date and the document timestamp into the retrieval process. TempRetriever achieves a 6.63% improvement in Top-1 retrieval accuracy and a 3.79% improvement in NDCG@10 over standard DPR on ArchivalQA. We also propose a novel time-based negative sampling strategy that further enhances retrieval performance by addressing temporal misalignment during training.
arXiv Detail & Related papers (2025-02-28T13:06:25Z) - Time-MQA: Time Series Multi-Task Question Answering with Context Enhancement [55.2439260314328]
Time Series Multi-Task Question Answering (Time-MQA) is a unified framework that enables natural language queries across multiple time series tasks. Central to Time-MQA is the TSQA dataset, a large-scale dataset containing approximately 200k question-answer pairs.
arXiv Detail & Related papers (2025-02-26T13:47:13Z) - TimeLogic: A Temporal Logic Benchmark for Video QA [64.32208175236323]
We introduce the TimeLogic QA (TLQA) framework to automatically generate temporal logic questions. We leverage four datasets, STAR, Breakfast, AGQA, and CrossTask, and generate 2k and 10k QA pairs for each category. We assess VideoQA models' temporal reasoning performance on 16 categories of temporal logic with varying temporal complexity.
arXiv Detail & Related papers (2025-01-13T11:12:59Z) - ComplexTempQA: A Large-Scale Dataset for Complex Temporal Question Answering [24.046966640011124]
ComplexTempQA is a large-scale dataset consisting of over 100 million question-answer pairs.
The dataset covers questions spanning over two decades and offers an unmatched breadth of topics.
arXiv Detail & Related papers (2024-06-07T12:01:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.