TEMPO: A Realistic Multi-Domain Benchmark for Temporal Reasoning-Intensive Retrieval
- URL: http://arxiv.org/abs/2601.09523v1
- Date: Wed, 14 Jan 2026 14:45:20 GMT
- Title: TEMPO: A Realistic Multi-Domain Benchmark for Temporal Reasoning-Intensive Retrieval
- Authors: Abdelrahman Abdallah, Mohammed Ali, Muhammad Abdul-Mageed, Adam Jatowt
- Abstract summary: Existing temporal QA benchmarks focus on fact-seeking queries from news corpora. We introduce TEMPO, the first benchmark combining temporal reasoning with reasoning-intensive retrieval across 13 domains.
- Score: 44.94371780739013
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing temporal QA benchmarks focus on simple fact-seeking queries from news corpora, while reasoning-intensive retrieval benchmarks lack temporal grounding. However, real-world information needs often require reasoning about temporal evolution and synthesizing evidence across time periods. We introduce TEMPO, the first benchmark combining temporal reasoning with reasoning-intensive retrieval across 13 domains. TEMPO features: (1) 1,730 complex queries requiring deep temporal reasoning such as tracking changes, identifying trends, or comparing cross-period evidence; (2) step-wise retrieval planning with 3,976 decomposed steps and gold documents mapped to each step for multi-hop evaluation; and (3) novel temporal metrics including Temporal Coverage@k and Temporal Precision@k measuring whether results span required time periods. Evaluation of 12 retrieval systems reveals substantial challenges: the best model (DiVeR) achieves only 32.0 NDCG@10 and 71.4% Temporal Coverage@10, demonstrating difficulty in retrieving temporally complete evidence. We believe TEMPO provides a challenging benchmark for improving temporal reasoning in retrieval and RAG systems. Our code and data are available at https://github.com/tempo-bench/Tempo. See also our official website: https://tempo-bench.github.io/.
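The abstract describes Temporal Coverage@k and Temporal Precision@k only informally, as measuring whether top-k results span the time periods a query requires. A minimal Python sketch under an assumed interpretation (each retrieved document carries one time-period label, and each query demands a set of required periods; the paper's exact formulation may differ):

```python
def temporal_coverage_at_k(retrieved, required_periods, k):
    """Fraction of the query's required time periods that appear among the
    top-k results. `retrieved` is a ranked list of (doc_id, period) pairs;
    `required_periods` is the set of periods the query demands evidence from.
    Assumed interpretation -- TEMPO's official definition may differ."""
    covered = {period for _, period in retrieved[:k]}
    return len(covered & set(required_periods)) / len(required_periods)

def temporal_precision_at_k(retrieved, required_periods, k):
    """Fraction of the top-k results whose period is one the query requires."""
    top_k = retrieved[:k]
    hits = sum(1 for _, period in top_k if period in required_periods)
    return hits / len(top_k)
```

Coverage rewards spanning all required periods (duplicates of one period do not help), while precision penalizes results from irrelevant periods; a system can score high on one and low on the other.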
Related papers
- Memory-T1: Reinforcement Learning for Temporal Reasoning in Multi-session Agents [80.33280979339123]
We introduce Memory-T1, a framework that learns a time-aware memory selection policy using reinforcement learning (RL). On the Time-Dialog benchmark, Memory-T1 boosts a 7B model to an overall score of 67.0%, establishing a new state-of-the-art performance for open-source models.
arXiv Detail & Related papers (2025-12-23T06:37:29Z) - Not in Sync: Unveiling Temporal Bias in Audio Chat Models [59.146710538620816]
Large Audio Language Models (LALMs) are increasingly applied to audio understanding and multimodal reasoning. We present the first systematic study of temporal bias in LALMs, revealing a key limitation in their timestamp prediction.
arXiv Detail & Related papers (2025-10-14T06:29:40Z) - TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models [105.47481207029047]
We introduce the Time Series Reasoning Suite (TSR-Suite), which formalizes four atomic tasks that span three fundamental capabilities for reasoning with time series. We also introduce TimeOmni-1, the first unified reasoning model designed to address diverse real-world problems demanding time series reasoning.
arXiv Detail & Related papers (2025-09-29T13:54:34Z) - Re3: Learning to Balance Relevance & Recency for Temporal Information Retrieval [10.939002113975706]
Temporal Information Retrieval is a critical yet unresolved task for modern search systems. Re3 is a framework that balances semantic and temporal information through a query-aware gating mechanism. On Re2Bench, Re3 achieves state-of-the-art results, leading in R@1 across all three subsets.
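The query-aware gating described for Re3 is only sketched at a high level in this summary. The general pattern can be illustrated with a scalar gate that blends a semantic and a temporal relevance score per document; in a learned system the gate would be produced by a small network over the query representation. The function and score definitions below are hypothetical, not Re3's actual formulation:

```python
def gated_score(sem_score, temp_score, gate):
    """Blend semantic and temporal relevance with a query-dependent gate
    in [0, 1]. gate=1 trusts semantics alone; gate=0 trusts recency/time
    alone. Hypothetical illustration of the general gating pattern."""
    return gate * sem_score + (1.0 - gate) * temp_score

def rank_documents(docs, gate):
    """Rank (doc_id, sem_score, temp_score) triples by the gated score."""
    return sorted(docs, key=lambda d: gated_score(d[1], d[2], gate), reverse=True)
```

A time-sensitive query ("latest guidance on X") would push the gate toward the temporal component, reordering documents that are semantically similar but stale.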
arXiv Detail & Related papers (2025-09-01T09:44:01Z) - Reading Between the Timelines: RAG for Answering Diachronic Questions [8.969698902720799]
We propose a new framework that fundamentally redesigns the RAG pipeline to infuse temporal logic. Our approach yields substantial gains in answer accuracy, surpassing standard RAG implementations by 13% to 27%. This work provides a validated pathway toward RAG systems capable of performing the nuanced, evolutionary analysis required for complex, real-world questions.
arXiv Detail & Related papers (2025-07-21T05:19:41Z) - Temporal Information Retrieval via Time-Specifier Model Merging [9.690250070561461]
Time-Specifier Model Merging (TSM) is a novel method that enhances temporal retrieval while preserving accuracy on non-temporal queries. Extensive experiments on both temporal and non-temporal datasets demonstrate that TSM significantly improves performance on temporally constrained queries.
arXiv Detail & Related papers (2025-07-09T12:16:11Z) - TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios [34.611056451149416]
We propose TIME, a benchmark designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. We conduct extensive experiments on both reasoning and non-reasoning models. We release TIME-Lite, a human-annotated subset, to foster future research and standardized evaluation in temporal reasoning.
arXiv Detail & Related papers (2025-05-19T09:22:02Z) - TempRetriever: Fusion-based Temporal Dense Passage Retrieval for Time-Sensitive Questions [18.87473448633352]
We propose TempRetriever, which explicitly incorporates temporal information by embedding both the query date and document timestamp into the retrieval process. TempRetriever achieves a 6.63% improvement in Top-1 retrieval accuracy and a 3.79% improvement in NDCG@10 compared to standard DPR on ArchivalQA. We also propose a novel, time-based negative sampling strategy which further enhances retrieval performance by addressing temporal misalignment during training.
arXiv Detail & Related papers (2025-02-28T13:06:25Z) - BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval [54.54576644403115]
We introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents. Our dataset consists of 1,384 real-world queries spanning diverse domains, such as economics, psychology, mathematics, and coding. We show that incorporating explicit reasoning about the query improves retrieval performance by up to 12.2 points.
arXiv Detail & Related papers (2024-07-16T17:58:27Z) - Living in the Moment: Can Large Language Models Grasp Co-Temporal Reasoning? [70.19200858203388]
Temporal reasoning is fundamental for large language models to comprehend the world.
CoTempQA is a benchmark containing four co-temporal scenarios.
Our experiments reveal a significant gap between the performance of current LLMs and human-level reasoning.
arXiv Detail & Related papers (2024-06-13T12:56:21Z) - A Dataset for Answering Time-Sensitive Questions [88.95075983560331]
Time is an important dimension in our physical world. Lots of facts can evolve with respect to time.
It is important to consider the time dimension and empower the existing QA models to reason over time.
The existing QA datasets contain rather few time-sensitive questions, hence not suitable for diagnosing or benchmarking the model's temporal reasoning capability.
arXiv Detail & Related papers (2021-08-13T16:42:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.