Related papers: TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

URL: http://arxiv.org/abs/2505.12891v2
Date: Sat, 19 Jul 2025 04:52:39 GMT
Title: TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios
Authors: Shaohang Wei, Wei Li, Feifan Song, Wen Luo, Tianyi Zhuang, Haochen Tan, Zhijiang Guo, Houfeng Wang
Abstract summary: We propose a benchmark TIME, designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. We conduct extensive experiments on reasoning models and non-reasoning models. We release TIME-Lite, a human-annotated subset to foster future research and standardized evaluation in temporal reasoning.
Score: 26.668042778743835
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing works neglect the real-world challenges for temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose a multi-level benchmark TIME, designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. This benchmark encompasses 3 sub-datasets reflecting different real-world challenges: TIME-Wiki, TIME-News, and TIME-Dial. We conduct extensive experiments on reasoning models and non-reasoning models. We also conduct an in-depth analysis of temporal reasoning performance across diverse real-world scenarios and tasks, and summarize the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TIME-Lite, a human-annotated subset to foster future research and standardized evaluation in temporal reasoning. The code is available at https://github.com/sylvain-wei/TIME , and the dataset is available at https://huggingface.co/datasets/SylvainWei/TIME .

Related papers

Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time [0.0]
In real-world scenarios, the correctness of answers is frequently tied to temporal context. We present a novel framework and dataset spanning over 8,000 events from 2018 to 2024. Our work provides a significant step toward advancing time-aware language models.
arXiv Detail & Related papers (2024-09-20T08:57:20Z)
Living in the Moment: Can Large Language Models Grasp Co-Temporal Reasoning? [70.19200858203388]
Temporal reasoning is fundamental for large language models to comprehend the world. CoTempQA is a benchmark containing four co-temporal scenarios. Our experiments reveal a significant gap between the performance of current LLMs and human-level reasoning.
arXiv Detail & Related papers (2024-06-13T12:56:21Z)
TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models [29.656403397725395]
We propose TimeBench, a comprehensive hierarchical temporal reasoning benchmark. TimeBench provides a thorough evaluation for investigating the temporal reasoning capabilities of large language models. Our experimental results indicate a significant performance gap between the state-of-the-art LLMs and humans.
arXiv Detail & Related papers (2023-11-29T14:30:16Z)
Towards Robust Temporal Reasoning of Large Language Models via a Multi-Hop QA Dataset and Pseudo-Instruction Tuning [73.51314109184197]
It is crucial for large language models (LLMs) to understand the concept of temporal knowledge. We propose a complex temporal question-answering dataset Complex-TR that focuses on multi-answer and multi-hop temporal reasoning.
arXiv Detail & Related papers (2023-11-16T11:49:29Z)
Unlocking Temporal Question Answering for Large Language Models with Tailor-Made Reasoning Logic [84.59255070520673]
Large language models (LLMs) face a challenge when engaging in temporal reasoning. We propose TempLogic, a novel framework designed specifically for temporal question-answering tasks.
arXiv Detail & Related papers (2023-05-24T10:57:53Z)
Generic Temporal Reasoning with Differential Analysis and Explanation [61.96034987217583]
We introduce a novel task named TODAY that bridges the gap with temporal differential analysis. TODAY evaluates whether systems can correctly understand the effect of incremental changes. We show that TODAY's supervision style and explanation annotations can be used in joint learning.
arXiv Detail & Related papers (2022-12-20T17:40:03Z)
A Dataset for Answering Time-Sensitive Questions [88.95075983560331]
Time is an important dimension in our physical world. Lots of facts can evolve with respect to time. It is important to consider the time dimension and empower the existing QA models to reason over time. The existing QA datasets contain rather few time-sensitive questions, hence not suitable for diagnosing or benchmarking the model's temporal reasoning capability.
arXiv Detail & Related papers (2021-08-13T16:42:25Z)
TIMEDIAL: Temporal Commonsense Reasoning in Dialog [43.24596551545824]
We present the first study to investigate pre-trained language models for their temporal reasoning capabilities in dialogs. We formulate TIME-DIAL as a multiple-choice cloze task with over 1.1K carefully curated dialogs. Empirical results demonstrate that even the best performing models struggle on this task compared to humans.
arXiv Detail & Related papers (2021-06-08T17:59:21Z)
Interpretable Time-series Representation Learning With Multi-Level Disentanglement [56.38489708031278]
Disentangle Time Series (DTS) is a novel disentanglement enhancement framework for sequential data. DTS generates hierarchical semantic concepts as the interpretable and disentangled representation of time-series. DTS achieves superior performance in downstream applications, with high interpretability of semantic concepts.
arXiv Detail & Related papers (2021-05-17T22:02:24Z)

This list is automatically generated from the titles and abstracts of the papers on this site.
