Related papers: TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models

URL: http://arxiv.org/abs/2311.17667v2
Date: Fri, 28 Jun 2024 10:40:26 GMT
Title: TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models
Authors: Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Haotian Wang, Ming Liu, Bing Qin,
Abstract summary: We propose TimeBench, a comprehensive hierarchical temporal reasoning benchmark. TimeBench provides a thorough evaluation for investigating the temporal reasoning capabilities of large language models. Our experimental results indicate a significant performance gap between the state-of-the-art LLMs and humans.
Score: 29.656403397725395
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Grasping the concept of time is a fundamental facet of human cognition, indispensable for truly comprehending the intricacies of the world. Previous studies typically focus on specific aspects of time, lacking a comprehensive temporal reasoning benchmark. To address this, we propose TimeBench, a comprehensive hierarchical temporal reasoning benchmark that covers a broad spectrum of temporal reasoning phenomena. TimeBench provides a thorough evaluation for investigating the temporal reasoning capabilities of large language models. We conduct extensive experiments on GPT-4, LLaMA2, and other popular LLMs under various settings. Our experimental results indicate a significant performance gap between the state-of-the-art LLMs and humans, highlighting that there is still a considerable distance to cover in temporal reasoning. Besides, LLMs exhibit capability discrepancies across different reasoning categories. Furthermore, we thoroughly analyze the impact of multiple aspects on temporal reasoning and emphasize the associated challenges. We aspire for TimeBench to serve as a comprehensive benchmark, fostering research in temporal reasoning. Resources are available at: https://github.com/zchuz/TimeBench

Related papers

Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency [59.05753942719665]
We propose a novel temporal robustness benchmark (TemRobBench) to assess the robustness of models. We evaluate 16 mainstream LMMs and find that they exhibit over-reliance on prior knowledge and textual context in adversarial environments. We design panoramic direct preference optimization (PanoDPO) to encourage LMMs to incorporate both visual and linguistic feature preferences simultaneously.
arXiv Detail & Related papers (2025-05-20T14:18:56Z)
TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios [26.668042778743835]
We propose a benchmark, TIME, designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. We conduct extensive experiments on reasoning models and non-reasoning models. We release TIME-Lite, a human-annotated subset to foster future research and standardized evaluation in temporal reasoning.
arXiv Detail & Related papers (2025-05-19T09:22:02Z)
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos. It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips. Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
Timo: Towards Better Temporal Reasoning for Language Models [38.27548375148604]
Reasoning about time is essential for Large Language Models to understand the world. We build a universal framework to handle a variety of temporal reasoning tasks. We develop Timo, a model designed to excel in temporal reasoning at the 7B and 13B scales.
arXiv Detail & Related papers (2024-06-20T10:52:14Z)
Living in the Moment: Can Large Language Models Grasp Co-Temporal Reasoning? [70.19200858203388]
Temporal reasoning is fundamental for large language models to comprehend the world. CoTempQA is a benchmark containing four co-temporal scenarios. Our experiments reveal a significant gap between the performance of current LLMs and human-level reasoning.
arXiv Detail & Related papers (2024-06-13T12:56:21Z)
Towards Robust Temporal Reasoning of Large Language Models via a Multi-Hop QA Dataset and Pseudo-Instruction Tuning [73.51314109184197]
It is crucial for large language models (LLMs) to understand the concept of temporal knowledge. We propose a complex temporal question-answering dataset Complex-TR that focuses on multi-answer and multi-hop temporal reasoning.
arXiv Detail & Related papers (2023-11-16T11:49:29Z)
Back to the Future: Towards Explainable Temporal Reasoning with Large Language Models [33.8108950744839]
We introduce the first task of explainable temporal reasoning, to predict an event's occurrence at a future timestamp based on context. We show that our method achieves the state-of-the-art performance of temporal prediction and explanation.
arXiv Detail & Related papers (2023-10-02T10:35:23Z)
Reason for Future, Act for Now: A Principled Framework for Autonomous LLM Agents with Provable Sample Efficiency [53.8779374188643]
We propose a principled framework with provable regret guarantees to orchestrate reasoning and acting. Specifically, we design a prompt template for reasoning that learns from the memory buffer and plans a future trajectory over a long horizon. At each step, the LLM agent takes the initial action of the planned trajectory ("act for now"), stores the collected feedback in the memory buffer, and reinvokes the reasoning routine to replan the future trajectory from the new state.
arXiv Detail & Related papers (2023-09-29T16:36:39Z)
Towards Benchmarking and Improving the Temporal Reasoning Capability of Large Language Models [44.670550143705746]
We introduce a comprehensive probing dataset, TempReason, to evaluate the temporal reasoning capability of large language models. Our dataset includes questions of three temporal reasoning levels. We also propose a novel learning framework to improve the temporal reasoning capability of large language models.
arXiv Detail & Related papers (2023-06-15T08:44:41Z)
Unlocking Temporal Question Answering for Large Language Models with Tailor-Made Reasoning Logic [84.59255070520673]
Large language models (LLMs) face a challenge when engaging in temporal reasoning. We propose TempLogic, a novel framework designed specifically for temporal question-answering tasks.
arXiv Detail & Related papers (2023-05-24T10:57:53Z)
Temporal Reasoning on Implicit Events from Distant Supervision [91.20159064951487]
We propose a novel temporal reasoning dataset that evaluates the degree to which systems understand implicit events. We find that state-of-the-art models struggle when predicting temporal relationships between implicit and explicit events. We propose a neuro-symbolic temporal reasoning model, SYMTIME, which exploits distant supervision signals from large-scale text and uses temporal rules to infer end times.
arXiv Detail & Related papers (2020-10-24T03:12:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.