TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models
- URL: http://arxiv.org/abs/2311.17667v2
- Date: Fri, 28 Jun 2024 10:40:26 GMT
- Title: TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models
- Authors: Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Haotian Wang, Ming Liu, Bing Qin,
- Abstract summary: We propose TimeBench, a comprehensive hierarchical temporal reasoning benchmark.
TimeBench provides a thorough evaluation for investigating the temporal reasoning capabilities of large language models.
Our experimental results indicate a significant performance gap between the state-of-the-art LLMs and humans.
- Score: 29.656403397725395
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Grasping the concept of time is a fundamental facet of human cognition, indispensable for truly comprehending the intricacies of the world. Previous studies typically focus on specific aspects of time, lacking a comprehensive temporal reasoning benchmark. To address this, we propose TimeBench, a comprehensive hierarchical temporal reasoning benchmark that covers a broad spectrum of temporal reasoning phenomena. TimeBench provides a thorough evaluation for investigating the temporal reasoning capabilities of large language models. We conduct extensive experiments on GPT-4, LLaMA2, and other popular LLMs under various settings. Our experimental results indicate a significant performance gap between the state-of-the-art LLMs and humans, highlighting that there is still a considerable distance to cover in temporal reasoning. Besides, LLMs exhibit capability discrepancies across different reasoning categories. Furthermore, we thoroughly analyze the impact of multiple aspects on temporal reasoning and emphasize the associated challenges. We aspire for TimeBench to serve as a comprehensive benchmark, fostering research in temporal reasoning. Resources are available at: https://github.com/zchuz/TimeBench
Related papers
- Timo: Towards Better Temporal Reasoning for Language Models [38.27548375148604]
Reasoning about time is essential for Large Language Models to understand the world.
We build a universal framework to handle a variety of temporal reasoning tasks.
We develop Timo, a model designed to excel in temporal reasoning at the 7B and 13B scales.
arXiv Detail & Related papers (2024-06-20T10:52:14Z) - Living in the Moment: Can Large Language Models Grasp Co-Temporal Reasoning? [70.19200858203388]
Temporal reasoning is fundamental for large language models to comprehend the world.
CoTempQA is a benchmark containing four co-temporal scenarios.
Our experiments reveal a significant gap between the performance of current LLMs and human-level reasoning.
arXiv Detail & Related papers (2024-06-13T12:56:21Z) - LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models [52.03659714625452]
Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks.
But, can they really "reason" over the natural language?
This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied.
arXiv Detail & Related papers (2024-04-23T21:08:49Z) - Towards Robust Temporal Reasoning of Large Language Models via a Multi-Hop QA Dataset and Pseudo-Instruction Tuning [73.51314109184197]
It is crucial for large language models (LLMs) to understand the concept of temporal knowledge.
We propose a complex temporal question-answering dataset Complex-TR that focuses on multi-answer and multi-hop temporal reasoning.
arXiv Detail & Related papers (2023-11-16T11:49:29Z) - MenatQA: A New Dataset for Testing the Temporal Comprehension and
Reasoning Abilities of Large Language Models [17.322480769274062]
Large language models (LLMs) have shown nearly saturated performance on many natural language processing (NLP) tasks.
This paper constructs Multiple Sensitive Factors Time QA (MenatQA) with total 2,853 samples for evaluating the time comprehension and reasoning abilities of LLMs.
arXiv Detail & Related papers (2023-10-08T13:19:52Z) - Back to the Future: Towards Explainable Temporal Reasoning with Large
Language Models [33.8108950744839]
We introduce the first task of explainable temporal reasoning, to predict an event's occurrence at a future timestamp based on context.
We show that our method achieves the state-of-the-art performance of temporal prediction and explanation.
arXiv Detail & Related papers (2023-10-02T10:35:23Z) - Reason for Future, Act for Now: A Principled Framework for Autonomous
LLM Agents with Provable Sample Efficiency [53.8779374188643]
We propose a principled framework with provable regret guarantees to orchestrate reasoning and acting.
Specifically, we design a prompt template for reasoning that learns from the memory buffer and plans a future trajectory over a long horizon.
At each step, the LLM agent takes the initial action of the planned trajectory ("act for now"), stores the collected feedback in the memory buffer, and reinvokes the reasoning routine to replan the future trajectory from the new state.
arXiv Detail & Related papers (2023-09-29T16:36:39Z) - Towards Benchmarking and Improving the Temporal Reasoning Capability of
Large Language Models [44.670550143705746]
We introduce a comprehensive probing dataset tempreason to evaluate the temporal reasoning capability of large language models.
Our dataset includes questions of three temporal reasoning levels.
We also propose a novel learning framework to improve the temporal reasoning capability of large language models.
arXiv Detail & Related papers (2023-06-15T08:44:41Z) - Temporal Reasoning on Implicit Events from Distant Supervision [91.20159064951487]
We propose a novel temporal reasoning dataset that evaluates the degree to which systems understand implicit events.
We find that state-of-the-art models struggle when predicting temporal relationships between implicit and explicit events.
We propose a neuro-symbolic temporal reasoning model, SYMTIME, which exploits distant supervision signals from large-scale text and uses temporal rules to infer end times.
arXiv Detail & Related papers (2020-10-24T03:12:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.