Related papers: Living in the Moment: Can Large Language Models Grasp Co-Temporal Reasoning?

URL: http://arxiv.org/abs/2406.09072v1
Date: Thu, 13 Jun 2024 12:56:21 GMT
Title: Living in the Moment: Can Large Language Models Grasp Co-Temporal Reasoning?
Authors: Zhaochen Su, Juntao Li, Jun Zhang, Tong Zhu, Xiaoye Qu, Pan Zhou, Yan Bowen, Yu Cheng, Min Zhang
Abstract summary: Temporal reasoning is fundamental for large language models to comprehend the world. CoTempQA is a benchmark containing four co-temporal scenarios. Our experiments reveal a significant gap between the performance of current LLMs and human-level reasoning.
Score: 70.19200858203388
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Temporal reasoning is fundamental for large language models (LLMs) to comprehend the world. Current temporal reasoning datasets are limited to questions about single or isolated events, falling short in mirroring the realistic temporal characteristics involving concurrent nature and intricate temporal interconnections. In this paper, we introduce CoTempQA, a comprehensive co-temporal Question Answering (QA) benchmark containing four co-temporal scenarios (Equal, Overlap, During, Mix) with 4,748 samples for evaluating the co-temporal comprehension and reasoning abilities of LLMs. Our extensive experiments reveal a significant gap between the performance of current LLMs and human-level reasoning on CoTempQA tasks. Even when enhanced with Chain of Thought (CoT) methodologies, models consistently struggle with our task. In our preliminary exploration, we discovered that mathematical reasoning plays a significant role in handling co-temporal events and proposed a strategy to boost LLMs' co-temporal reasoning from a mathematical perspective. We hope that our CoTempQA datasets will encourage further advancements in improving the co-temporal reasoning capabilities of LLMs. Our code is available at https://github.com/zhaochen0110/Cotempqa.
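The abstract names the co-temporal scenarios but does not define them. As a rough illustration only, the sketch below assumes natural interval-based readings of Equal, During, and Overlap: identical intervals, one interval contained in the other, and partial intersection. The `Interval` class, the `co_temporal_relation` function, the extra "disjoint" label, and the use of years as the time unit are all hypothetical and are not taken from the CoTempQA paper or its code. The Mix scenario is left out because the abstract gives no basis for defining it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A closed time interval. Years are an assumed unit for illustration."""
    start: int
    end: int

    def __post_init__(self):
        if self.start > self.end:
            raise ValueError("start must not be after end")


def co_temporal_relation(a: Interval, b: Interval) -> str:
    """Classify two intervals under the assumed definitions.

    These definitions are an assumption of this sketch,
    not the benchmark's official ones.
    """
    if a == b:
        return "equal"
    # No shared time point at all: not a co-temporal pair.
    if a.end < b.start or b.end < a.start:
        return "disjoint"
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    if a_in_b or b_in_a:
        return "during"
    # The intervals intersect, but neither contains the other.
    return "overlap"


if __name__ == "__main__":
    print(co_temporal_relation(Interval(2010, 2015), Interval(2010, 2015)))
    print(co_temporal_relation(Interval(2011, 2013), Interval(2010, 2015)))
    print(co_temporal_relation(Interval(2010, 2014), Interval(2012, 2016)))
```

Treating intervals as closed means two events that merely share an endpoint count as overlapping; a half-open convention would classify them as disjoint instead.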

Related papers

TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios [26.668042778743835]
We propose a benchmark TIME, designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. We conduct extensive experiments on reasoning models and non-reasoning models. We release TIME-Lite, a human-annotated subset to foster future research and standardized evaluation in temporal reasoning.
arXiv Detail & Related papers (2025-05-19T09:22:02Z)
Benchmarking Spatiotemporal Reasoning in LLMs and Reasoning Models: Capabilities and Challenges [4.668749313973097]
This paper systematically evaluates Large Language Models (LLMs) and Large Reasoning Models (LRMs) across three levels of reasoning complexity. We curate 26 challenges where models answer directly or by Python Code Interpreter. LRMs show robust performance across tasks with various levels of difficulty, often competing with or surpassing traditional first-principle-based methods.
arXiv Detail & Related papers (2025-05-16T18:32:35Z)
If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs [55.8331366739144]
We introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in large language models (LLMs). Our fact-checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking across both parametric and non-parametric approaches.
arXiv Detail & Related papers (2025-03-30T16:50:57Z)
ChronoSense: Exploring Temporal Understanding in Large Language Models with Time Intervals of Events [0.20132569095596248]
We present ChronoSense, a new benchmark for evaluating Large Language Models' temporal understanding. We assess the performance of seven recent LLMs using this benchmark and the results indicate that models handle Allen relations, even symmetrical ones, quite differently. Overall, the models' low performance highlights the need for improved temporal understanding in LLMs.
arXiv Detail & Related papers (2025-01-06T14:27:41Z)
UnSeenTimeQA: Time-Sensitive Question-Answering Beyond LLMs' Memorization [34.257914212541394]
This paper introduces UnSeenTimeQA, a novel data-contamination-free time-sensitive question-answering benchmark. It differs from existing TSQA benchmarks by avoiding web-searchable queries grounded in the real world. It requires large language models (LLMs) to engage in genuine temporal reasoning without depending on the factual knowledge acquired during the pre-training phase.
arXiv Detail & Related papers (2024-07-03T22:02:07Z)
Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs [52.42505579545893]
Large language models (LLMs) demonstrate strong reasoning abilities when prompted to generate chain-of-thought explanations alongside answers. We propose a novel discriminative and generative CoT evaluation paradigm to assess LLMs' knowledge of reasoning and the accuracy of the generated CoT.
arXiv Detail & Related papers (2024-02-17T05:22:56Z)
TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models [29.656403397725395]
We propose TimeBench, a comprehensive hierarchical temporal reasoning benchmark. TimeBench provides a thorough evaluation for investigating the temporal reasoning capabilities of large language models. Our experimental results indicate a significant performance gap between the state-of-the-art LLMs and humans.
arXiv Detail & Related papers (2023-11-29T14:30:16Z)
Towards Robust Temporal Reasoning of Large Language Models via a Multi-Hop QA Dataset and Pseudo-Instruction Tuning [73.51314109184197]
It is crucial for large language models (LLMs) to understand the concept of temporal knowledge. We propose a complex temporal question-answering dataset Complex-TR that focuses on multi-answer and multi-hop temporal reasoning.
arXiv Detail & Related papers (2023-11-16T11:49:29Z)
Back to the Future: Towards Explainable Temporal Reasoning with Large Language Models [33.8108950744839]
We introduce the first task of explainable temporal reasoning: predicting an event's occurrence at a future timestamp based on context. We show that our method achieves state-of-the-art performance in temporal prediction and explanation.
arXiv Detail & Related papers (2023-10-02T10:35:23Z)
Reason for Future, Act for Now: A Principled Framework for Autonomous LLM Agents with Provable Sample Efficiency [53.8779374188643]
We propose a principled framework with provable regret guarantees to orchestrate reasoning and acting. Specifically, we design a prompt template for reasoning that learns from the memory buffer and plans a future trajectory over a long horizon. At each step, the LLM agent takes the initial action of the planned trajectory ("act for now"), stores the collected feedback in the memory buffer, and reinvokes the reasoning routine to replan the future trajectory from the new state.
arXiv Detail & Related papers (2023-09-29T16:36:39Z)
Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.3444184685235]
We propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of "tit for tat" and a judge manages the debate process to obtain a final solution. Our framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation.
arXiv Detail & Related papers (2023-05-30T15:25:45Z)
Unlocking Temporal Question Answering for Large Language Models with Tailor-Made Reasoning Logic [84.59255070520673]
Large language models (LLMs) face a challenge when engaging in temporal reasoning. We propose TempLogic, a novel framework designed specifically for temporal question-answering tasks.
arXiv Detail & Related papers (2023-05-24T10:57:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.