Living in the Moment: Can Large Language Models Grasp Co-Temporal Reasoning?
- URL: http://arxiv.org/abs/2406.09072v1
- Date: Thu, 13 Jun 2024 12:56:21 GMT
- Title: Living in the Moment: Can Large Language Models Grasp Co-Temporal Reasoning?
- Authors: Zhaochen Su, Juntao Li, Jun Zhang, Tong Zhu, Xiaoye Qu, Pan Zhou, Yan Bowen, Yu Cheng, Min Zhang
- Abstract summary: Temporal reasoning is fundamental for large language models to comprehend the world.
CoTempQA is a benchmark containing four co-temporal scenarios.
Our experiments reveal a significant gap between the performance of current LLMs and human-level reasoning.
- Score: 70.19200858203388
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal reasoning is fundamental for large language models (LLMs) to comprehend the world. Current temporal reasoning datasets are limited to questions about single or isolated events, falling short in mirroring the realistic temporal characteristics involving concurrent nature and intricate temporal interconnections. In this paper, we introduce CoTempQA, a comprehensive co-temporal Question Answering (QA) benchmark containing four co-temporal scenarios (Equal, Overlap, During, Mix) with 4,748 samples for evaluating the co-temporal comprehension and reasoning abilities of LLMs. Our extensive experiments reveal a significant gap between the performance of current LLMs and human-level reasoning on CoTempQA tasks. Even when enhanced with Chain of Thought (CoT) methodologies, models consistently struggle with our task. In our preliminary exploration, we discovered that mathematical reasoning plays a significant role in handling co-temporal events and proposed a strategy to boost LLMs' co-temporal reasoning from a mathematical perspective. We hope that our CoTempQA datasets will encourage further advancements in improving the co-temporal reasoning capabilities of LLMs. Our code is available at https://github.com/zhaochen0110/Cotempqa.
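The abstract names the scenarios Equal, Overlap, and During without defining them. The sketch below illustrates one plausible reading in which each fact holds over a closed interval of years and the relation is decided by comparing endpoints. The `Interval` class, the function names, and the exact boundary conventions are assumptions made for illustration and are not taken from the CoTempQA code; Mix is left out because the abstract does not define it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Interval:
    """A closed time interval, here measured in whole years (an assumption)."""
    start: int
    end: int


def co_temporal_relation(a: Interval, b: Interval) -> str:
    """Classify how two intervals co-occur under the assumed definitions."""
    # Equal: both facts start and end at the same time.
    if a.start == b.start and a.end == b.end:
        return "equal"
    # During: one interval lies entirely inside the other.
    if (b.start <= a.start and a.end <= b.end) or (
        a.start <= b.start and b.end <= a.end
    ):
        return "during"
    # Overlap: the intervals share some time, but neither contains the other.
    if a.start <= b.end and b.start <= a.end:
        return "overlap"
    # No shared time at all, so the facts are not co-temporal.
    return "disjoint"


def shared_span(a: Interval, b: Interval) -> Optional[Interval]:
    """Return the period during which both facts hold, or None if there is none."""
    start, end = max(a.start, b.start), min(a.end, b.end)
    return Interval(start, end) if start <= end else None


if __name__ == "__main__":
    job = Interval(2010, 2014)      # hypothetical fact: a person holds a position
    office = Interval(2012, 2018)   # hypothetical fact: another person holds office
    print(co_temporal_relation(job, office), shared_span(job, office))
```

Reducing a co-temporal question to endpoint comparisons of this kind is the sort of arithmetic step that the abstract's remark on mathematical reasoning appears to point at, although the authors' actual strategy is not described in this listing.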
Related papers
- ChronoSense: Exploring Temporal Understanding in Large Language Models with Time Intervals of Events [0.20132569095596248]
We present ChronoSense, a new benchmark for evaluating Large Language Models' temporal understanding.
We assess the performance of seven recent LLMs using this benchmark and the results indicate that models handle Allen relations, even symmetrical ones, quite differently.
Overall, the models' low performance highlights the need for improved temporal understanding in LLMs.
arXiv Detail & Related papers (2025-01-06T14:27:41Z)
- Training Large Language Models to Reason in a Continuous Latent Space [84.5618790930725]
We introduce a new paradigm, Coconut (Chain of Continuous Thought), to explore the potential of large language models (LLMs) to reason in an unrestricted latent space.
Experiments show that Coconut can effectively augment the LLM on several reasoning tasks.
These findings demonstrate the promise of latent reasoning and offer valuable insights for future research.
arXiv Detail & Related papers (2024-12-09T18:55:56Z)
- UnSeenTimeQA: Time-Sensitive Question-Answering Beyond LLMs' Memorization [34.257914212541394]
This paper introduces UnSeenTimeQA, a novel data contamination-free time-sensitive question-answering benchmark.
It differs from existing time-sensitive question-answering (TSQA) benchmarks by avoiding web-searchable queries grounded in the real world.
We present a series of time-sensitive event scenarios based on synthetically generated facts.
arXiv Detail & Related papers (2024-07-03T22:02:07Z)
- Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs [52.42505579545893]
Large language models (LLMs) demonstrate strong reasoning abilities when prompted to generate chain-of-thought explanations alongside answers.
We propose a novel discriminative and generative CoT evaluation paradigm to assess LLMs' knowledge of reasoning and the accuracy of the generated CoT.
arXiv Detail & Related papers (2024-02-17T05:22:56Z)
- TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models [29.656403397725395]
We propose TimeBench, a comprehensive hierarchical temporal reasoning benchmark.
TimeBench provides a thorough evaluation for investigating the temporal reasoning capabilities of large language models.
Our experimental results indicate a significant performance gap between the state-of-the-art LLMs and humans.
arXiv Detail & Related papers (2023-11-29T14:30:16Z)
- Towards Robust Temporal Reasoning of Large Language Models via a Multi-Hop QA Dataset and Pseudo-Instruction Tuning [73.51314109184197]
It is crucial for large language models (LLMs) to understand the concept of temporal knowledge.
We propose a complex temporal question-answering dataset Complex-TR that focuses on multi-answer and multi-hop temporal reasoning.
arXiv Detail & Related papers (2023-11-16T11:49:29Z)
- Reason for Future, Act for Now: A Principled Framework for Autonomous LLM Agents with Provable Sample Efficiency [53.8779374188643]
We propose a principled framework with provable regret guarantees to orchestrate reasoning and acting.
Specifically, we design a prompt template for reasoning that learns from the memory buffer and plans a future trajectory over a long horizon.
At each step, the LLM agent takes the initial action of the planned trajectory ("act for now"), stores the collected feedback in the memory buffer, and reinvokes the reasoning routine to replan the future trajectory from the new state.
arXiv Detail & Related papers (2023-09-29T16:36:39Z)
- Unlocking Temporal Question Answering for Large Language Models with Tailor-Made Reasoning Logic [84.59255070520673]
Large language models (LLMs) face a challenge when engaging in temporal reasoning.
We propose TempLogic, a novel framework designed specifically for temporal question-answering tasks.
arXiv Detail & Related papers (2023-05-24T10:57:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.