Timo: Towards Better Temporal Reasoning for Language Models
- URL: http://arxiv.org/abs/2406.14192v2
- Date: Mon, 19 Aug 2024 03:47:16 GMT
- Title: Timo: Towards Better Temporal Reasoning for Language Models
- Authors: Zhaochen Su, Jun Zhang, Tong Zhu, Xiaoye Qu, Juntao Li, Min Zhang, Yu Cheng,
- Abstract summary: Reasoning about time is essential for Large Language Models to understand the world.
We build a universal framework to handle a variety of temporal reasoning tasks.
We develop Timo, a model designed to excel in temporal reasoning at the 7B and 13B scales.
- Score: 38.27548375148604
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Reasoning about time is essential for Large Language Models (LLMs) to understand the world. Previous works focus on solving specific tasks, primarily on time-sensitive question answering. While these methods have proven effective, they cannot generalize to a wider spectrum of temporal reasoning tasks. Therefore, we propose a crucial question: Can we build a universal framework to handle a variety of temporal reasoning tasks? To that end, we systematically study 38 temporal reasoning tasks. Based on the observation that 19 tasks are directly related to mathematics, we first leverage the available mathematical dataset to set a solid foundation for temporal reasoning. However, the in-depth study indicates that focusing solely on mathematical enhancement falls short of addressing pure temporal reasoning tasks. To mitigate this limitation, we propose a simple but effective self-critic temporal optimization method to enhance the model's temporal reasoning capabilities without sacrificing general task abilities. Finally, we develop Timo, a model designed to excel in temporal reasoning at the 7B and 13B scales. Notably, Timo outperforms the counterpart LLMs by 10.0 and 7.6 in average accuracy scores and achieves the new state-of-the-art (SOTA) performance of comparable size. Extensive experiments further validate our framework's effectiveness and its generalization across diverse temporal tasks. The code is available at https://github.com/zhaochen0110/Timo.
Related papers
- On the Temporal Question-Answering Capabilities of Large Language Models Over Anonymized Data [1.2979906794584584]
The applicability of Large Language Models (LLMs) in temporal reasoning tasks over data that is not present during training is still a field that remains to be explored.
In this paper we work on this topic, focusing on structured and semi-structured anonymized data.
We identify and examined seventeen common temporal reasoning tasks in natural language, focusing on their algorithmic components.
arXiv Detail & Related papers (2025-04-10T10:48:42Z) - Learning to Reason Over Time: Timeline Self-Reflection for Improved Temporal Reasoning in Language Models [21.579319926212296]
Large Language Models (LLMs) have emerged as powerful tools for generating coherent text, understanding context, and performing reasoning tasks.
They struggle with temporal reasoning, which requires processing time-related information such as event sequencing, durations, and inter-temporal relationships.
We introduce TISER, a novel framework that enhances the temporal reasoning abilities of LLMs through a multi-stage process that combines timeline construction with iterative self-reflection.
arXiv Detail & Related papers (2025-04-07T16:51:45Z) - FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving [90.88021670297664]
FINEREASON is a logic-puzzle benchmark for evaluation of large language models' reasoning capabilities.
We introduce two tasks: state checking, and state transition, for a comprehensive evaluation of how models assess the current situation and plan the next move.
We show that models trained on our state checking and transition data demonstrate gains in math reasoning by up to 5.1% on GSM8K.
arXiv Detail & Related papers (2025-02-27T16:23:25Z) - Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [113.49074603075032]
Recent studies have shown that making a model spend more time thinking through longer Chain of Thoughts (CoTs) enables it to gain significant improvements in complex reasoning tasks.
We explore whether scaling with longer CoTs can indeed impair the reasoning performance of Large Language Models (LLMs) in certain domains.
arXiv Detail & Related papers (2025-02-25T10:48:05Z) - ChronoSense: Exploring Temporal Understanding in Large Language Models with Time Intervals of Events [0.20132569095596248]
We present ChronoSense, a new benchmark for evaluating Large Language Models' temporal understanding.
We assess the performance of seven recent LLMs using this benchmark and the results indicate that models handle Allen relations, even symmetrical ones, quite differently.
Overall, the models' low performance highlights the need for improved temporal understanding in LLMs.
arXiv Detail & Related papers (2025-01-06T14:27:41Z) - Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models.
We first propose AutoMathCritique, an automated and scalable framework for collecting critique data.
We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z) - Narrative-of-Thought: Improving Temporal Reasoning of Large Language Models via Recounted Narratives [6.631626634132574]
We study an essential task of temporal reasoning -- temporal graph generation.
We show that this task presents great challenges even for the most powerful language models.
We propose a new prompting technique tailored for temporal reasoning, Narrative-of-Thought.
arXiv Detail & Related papers (2024-10-07T23:36:05Z) - Living in the Moment: Can Large Language Models Grasp Co-Temporal Reasoning? [70.19200858203388]
Temporal reasoning is fundamental for large language models to comprehend the world.
CoTempQA is a benchmark containing four co-temporal scenarios.
Our experiments reveal a significant gap between the performance of current LLMs and human-level reasoning.
arXiv Detail & Related papers (2024-06-13T12:56:21Z) - TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models [29.656403397725395]
We propose TimeBench, a comprehensive hierarchical temporal reasoning benchmark.
TimeBench provides a thorough evaluation for investigating the temporal reasoning capabilities of large language models.
Our experimental results indicate a significant performance gap between the state-of-the-art LLMs and humans.
arXiv Detail & Related papers (2023-11-29T14:30:16Z) - Towards Robust Temporal Reasoning of Large Language Models via a Multi-Hop QA Dataset and Pseudo-Instruction Tuning [73.51314109184197]
It is crucial for large language models (LLMs) to understand the concept of temporal knowledge.
We propose a complex temporal question-answering dataset Complex-TR that focuses on multi-answer and multi-hop temporal reasoning.
arXiv Detail & Related papers (2023-11-16T11:49:29Z) - Back to the Future: Towards Explainable Temporal Reasoning with Large
Language Models [33.8108950744839]
We introduce the first task of explainable temporal reasoning, to predict an event's occurrence at a future timestamp based on context.
We show that our method achieves the state-of-the-art performance of temporal prediction and explanation.
arXiv Detail & Related papers (2023-10-02T10:35:23Z) - An Overview Of Temporal Commonsense Reasoning and Acquisition [20.108317515225504]
Temporal commonsense reasoning refers to the ability to understand the typical temporal context of phrases, actions, and events.
Recent research on the performance of large language models suggests that they often take shortcuts in their reasoning and fall prey to simple linguistic traps.
arXiv Detail & Related papers (2023-07-28T01:30:15Z) - Unlocking Temporal Question Answering for Large Language Models with Tailor-Made Reasoning Logic [84.59255070520673]
Large language models (LLMs) face a challenge when engaging in temporal reasoning.
We propose TempLogic, a novel framework designed specifically for temporal question-answering tasks.
arXiv Detail & Related papers (2023-05-24T10:57:53Z) - oLMpics -- On what Language Model Pre-training Captures [84.60594612120173]
We propose eight reasoning tasks, which require operations such as comparison, conjunction, and composition.
A fundamental challenge is to understand whether the performance of a LM on a task should be attributed to the pre-trained representations or to the process of fine-tuning on the task data.
arXiv Detail & Related papers (2019-12-31T12:11:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.