Learning to Reason Over Time: Timeline Self-Reflection for Improved Temporal Reasoning in Language Models
- URL: http://arxiv.org/abs/2504.05258v2
- Date: Fri, 30 May 2025 15:37:19 GMT
- Title: Learning to Reason Over Time: Timeline Self-Reflection for Improved Temporal Reasoning in Language Models
- Authors: Adrián Bazaga, Rexhina Blloshmi, Bill Byrne, Adrià de Gispert
- Abstract summary: Large Language Models (LLMs) have emerged as powerful tools for generating coherent text, understanding context, and performing reasoning tasks. They struggle with temporal reasoning, which requires processing time-related information such as event sequencing, durations, and inter-temporal relationships. We introduce TISER, a novel framework that enhances the temporal reasoning abilities of LLMs through a multi-stage process that combines timeline construction with iterative self-reflection.
- Score: 21.579319926212296
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have emerged as powerful tools for generating coherent text, understanding context, and performing reasoning tasks. However, they struggle with temporal reasoning, which requires processing time-related information such as event sequencing, durations, and inter-temporal relationships. These capabilities are critical for applications including question answering, scheduling, and historical analysis. In this paper, we introduce TISER, a novel framework that enhances the temporal reasoning abilities of LLMs through a multi-stage process that combines timeline construction with iterative self-reflection. Our approach leverages test-time scaling to extend the length of reasoning traces, enabling models to capture complex temporal dependencies more effectively. This strategy not only boosts reasoning accuracy but also improves the traceability of the inference process. Experimental results demonstrate state-of-the-art performance across multiple benchmarks, including out-of-distribution test sets, and reveal that TISER enables smaller open-source models to surpass larger closed-weight models on challenging temporal reasoning tasks.
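The abstract describes TISER only at a high level. As a rough illustration, the following is a minimal sketch, not the authors' implementation, of what a timeline-construction plus iterative self-reflection loop could look like. The `complete` function is a hypothetical stand-in for any LLM text-generation call, and all prompt wording and stage boundaries are assumptions rather than the paper's exact design.

```python
# Minimal sketch of a TISER-style loop (timeline construction followed by
# iterative self-reflection). `complete` is a hypothetical placeholder for
# any LLM client; the prompts below are illustrative, not the paper's own.

def complete(prompt: str) -> str:
    """Placeholder for an LLM text-generation call."""
    raise NotImplementedError("plug in a model client here")

def tiser_answer(context: str, question: str, max_reflections: int = 2) -> str:
    # Stage 1: make the temporal structure explicit as a timeline.
    timeline = complete(
        f"Context:\n{context}\n\n"
        "Extract every event with its date or duration and list them "
        "as a chronologically ordered timeline."
    )
    # Stage 2: reason over the timeline to produce an initial answer.
    answer = complete(
        f"Timeline:\n{timeline}\n\nQuestion: {question}\n"
        "Reason step by step over the timeline, then state the answer."
    )
    # Stage 3: iterative self-reflection; stop early once the critique
    # finds no inconsistency with the timeline.
    for _ in range(max_reflections):
        critique = complete(
            f"Timeline:\n{timeline}\nQuestion: {question}\n"
            f"Proposed answer:\n{answer}\n"
            "Check the answer against the timeline. Reply OK if it is "
            "consistent; otherwise describe the temporal error."
        )
        if critique.strip().upper().startswith("OK"):
            break
        answer = complete(
            f"Timeline:\n{timeline}\nQuestion: {question}\n"
            f"Previous answer:\n{answer}\nCritique:\n{critique}\n"
            "Revise the answer to resolve the temporal error."
        )
    return answer
```

The reflection loop is where test-time scaling enters: each extra round lengthens the reasoning trace, giving the model another chance to catch ordering or duration mistakes before committing to an answer.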
Related papers
- VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models [21.438802784706994]
We propose VisRef, a visually grounded test-time scaling framework. Our key idea is to actively guide the reasoning process by re-injecting a coreset of visual tokens. Under fixed test-time compute budgets, VisRef consistently outperforms existing test-time scaling approaches by up to 6.4%.
arXiv Detail & Related papers (2026-02-27T11:48:19Z) - UniT: Unified Multimodal Chain-of-Thought Test-time Scaling [85.590774707406]
Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. We introduce UniT, a framework for multimodal test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds.
arXiv Detail & Related papers (2026-02-12T18:59:49Z) - Timely Machine: Awareness of Time Makes Test-Time Scaling Agentic [72.97800570813175]
We propose Timely Machine, redefining test-time as wall-clock time. We introduce Timely-Eval, a benchmark spanning high-frequency tool calls, low-frequency tool calls, and time-constrained reasoning. We find smaller models excel with fast feedback through more interactions, while larger models dominate high-latency settings via superior interaction quality.
arXiv Detail & Related papers (2026-01-23T06:28:52Z) - Enhancing Temporal Awareness in LLMs for Temporal Point Processes [53.596733432865626]
Temporal point processes (TPPs) are crucial for analyzing events over time. TPP-TAL is a novel plug-and-play framework designed to enhance temporal reasoning within large language models. TPP-TAL delivers substantial improvements in temporal likelihood estimation and event prediction accuracy.
arXiv Detail & Related papers (2025-12-29T03:01:24Z) - TimeSense: Making Large Language Models Proficient in Time-Series Analysis [26.44226032396234]
In the time-series domain, an increasing number of works combine text with temporal data to leverage the reasoning capabilities of large language models. We propose TimeSense, a framework that makes LLMs proficient in time-series analysis by balancing textual reasoning with a preserved temporal sense. TimeSense achieves state-of-the-art performance across multiple tasks, and it particularly outperforms existing methods on complex multi-dimensional time-series reasoning tasks.
arXiv Detail & Related papers (2025-11-09T12:00:18Z) - MemoTime: Memory-Augmented Temporal Knowledge Graph Enhanced Large Language Model Reasoning [22.89546852658161]
Temporal Knowledge Graphs (TKGs) offer a reliable source for temporal reasoning. Existing TKG-based LLM reasoning methods still struggle with four major challenges. We propose MemoTime, a memory-augmented temporal knowledge graph framework.
arXiv Detail & Related papers (2025-10-15T14:43:31Z) - Test-time Prompt Intervention [16.9160718076699]
We propose PI, a novel framework for Test-time Prompt Intervention. PI provides an interface to dynamically guide and regulate reasoning paths during inference. This allows human problem-solving expertise and cognitive science principles to be seamlessly integrated into LLMs' reasoning processes.
arXiv Detail & Related papers (2025-08-04T15:17:13Z) - A Modular Multitask Reasoning Framework Integrating Spatio-temporal Models and LLMs [38.304628241767055]
We introduce STReason, a framework that integrates large language models with analytical capabilities for multi-task inference and execution. We show that STReason significantly outperforms LLM baselines across all metrics, particularly excelling in complex, reasoning-intensive spatio-temporal scenarios. Human evaluations validate STReason's credibility and practical utility, demonstrating its potential to reduce expert workload and broaden applicability to real-world, multi-faceted decision scenarios.
arXiv Detail & Related papers (2025-06-25T00:55:34Z) - LLM-Symbolic Integration for Robust Temporal Tabular Reasoning [69.27153114778748]
We introduce TempTabQA-C, a synthetic dataset designed for systematic and controlled evaluations. This structured approach allows Large Language Models (LLMs) to generate and execute SQL queries, enhancing generalization and mitigating biases.
arXiv Detail & Related papers (2025-06-06T05:14:04Z) - Enhancing LLM Reasoning for Time Series Classification by Tailored Thinking and Fused Decision [8.256998757769322]
ReasonTSC is a framework designed to leverage LLM reasoning for time series classification. It steers the model to think over the essential characteristics of time series data. It integrates predictions and confidence scores from plug-in classifiers, e.g., domain-specific time series models, as in-context examples.
arXiv Detail & Related papers (2025-06-01T03:15:54Z) - Think Deep, Think Fast: Investigating Efficiency of Verifier-free Inference-time-scaling Methods [39.89239733570008]
This work conducts a comprehensive analysis of inference-time scaling methods for both reasoning and non-reasoning models.
We find that non-reasoning models, even with an extremely high inference budget, still fall substantially behind reasoning models.
For reasoning models, majority voting proves to be a robust inference strategy, generally competitive with or outperforming other, more sophisticated inference-time compute (ITC) methods.
arXiv Detail & Related papers (2025-04-18T19:32:55Z) - On the Temporal Question-Answering Capabilities of Large Language Models Over Anonymized Data [1.2979906794584584]
The applicability of Large Language Models (LLMs) to temporal reasoning tasks over data not present during training remains largely unexplored.
In this paper we work on this topic, focusing on structured and semi-structured anonymized data.
We identify and examine seventeen common temporal reasoning tasks in natural language, focusing on their algorithmic components.
arXiv Detail & Related papers (2025-04-10T10:48:42Z) - Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models [54.04678363287392]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks.
Recent advancements in OpenAI o1 and DeepSeek-R1 have further improved performance in System-2 reasoning domains.
arXiv Detail & Related papers (2025-03-20T17:59:38Z) - Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [113.49074603075032]
Recent studies have shown that making a model spend more time thinking through longer Chains of Thought (CoTs) enables it to gain significant improvements in complex reasoning tasks. However, we explore whether scaling with longer CoTs can in fact impair the reasoning performance of Large Language Models (LLMs) in certain domains.
arXiv Detail & Related papers (2025-02-25T10:48:05Z) - Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights [49.42133807824413]
We examine the reasoning and planning capabilities of large language models (LLMs) in solving complex tasks.
Recent advances in inference-time techniques demonstrate the potential to enhance LLM reasoning without additional training.
OpenAI's o1 model shows promising performance through its novel use of multi-step reasoning and verification.
arXiv Detail & Related papers (2025-02-18T04:11:29Z) - ChronoSense: Exploring Temporal Understanding in Large Language Models with Time Intervals of Events [0.20132569095596248]
We present ChronoSense, a new benchmark for evaluating Large Language Models' temporal understanding.
We assess the performance of seven recent LLMs using this benchmark; the results indicate that models handle Allen's interval relations, even symmetrical ones, quite differently (a worked sketch of these relations follows the list below).
Overall, the models' low performance highlights the need for improved temporal understanding in LLMs.
arXiv Detail & Related papers (2025-01-06T14:27:41Z) - Analyzing Temporal Complex Events with Large Language Models? A Benchmark towards Temporal, Long Context Understanding [57.62275091656578]
We refer to a complex event composed of many news articles over an extended period as a Temporal Complex Event (TCE). This paper proposes a novel approach using Large Language Models (LLMs) to systematically extract and analyze the event chain within a TCE.
arXiv Detail & Related papers (2024-06-04T16:42:17Z) - Spurious Feature Eraser: Stabilizing Test-Time Adaptation for Vision-Language Foundation Model [86.9619638550683]
Vision-language foundation models have exhibited remarkable success across a multitude of downstream tasks due to their scalability on extensive image-text paired data. However, these models display significant limitations when applied to downstream tasks, such as fine-grained image classification, as a result of "decision shortcuts".
arXiv Detail & Related papers (2024-03-01T09:01:53Z) - Towards Robust Temporal Reasoning of Large Language Models via a Multi-Hop QA Dataset and Pseudo-Instruction Tuning [73.51314109184197]
It is crucial for large language models (LLMs) to understand the concept of temporal knowledge.
We propose a complex temporal question-answering dataset Complex-TR that focuses on multi-answer and multi-hop temporal reasoning.
arXiv Detail & Related papers (2023-11-16T11:49:29Z) - TRAM: Benchmarking Temporal Reasoning for Large Language Models [12.112914393948415]
We introduce TRAM, a temporal reasoning benchmark composed of ten datasets.
We evaluate popular language models like GPT-4 and Llama2 in zero-shot and few-shot scenarios.
Our findings indicate that the best-performing model lags significantly behind human performance.
arXiv Detail & Related papers (2023-10-02T00:59:07Z) - Unlocking Temporal Question Answering for Large Language Models with Tailor-Made Reasoning Logic [84.59255070520673]
Large language models (LLMs) face a challenge when engaging in temporal reasoning.
We propose TempLogic, a novel framework designed specifically for temporal question-answering tasks.
arXiv Detail & Related papers (2023-05-24T10:57:53Z)
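As a concrete reference for the Allen interval relations that the ChronoSense entry above evaluates, here is a minimal, self-contained sketch (not code from any of the listed papers) that classifies which of Allen's thirteen relations holds between two events; the representation and function name are illustrative.

```python
# Classify which of Allen's thirteen interval relations holds between
# two events, each given as a (start, end) pair with start < end.
# Illustrative helper only, not taken from any paper listed above.

def allen_relation(a: tuple[float, float], b: tuple[float, float]) -> str:
    (a1, a2), (b1, b2) = a, b
    if a2 < b1:
        return "before"
    if a2 == b1:
        return "meets"
    if b2 < a1:
        return "after"
    if b2 == a1:
        return "met-by"
    if a1 == b1 and a2 == b2:
        return "equal"
    if a1 == b1:
        return "starts" if a2 < b2 else "started-by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    # Remaining case: proper overlap on one side or the other.
    return "overlaps" if a1 < b1 else "overlapped-by"
```

For example, allen_relation((1, 3), (3, 5)) returns "meets", while swapping the arguments returns "met-by"; such symmetrical pairs are exactly the distinctions ChronoSense reports models handling inconsistently.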