SPAN: Benchmarking and Improving Cross-Calendar Temporal Reasoning of Large Language Models
- URL: http://arxiv.org/abs/2511.09993v1
- Date: Fri, 14 Nov 2025 01:24:48 GMT
- Title: SPAN: Benchmarking and Improving Cross-Calendar Temporal Reasoning of Large Language Models
- Authors: Zhongjian Miao, Hao Fu, Chen Wei,
- Abstract summary: We introduce SPAN, a cross-calendar temporal reasoning benchmark. SPAN features ten cross-calendar temporal reasoning directions, two reasoning types, and two question formats across six calendars. To enable time-variant and contamination-free evaluation, we propose a template-driven protocol for dynamic instance generation.
- Score: 7.437301045895224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce SPAN, a cross-calendar temporal reasoning benchmark, which requires LLMs to perform intra-calendar temporal reasoning and inter-calendar temporal conversion. SPAN features ten cross-calendar temporal reasoning directions, two reasoning types, and two question formats across six calendars. To enable time-variant and contamination-free evaluation, we propose a template-driven protocol for dynamic instance generation that enables assessment on a user-specified Gregorian date. We conduct extensive experiments on both open- and closed-source state-of-the-art (SOTA) LLMs over a range of dates spanning 100 years from 1960 to 2060. Our evaluations show that these LLMs achieve an average accuracy of only 34.5%, with none exceeding 80%, indicating that this task remains challenging. Through in-depth analysis of reasoning types, question formats, and temporal reasoning directions, we identify two key obstacles for LLMs: Future-Date Degradation and Calendar Asymmetry Bias. To strengthen LLMs' cross-calendar temporal reasoning capability, we further develop an LLM-powered Time Agent that leverages tool-augmented code generation. Empirical results show that Time Agent achieves an average accuracy of 95.31%, outperforming several competitive baselines, highlighting the potential of tool-augmented code generation to advance cross-calendar temporal reasoning. We hope this work will inspire further efforts toward more temporally and culturally adaptive LLMs.
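Both the template-driven instance-generation protocol and the tool-augmented Time Agent rest on the same idea: cross-calendar answers should come from deterministic calendar arithmetic rather than memorized date facts. The paper's actual templates, calendars, and toolset are not reproduced in this abstract, so the sketch below is only illustrative: it assumes a hypothetical question template and uses the Julian calendar (convertible with standard Julian Day Number arithmetic) to show how an instance might be generated dynamically for a user-specified Gregorian date, with the gold answer computed by the kind of conversion routine a Time Agent's generated code would call.

```python
from datetime import date

def gregorian_to_jdn(y: int, m: int, d: int) -> int:
    """Gregorian calendar date -> Julian Day Number (standard integer formula)."""
    a = (14 - m) // 12
    y2 = y + 4800 - a
    m2 = m + 12 * a - 3
    return d + (153 * m2 + 2) // 5 + 365 * y2 + y2 // 4 - y2 // 100 + y2 // 400 - 32045

def jdn_to_julian(jdn: int) -> tuple[int, int, int]:
    """Julian Day Number -> (year, month, day) in the Julian calendar (Richards' algorithm)."""
    f = jdn + 1401
    e = 4 * f + 3
    g = (e % 1461) // 4
    h = 5 * g + 2
    day = (h % 153) // 5 + 1
    month = (h // 153 + 2) % 12 + 1
    year = e // 1461 - 4716 + (14 - month) // 12
    return year, month, day

# Hypothetical template -- SPAN's real templates, reasoning types, and six calendars differ.
TEMPLATE = ("Today's Gregorian date is {greg}. "
            "What is the corresponding date in the Julian calendar?")

def make_instance(greg: date) -> dict:
    """Generate one cross-calendar QA instance for a user-specified Gregorian date."""
    y, m, d = jdn_to_julian(gregorian_to_jdn(greg.year, greg.month, greg.day))
    return {
        "question": TEMPLATE.format(greg=greg.isoformat()),
        "gold_answer": f"{y:04d}-{m:02d}-{d:02d}",  # computed on the fly, never stored
    }

if __name__ == "__main__":
    print(make_instance(date(2025, 11, 14)))
    # -> gold_answer '2025-11-01' (the Julian calendar currently lags the Gregorian by 13 days)
```

Because each instance is regenerated for whatever evaluation date is supplied, there is no fixed answer key to leak into training data, which is the contamination-free property the protocol targets; the Time Agent described in the paper similarly delegates the conversion step to generated code that calls such tools, which is what lifts its accuracy to 95.31%.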
Related papers
- TSAQA: Time Series Analysis Question And Answering Benchmark [85.35545785252309]
Time series data are integral to critical applications across domains such as finance, healthcare, transportation, and environmental science.
We introduce TSAQA, a novel unified benchmark designed to broaden task coverage and evaluate diverse temporal analysis capabilities.
arXiv Detail & Related papers (2026-01-30T17:28:56Z)
- PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning [50.81994347448835]
We propose PEARL, a reinforcement-learning framework that augments a language agent with an external memory module and an optimized round-wise reward design.
Experiments on CalBench show that PEARL achieves an error reduction rate of 0.76 and a 55% improvement in average error rate over the strongest baseline.
arXiv Detail & Related papers (2026-01-17T08:19:18Z)
- MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization [103.74675519953898]
Long-chain reflective reasoning is a prerequisite for solving complex real-world problems.
We build a benchmark consisting of 1,260 samples across 42 challenging synthetic tasks.
We generate post-training data and explore learning paradigms for exploiting such data.
arXiv Detail & Related papers (2025-10-09T17:53:58Z)
- ExAnte: A Benchmark for Ex-Ante Inference in Large Language Models [12.948099229475265]
Large language models (LLMs) face significant challenges in ex-ante reasoning.
Even with explicit prompts enforcing temporal cutoffs, LLMs often generate outputs influenced by internalized knowledge of events beyond the specified cutoff.
This paper introduces a novel task and benchmark designed to evaluate the ability of LLMs to reason while adhering to such temporal constraints.
arXiv Detail & Related papers (2025-05-26T05:39:57Z)
- Temporal Alignment of LLMs through Cycle Encoding for Long-Range Time Representations [57.01193643163492]
Large language models (LLMs) suffer from temporal misalignment issues, especially across long spans of time.
This paper proposes a methodology named "Ticktack" for addressing LLMs' long-time-span misalignment in a yearly setting.
arXiv Detail & Related papers (2025-03-06T06:59:09Z)
- Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle [13.192628306219248]
We propose using future event prediction as a continuous evaluation method to assess Large Language Models' temporal generalization and forecasting abilities.
Our benchmark, Daily Oracle, automatically generates question-answer pairs from daily news, challenging LLMs to predict "future" event outcomes.
arXiv Detail & Related papers (2024-11-13T04:20:20Z)
- A Comprehensive Evaluation of Large Language Models on Temporal Event Forecasting [45.0261082985087]
We conduct a comprehensive evaluation of Large Language Models (LLMs) for temporal event forecasting.
We find that fine-tuning LLMs with raw texts can significantly improve performance.
However, issues such as popularity bias and the long-tail problem persist in LLMs.
arXiv Detail & Related papers (2024-07-16T11:58:54Z)
- Living in the Moment: Can Large Language Models Grasp Co-Temporal Reasoning? [70.19200858203388]
Temporal reasoning is fundamental for large language models to comprehend the world.
CoTempQA is a benchmark containing four co-temporal scenarios.
Our experiments reveal a significant gap between the performance of current LLMs and human-level reasoning.
arXiv Detail & Related papers (2024-06-13T12:56:21Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations more fully reveal how well language models understand the questions posed to them.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.