Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties
- URL: http://arxiv.org/abs/2502.16922v1
- Date: Mon, 24 Feb 2025 07:27:54 GMT
- Title: Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties
- Authors: Zhenglin Wang, Jialong Wu, Pengfei Li, Yong Jiang, Deyu Zhou
- Abstract summary: We introduce Chinese Time Reasoning (CTM), a benchmark to evaluate Large Language Models on temporal reasoning. CTM emphasizes cross-entity relationships, pairwise temporal alignment, and contextualized and culturally-grounded reasoning.
- Score: 38.87423278027958
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal reasoning is fundamental to human cognition and is crucial for various real-world applications. While recent advances in Large Language Models have demonstrated promising capabilities in temporal reasoning, existing benchmarks primarily rely on rule-based construction, lack contextual depth, and involve a limited range of temporal entities. To address these limitations, we introduce Chinese Time Reasoning (CTM), a benchmark designed to evaluate LLMs on temporal reasoning within the extensive scope of Chinese dynastic chronology. CTM emphasizes cross-entity relationships, pairwise temporal alignment, and contextualized and culturally-grounded reasoning, providing a comprehensive evaluation. Extensive experimental results reveal the challenges posed by CTM and highlight potential avenues for improvement.
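At its core, CTM's pairwise temporal alignment reduces to asking whether two entities' time spans intersect. The minimal sketch below illustrates that reduction as interval overlap; the figures and dates are well-known historical facts used for illustration, not items from the CTM benchmark.

```python
# Minimal sketch of pairwise temporal alignment in the style CTM evaluates:
# could two historical figures have coexisted? Dates are illustrative and
# not taken from the CTM benchmark itself.

def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """Closed intervals [a0, a1] and [b0, b1] overlap iff a0 <= b1 and b0 <= a1."""
    return a[0] <= b[1] and b[0] <= a[1]

# Lifespans in CE years (well-known historical dates).
li_bai = (701, 762)     # Tang dynasty poet
du_fu = (712, 770)      # Tang dynasty poet
su_shi = (1037, 1101)   # Song dynasty poet

print(overlaps(li_bai, du_fu))   # True: their lifetimes intersect
print(overlaps(li_bai, su_shi))  # False: separated by centuries
```

Actual benchmark items layer contextual and culture-specific cues on top of this bare date arithmetic.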
Related papers
- A Modular Multitask Reasoning Framework Integrating Spatio-temporal Models and LLMs [38.304628241767055]
We introduce STReason, a framework that integrates large language models with analytical capabilities for multi-task inference and execution.
We show that STReason significantly outperforms LLM baselines across all metrics, particularly excelling in complex, reasoning-intensive spatio-temporal scenarios.
Human evaluations validate STReason's credibility and practical utility, demonstrating its potential to reduce expert workload and broaden applicability to real-world, multi-faceted decision scenarios.
arXiv Detail & Related papers (2025-06-25T00:55:34Z)
- Enhancing LLM Reasoning for Time Series Classification by Tailored Thinking and Fused Decision [8.256998757769322]
ReasonTSC is a framework designed to leverage LLM reasoning for time series classification.
It steers the model to think over the essential characteristics of time series data.
It integrates predictions and confidence scores from plug-in classifiers, e.g., domain-specific time series models, as in-context examples.
arXiv Detail & Related papers (2025-06-01T03:15:54Z)
- Can Slow-thinking LLMs Reason Over Time? Empirical Studies in Time Series Forecasting [17.73769436497384]
Time series forecasting (TSF) is a fundamental and widely studied task, spanning methods from classical statistical approaches to modern deep learning and multimodal language modeling.
Meanwhile, emerging slow-thinking LLMs have demonstrated impressive multi-step reasoning capabilities across diverse domains.
This motivates a key question: can slow-thinking LLMs effectively reason over temporal patterns to support time series forecasting, even in a zero-shot manner?
arXiv Detail & Related papers (2025-05-30T12:19:02Z)
- Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency [59.05753942719665]
We propose a novel temporal robustness benchmark (TemRobBench) to assess the robustness of models.
We evaluate 16 mainstream LMMs and find that they exhibit over-reliance on prior knowledge and textual context in adversarial environments.
We design panoramic direct preference optimization (PanoDPO) to encourage LMMs to incorporate both visual and linguistic feature preferences simultaneously.
arXiv Detail & Related papers (2025-05-20T14:18:56Z)
- Learning to Reason Over Time: Timeline Self-Reflection for Improved Temporal Reasoning in Language Models [21.579319926212296]
Large Language Models (LLMs) have emerged as powerful tools for generating coherent text, understanding context, and performing reasoning tasks.
However, they struggle with temporal reasoning, which requires processing time-related information such as event sequencing, durations, and inter-temporal relationships.
We introduce TISER, a novel framework that enhances the temporal reasoning abilities of LLMs through a multi-stage process that combines timeline construction with iterative self-reflection.
arXiv Detail & Related papers (2025-04-07T16:51:45Z)
- DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels [86.93099925711388]
We propose DetectiveQA, a dataset specifically designed for narrative reasoning within long contexts.
We leverage detective novels, averaging over 100k tokens, to create a dataset containing 1200 human-annotated questions in both Chinese and English.
arXiv Detail & Related papers (2024-09-04T06:28:22Z)
- Living in the Moment: Can Large Language Models Grasp Co-Temporal Reasoning? [70.19200858203388]
Temporal reasoning is fundamental for large language models to comprehend the world.
CoTempQA is a benchmark containing four co-temporal scenarios.
Our experiments reveal a significant gap between the performance of current LLMs and human-level reasoning.
arXiv Detail & Related papers (2024-06-13T12:56:21Z)
- FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE).
FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary.
Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation.
arXiv Detail & Related papers (2024-03-04T17:57:18Z)
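As a rough sketch of the FENICE-style alignment above: each extracted claim is scored against the source sentences with an NLI model, and the best match per claim is aggregated. The `entailment_score` below is a toy lexical stand-in, not FENICE's actual model or API.

```python
# Hedged sketch of NLI-based claim-to-source alignment in the spirit of
# FENICE. `entailment_score` is a toy stand-in; a real implementation
# would query an NLI model for P(sentence entails claim).

def entailment_score(premise: str, hypothesis: str) -> float:
    # Toy proxy: fraction of hypothesis tokens also found in the premise.
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    return len(p & h) / max(len(h), 1)

def factuality_score(source_sentences: list[str], claims: list[str]) -> float:
    """Average over claims of the best alignment against any source sentence."""
    return sum(
        max(entailment_score(sent, claim) for sent in source_sentences)
        for claim in claims
    ) / len(claims)

source = ["The model was trained on 1M examples.", "Training took three days."]
claims = ["Training lasted three days."]
print(factuality_score(source, claims))  # toy score in [0, 1]
```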
- Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing [61.98556945939045]
We propose a framework to learn planning-based reasoning through Direct Preference Optimization (DPO) on collected trajectories.
Our results on challenging logical reasoning benchmarks demonstrate the effectiveness of our learning framework.
arXiv Detail & Related papers (2024-02-01T15:18:33Z)
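For context, the DPO objective that such trajectory-based training builds on fits in a few lines; the sketch below is the standard DPO loss over preference pairs, not the paper's full trajectory-collection framework.

```python
# Standard DPO loss (Rafailov et al., 2023), sketched for reference; the
# paper applies DPO to collected reasoning trajectories, which is not
# reproduced here. Inputs are per-example sequence log-probabilities.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigmoid(beta * (policy margin - reference margin)), averaged."""
    margin = (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Toy usage with made-up log-probs for two preference pairs.
loss = dpo_loss(torch.tensor([-4.0, -3.5]), torch.tensor([-5.0, -4.8]),
                torch.tensor([-4.2, -3.9]), torch.tensor([-4.9, -4.5]))
print(loss.item())
```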
- TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models [29.656403397725395]
We propose TimeBench, a comprehensive hierarchical temporal reasoning benchmark.
TimeBench provides a thorough evaluation for investigating the temporal reasoning capabilities of large language models.
Our experimental results indicate a significant performance gap between the state-of-the-art LLMs and humans.
arXiv Detail & Related papers (2023-11-29T14:30:16Z)
- Towards Robust Temporal Reasoning of Large Language Models via a Multi-Hop QA Dataset and Pseudo-Instruction Tuning [73.51314109184197]
It is crucial for large language models (LLMs) to understand the concept of temporal knowledge.
We propose Complex-TR, a complex temporal question-answering dataset focused on multi-answer and multi-hop temporal reasoning.
arXiv Detail & Related papers (2023-11-16T11:49:29Z)
- DetermLR: Augmenting LLM-based Logical Reasoning from Indeterminacy to Determinacy [76.58614128865652]
We propose DetermLR, a novel perspective that rethinks the reasoning process as an evolution from indeterminacy to determinacy.
First, we categorize known conditions into two types: determinate and indeterminate premises. This provides an overall direction for the reasoning process and guides LLMs in converting indeterminate data into progressively determinate insights.
We automate the storage and extraction of available premises and reasoning paths with reasoning memory, preserving historical reasoning details for subsequent reasoning steps.
arXiv Detail & Related papers (2023-10-28T10:05:51Z)
- Exploring Progress in Multivariate Time Series Forecasting: Comprehensive Benchmarking and Heterogeneity Analysis [70.78170766633039]
We address the need for reliable and fair assessment of multivariate time series (MTS) forecasting proposals.
BasicTS+ is a benchmark designed to enable fair, comprehensive, and reproducible comparison of MTS forecasting solutions.
We apply BasicTS+ along with rich datasets to assess the capabilities of more than 45 MTS forecasting solutions.
arXiv Detail & Related papers (2023-10-09T19:52:22Z)
- TRAM: Benchmarking Temporal Reasoning for Large Language Models [12.112914393948415]
We introduce TRAM, a temporal reasoning benchmark composed of ten datasets.
We evaluate popular language models like GPT-4 and Llama2 in zero-shot and few-shot scenarios.
Our findings indicate that the best-performing model lags significantly behind human performance.
arXiv Detail & Related papers (2023-10-02T00:59:07Z)
- Unlocking Temporal Question Answering for Large Language Models with Tailor-Made Reasoning Logic [84.59255070520673]
Large language models (LLMs) face a challenge when engaging in temporal reasoning.
We propose TempLogic, a novel framework designed specifically for temporal question-answering tasks.
arXiv Detail & Related papers (2023-05-24T10:57:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.