Towards Benchmarking and Improving the Temporal Reasoning Capability of
Large Language Models
- URL: http://arxiv.org/abs/2306.08952v2
- Date: Tue, 27 Jun 2023 05:39:25 GMT
- Authors: Qingyu Tan, Hwee Tou Ng, Lidong Bing
- Abstract summary: We introduce a comprehensive probing dataset TempReason to evaluate the temporal reasoning capability of large language models.
Our dataset includes questions of three temporal reasoning levels.
We also propose a novel learning framework to improve the temporal reasoning capability of large language models.
- Score: 44.670550143705746
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reasoning about time is of fundamental importance. Many facts are
time-dependent. For example, athletes change teams from time to time, and
different government officials are elected periodically. Previous
time-dependent question answering (QA) datasets tend to be biased in either
their coverage of time spans or question types. In this paper, we introduce a
comprehensive probing dataset TempReason to evaluate the temporal reasoning
capability of large language models. Our dataset includes questions of three
temporal reasoning levels. In addition, we propose a novel learning
framework to improve the temporal reasoning capability of large language
models, based on temporal span extraction and time-sensitive reinforcement
learning. We conducted experiments in closed book QA, open book QA, and
reasoning QA settings and demonstrated the effectiveness of our approach. Our
code and data are released on https://github.com/DAMO-NLP-SG/TempReason.
Related papers
- Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time [0.0]
We introduce a novel dataset designed to rigorously test large language models' ability to handle time-sensitive facts.
Our benchmark offers a systematic way to measure how well LLMs align their knowledge with the correct time context.
arXiv Detail & Related papers (2024-09-20T08:57:20Z)
- Living in the Moment: Can Large Language Models Grasp Co-Temporal Reasoning? [70.19200858203388]
Temporal reasoning is fundamental for large language models to comprehend the world.
CoTempQA is a benchmark containing four co-temporal scenarios.
Our experiments reveal a significant gap between the performance of current LLMs and human-level reasoning.
arXiv Detail & Related papers (2024-06-13T12:56:21Z)
- Language Models Still Struggle to Zero-shot Reason about Time Series [11.764833497297493]
Time series are critical for decision-making in fields like finance and healthcare.
It remains unknown whether non-trivial forecasting implies that language models can reason about time series.
We generate a first-of-its-kind evaluation framework for time series reasoning.
arXiv Detail & Related papers (2024-04-17T21:27:33Z)
- Towards Robust Temporal Reasoning of Large Language Models via a Multi-Hop QA Dataset and Pseudo-Instruction Tuning [73.51314109184197]
It is crucial for large language models (LLMs) to understand the concept of temporal knowledge.
We propose a complex temporal question-answering dataset Complex-TR that focuses on multi-answer and multi-hop temporal reasoning.
arXiv Detail & Related papers (2023-11-16T11:49:29Z)
- Time-Aware Representation Learning for Time-Sensitive Question Answering [19.822549681087107]
We propose a Time-Context aware Question Answering (TCQA) framework.
We build a time-context dependent data generation framework for model training.
We present a metric to evaluate the time awareness of the QA model.
arXiv Detail & Related papers (2023-10-19T08:48:45Z)
- A Benchmark for Generalizable and Interpretable Temporal Question Answering over Knowledge Bases [67.33560134350427]
TempQA-WD is a benchmark dataset for temporal reasoning.
It is based on Wikidata, which is the most frequently curated, openly available knowledge base.
arXiv Detail & Related papers (2022-01-15T08:49:09Z)
- A Dataset for Answering Time-Sensitive Questions [88.95075983560331]
Time is an important dimension in our physical world. Many facts evolve over time.
It is important to consider the time dimension and empower the existing QA models to reason over time.
Existing QA datasets contain rather few time-sensitive questions and hence are not suitable for diagnosing or benchmarking a model's temporal reasoning capability.
arXiv Detail & Related papers (2021-08-13T16:42:25Z)
- NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.