A Comprehensive Evaluation of Large Language Models on Temporal Event Forecasting
- URL: http://arxiv.org/abs/2407.11638v1
- Date: Tue, 16 Jul 2024 11:58:54 GMT
- Title: A Comprehensive Evaluation of Large Language Models on Temporal Event Forecasting
- Authors: He Chang, Chenchen Ye, Zhulin Tao, Jie Wu, Zhengmao Yang, Yunshan Ma, Xianglin Huang, Tat-Seng Chua,
- Abstract summary: We conduct a comprehensive evaluation of Large Language Models (LLMs) for temporal event forecasting.
We find that directly integrating raw texts into the input of LLMs does not enhance zero-shot extrapolation performance.
In contrast, incorporating raw texts in specific complex events and fine-tuning LLMs significantly improves performance.
- Score: 45.0261082985087
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Large Language Models (LLMs) have demonstrated great potential in various data mining tasks, such as knowledge question answering, mathematical reasoning, and commonsense reasoning. However, the reasoning capability of LLMs on temporal event forecasting has been under-explored. To systematically investigate their abilities in temporal event forecasting, we conduct a comprehensive evaluation of LLM-based methods for temporal event forecasting. Due to the lack of a high-quality dataset that involves both graph and textual data, we first construct a benchmark dataset, named MidEast-TE-mini. Based on this dataset, we design a series of baseline methods, characterized by various input formats and retrieval augmented generation(RAG) modules. From extensive experiments, we find that directly integrating raw texts into the input of LLMs does not enhance zero-shot extrapolation performance. In contrast, incorporating raw texts in specific complex events and fine-tuning LLMs significantly improves performance. Moreover, enhanced with retrieval modules, LLM can effectively capture temporal relational patterns hidden in historical events. Meanwhile, issues such as popularity bias and the long-tail problem still persist in LLMs, particularly in the RAG-based method. These findings not only deepen our understanding of LLM-based event forecasting methods but also highlight several promising research directions.We consider that this comprehensive evaluation, along with the identified research opportunities, will significantly contribute to future research on temporal event forecasting through LLMs.
Related papers
- Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.
We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.
We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z) - An Evaluation of Standard Statistical Models and LLMs on Time Series Forecasting [16.583730806230644]
This study highlights the key challenges that large language models encounter in the context of time series prediction.
The empirical results indicate that while large language models can perform well in zero-shot forecasting for certain datasets, their predictive accuracy diminishes notably when confronted with diverse time series data and traditional signals.
arXiv Detail & Related papers (2024-08-09T05:13:03Z) - Enhancing Temporal Understanding in LLMs for Semi-structured Tables [50.59009084277447]
We conduct a comprehensive analysis of temporal datasets to pinpoint the specific limitations of large language models (LLMs)
Our investigation leads to enhancements in TempTabQA, a dataset specifically designed for temporal temporal question answering.
We introduce a novel approach, C.L.E.A.R. to strengthen LLM capabilities in this domain.
arXiv Detail & Related papers (2024-07-22T20:13:10Z) - Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning [20.066249913943405]
Large language models (LLMs) have showcased remarkable reasoning capabilities, yet they remain susceptible to errors.
We introduce novel synthetic datasets specifically designed to assess LLM temporal reasoning abilities in various scenarios.
Our findings provide valuable insights into the strengths and weaknesses of current LLMs in temporal reasoning tasks.
arXiv Detail & Related papers (2024-06-13T14:31:19Z) - Time Series Forecasting with LLMs: Understanding and Enhancing Model Capabilities [46.02234423159257]
Large language models (LLMs) have been applied in many fields and have developed rapidly in recent years.
Recent works treat large language models as emphzero-shot time series reasoners without further fine-tuning.
Our study shows that LLMs perform well in predicting time series with clear patterns and trends, but face challenges with datasets lacking periodicity.
arXiv Detail & Related papers (2024-02-16T17:15:28Z) - Time-LLM: Time Series Forecasting by Reprogramming Large Language Models [110.20279343734548]
Time series forecasting holds significant importance in many real-world dynamic systems.
We present Time-LLM, a reprogramming framework to repurpose large language models for time series forecasting.
Time-LLM is a powerful time series learner that outperforms state-of-the-art, specialized forecasting models.
arXiv Detail & Related papers (2023-10-03T01:31:25Z) - LLM4TS: Aligning Pre-Trained LLMs as Data-Efficient Time-Series Forecasters [11.796765525301051]
We propose a framework for time-series forecasting with pre-trained Large Language Models (LLMs)
LLM4TS consists of a two-stage fine-tuning strategy to align LLMs with the nuances of time-series data, and the forecasting fine-tuning stage for downstream time-series forecasting tasks.
Our framework features a novel two-level aggregation method that integrates multi-scale temporal data within pre-trained LLMs, enhancing their ability to interpret time-specific information.
arXiv Detail & Related papers (2023-08-16T16:19:50Z) - To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
Second, we examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.