Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle
- URL: http://arxiv.org/abs/2411.08324v1
- Date: Wed, 13 Nov 2024 04:20:20 GMT
- Title: Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle
- Authors: Hui Dai, Ryan Teehan, Mengye Ren
- Abstract summary: We propose using future event prediction as a continuous evaluation method to assess Large Language Models' temporal generalization abilities.
Our benchmark, Daily Oracle, automatically generates question-answer pairs from daily news, challenging LLMs to predict "future" event outcomes.
- Score: 13.192628306219248
- Abstract: Many existing evaluation benchmarks for Large Language Models (LLMs) quickly become outdated due to the emergence of new models and training data. These benchmarks also fall short in assessing how LLM performance changes over time, as they consist of static questions without a temporal dimension. To address these limitations, we propose using future event prediction as a continuous evaluation method to assess LLMs' temporal generalization and forecasting abilities. Our benchmark, Daily Oracle, automatically generates question-answer (QA) pairs from daily news, challenging LLMs to predict "future" event outcomes. Our findings reveal that as pre-training data becomes outdated, LLM performance degrades over time. While Retrieval Augmented Generation (RAG) has the potential to enhance prediction accuracy, the performance degradation pattern persists, highlighting the need for continuous model updates.
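The degradation pattern described in the abstract can be illustrated by grouping dated QA results by month and computing accuracy per group. The following is a minimal sketch, not the authors' pipeline: the record layout `(question_date, predicted, actual)`, the function name `monthly_accuracy`, and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date


def monthly_accuracy(records):
    """Return accuracy per calendar month.

    `records` is an iterable of (question_date, predicted, actual) tuples.
    This layout is a hypothetical one chosen for the sketch.
    """
    totals = defaultdict(lambda: [0, 0])  # month -> [correct, total]
    for question_date, predicted, actual in records:
        key = question_date.strftime("%Y-%m")
        totals[key][0] += int(predicted == actual)
        totals[key][1] += 1
    # Sort by month so the result reads as a time series.
    return {
        month: correct / total
        for month, (correct, total) in sorted(totals.items())
    }


# Toy data (fabricated for illustration only, not results from the paper).
records = [
    (date(2023, 1, 5), "Yes", "Yes"),
    (date(2023, 1, 20), "No", "No"),
    (date(2023, 6, 3), "Yes", "No"),
    (date(2023, 6, 18), "No", "No"),
]

print(monthly_accuracy(records))
```

Plotting the returned values against the month keys, with the model's training cutoff marked, would be one way to inspect whether accuracy falls as questions move further past the cutoff.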
Related papers
- Adversarial Reasoning at Jailbreaking Time [49.70772424278124]
We develop an adversarial reasoning approach to automatic jailbreaking via test-time computation.
Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
arXiv Detail & Related papers (2025-02-03T18:59:01Z)
- Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs.
LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data.
Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z)
- TimeRAG: BOOSTING LLM Time Series Forecasting via Retrieval-Augmented Generation [5.607649016637917]
TimeRAG is a framework that incorporates Retrieval-Augmented Generation (RAG) into time series forecasting LLMs.
Experiments on datasets from various domains show that the integration of RAG improved the prediction accuracy of the original model by 2.97% on average.
arXiv Detail & Related papers (2024-12-21T14:27:38Z)
- Dynamic Uncertainty Ranking: Enhancing Retrieval-Augmented In-Context Learning for Long-Tail Knowledge in LLMs [50.29035873837]
Large language models (LLMs) can learn vast amounts of knowledge from diverse domains during pre-training.
Long-tail knowledge from specialized domains is often scarce and underrepresented, rarely appearing in the models' memorization.
We propose a reinforcement learning-based dynamic uncertainty ranking method for ICL that accounts for the varying impact of each retrieved sample on LLM predictions.
arXiv Detail & Related papers (2024-10-31T03:42:17Z)
- A Comprehensive Evaluation of Large Language Models on Temporal Event Forecasting [45.0261082985087]
We conduct a comprehensive evaluation of Large Language Models (LLMs) for temporal event forecasting.
We find that directly integrating raw texts into the input of LLMs does not enhance zero-shot extrapolation performance.
In contrast, incorporating raw texts in specific complex events and fine-tuning LLMs significantly improves performance.
arXiv Detail & Related papers (2024-07-16T11:58:54Z)
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
- Is Your LLM Outdated? Evaluating LLMs at Temporal Generalization [37.58752947129519]
The rapid advancement of Large Language Models (LLMs) highlights the urgent need for evolving evaluation methodologies.
Traditional benchmarks, which are often static, fail to capture the continually changing information landscape.
Our study examines temporal generalization, which includes the ability to understand, predict, and generate text relevant to past, present, and future contexts.
arXiv Detail & Related papers (2024-05-14T09:31:31Z)
- Time Series Forecasting with LLMs: Understanding and Enhancing Model Capabilities [46.02234423159257]
Large language models (LLMs) have been applied in many fields and have developed rapidly in recent years.
Recent works treat large language models as zero-shot time series reasoners without further fine-tuning.
Our study shows that LLMs perform well in predicting time series with clear patterns and trends, but face challenges with datasets lacking periodicity.
arXiv Detail & Related papers (2024-02-16T17:15:28Z)
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
- Can LMs Generalize to Future Data? An Empirical Analysis on Text Summarization [50.20034493626049]
Recent pre-trained language models (PLMs) achieve promising results in existing abstractive summarization datasets.
Existing summarization benchmarks overlap in time with the standard pre-training corpora and finetuning datasets.
We show that parametric knowledge stored in summarization models significantly affects the faithfulness of the generated summaries on future data.
arXiv Detail & Related papers (2023-05-03T08:08:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.