Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle
- URL: http://arxiv.org/abs/2411.08324v1
- Date: Wed, 13 Nov 2024 04:20:20 GMT
- Title: Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle
- Authors: Hui Dai, Ryan Teehan, Mengye Ren,
- Abstract summary: We propose using future event prediction as a continuous evaluation method to assess Large Language Models' temporal generalization abilities.
Our benchmark, Daily Oracle, automatically generates question-answer pairs from daily news, challenging LLMs to predict "future" event outcomes.
- Score: 13.192628306219248
- License:
- Abstract: Many existing evaluation benchmarks for Large Language Models (LLMs) quickly become outdated due to the emergence of new models and training data. These benchmarks also fall short in assessing how LLM performance changes over time, as they consist of static questions without a temporal dimension. To address these limitations, we propose using future event prediction as a continuous evaluation method to assess LLMs' temporal generalization and forecasting abilities. Our benchmark, Daily Oracle, automatically generates question-answer (QA) pairs from daily news, challenging LLMs to predict "future" event outcomes. Our findings reveal that as pre-training data becomes outdated, LLM performance degrades over time. While Retrieval Augmented Generation (RAG) has the potential to enhance prediction accuracy, the performance degradation pattern persists, highlighting the need for continuous model updates.
Related papers
- Dynamic Uncertainty Ranking: Enhancing In-Context Learning for Long-Tail Knowledge in LLMs [50.29035873837]
Large language models (LLMs) can learn vast amounts of knowledge from diverse domains during pre-training.
Long-tail knowledge from specialized domains is often scarce and underrepresented, rarely appearing in the models' memorization.
We propose a reinforcement learning-based dynamic uncertainty ranking method for ICL that accounts for the varying impact of each retrieved sample on LLM predictions.
arXiv Detail & Related papers (2024-10-31T03:42:17Z) - A Comprehensive Evaluation of Large Language Models on Temporal Event Forecasting [45.0261082985087]
We conduct a comprehensive evaluation of Large Language Models (LLMs) for temporal event forecasting.
We find that directly integrating raw texts into the input of LLMs does not enhance zero-shot extrapolation performance.
In contrast, incorporating raw texts in specific complex events and fine-tuning LLMs significantly improves performance.
arXiv Detail & Related papers (2024-07-16T11:58:54Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z) - Is Your LLM Outdated? Evaluating LLMs at Temporal Generalization [37.58752947129519]
The rapid advancement of Large Language Models (LLMs) highlights the urgent need for evolving evaluation methodologies.
Traditional benchmarks, which are often static, fail to capture the continually changing information landscape.
Our study examines temporal generalization, which includes the ability to understand, predict, and generate text relevant to past, present, and future contexts.
arXiv Detail & Related papers (2024-05-14T09:31:31Z) - Time Series Forecasting with LLMs: Understanding and Enhancing Model Capabilities [46.02234423159257]
Large language models (LLMs) have been applied in many fields and have developed rapidly in recent years.
Recent works treat large language models as emphzero-shot time series reasoners without further fine-tuning.
Our study shows that LLMs perform well in predicting time series with clear patterns and trends, but face challenges with datasets lacking periodicity.
arXiv Detail & Related papers (2024-02-16T17:15:28Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models(LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z) - Can LMs Generalize to Future Data? An Empirical Analysis on Text
Summarization [50.20034493626049]
Recent pre-trained language models (PLMs) achieve promising results in existing abstractive summarization datasets.
Existing summarization benchmarks overlap in time with the standard pre-training corpora and finetuning datasets.
We show that parametric knowledge stored in summarization models significantly affects the faithfulness of the generated summaries on future data.
arXiv Detail & Related papers (2023-05-03T08:08:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.