AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy
- URL: http://arxiv.org/abs/2402.07862v1
- Date: Mon, 12 Feb 2024 18:14:43 GMT
- Title: AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy
- Authors: Philipp Schoenegger, Peter S. Park, Ezra Karger, Philip E. Tetlock
- Abstract summary: Large language models (LLMs) show impressive capabilities, matching and sometimes exceeding human performance in many domains.
This study explores the potential of LLMs to augment judgement in forecasting tasks.
- Score: 2.184775414778289
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) show impressive capabilities, matching and
sometimes exceeding human performance in many domains. This study explores the
potential of LLMs to augment judgement in forecasting tasks. We evaluated the
impact on forecasting accuracy of two GPT-4-Turbo assistants: one designed to
provide high-quality advice ('superforecasting'), and the other designed to be
overconfident and base-rate-neglecting. Participants (N = 991) had the option
to consult their assigned LLM assistant throughout the study, in contrast to a
control group that used a less advanced model (DaVinci-003) without direct
forecasting support. Our preregistered analyses reveal that LLM augmentation
significantly enhances forecasting accuracy by 23% across both types of
assistants, compared to the control group. This improvement occurs despite the
superforecasting assistant's higher accuracy in predictions, indicating the
augmentation's benefit is not solely due to model prediction accuracy.
Exploratory analyses showed a pronounced effect in one forecasting item,
without which we find that the superforecasting assistant increased accuracy by
43%, compared with 28% for the biased assistant. We further examine whether LLM
augmentation disproportionately benefits less skilled forecasters, degrades the
wisdom-of-the-crowd by reducing prediction diversity, or varies in
effectiveness with question difficulty. Our findings do not consistently
support these hypotheses. Our results suggest that access to an LLM assistant,
even a biased one, can be a helpful decision aid in cognitively demanding tasks
where the answer is not known at the time of interaction.
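The accuracy comparison described in the abstract can be illustrated with a small sketch. Forecasting tournaments of this kind typically score probabilistic predictions with Brier scores (lower is better); that metric and the data below are assumptions for illustration, not details taken from the paper.

```python
# Sketch: comparing the forecasting accuracy of an LLM-assisted group with a
# control group using Brier scores (the usual tournament metric; assumed here).
# The forecasts and outcomes are invented.

def brier_score(probability: float, outcome: int) -> float:
    """Squared error between a probability forecast and a 0/1 outcome."""
    return (probability - outcome) ** 2

def mean_brier(forecasts: list[tuple[float, int]]) -> float:
    """Average Brier score over (probability, outcome) pairs."""
    return sum(brier_score(p, o) for p, o in forecasts) / len(forecasts)

# (probability assigned to the event, actual outcome)
control = [(0.7, 1), (0.4, 0), (0.9, 0), (0.3, 1)]
assisted = [(0.8, 1), (0.2, 0), (0.6, 0), (0.5, 1)]

control_score = mean_brier(control)    # 0.3875
assisted_score = mean_brier(assisted)  # 0.1725

# Percentage improvement in accuracy = relative reduction in Brier score.
improvement = 100 * (control_score - assisted_score) / control_score
print(f"control={control_score:.3f} assisted={assisted_score:.3f} "
      f"improvement={improvement:.1f}%")
```

A relative reduction in mean Brier score like this is one common way a "23% improvement in accuracy" can be expressed, though the paper's preregistered analysis may define the effect differently.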
Related papers
- Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy [1.999925939110439]
We use an ensemble approach consisting of a crowd of twelve large language models (LLMs).
We compare the aggregated LLM predictions on 31 binary questions to that of a crowd of human forecasters from a three-month forecasting tournament.
We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information.
arXiv Detail & Related papers (2024-02-29T17:27:59Z)
- Humans vs Large Language Models: Judgmental Forecasting in an Era of Advanced AI [0.0]
This study investigates the forecasting accuracy of human experts versus Large Language Models (LLMs) in the retail sector.
Our analysis centered on the effect of the following factors on forecasters' performance: the supporting statistical model (baseline and advanced), whether the product was on promotion, and the nature of external impact.
arXiv Detail & Related papers (2023-12-12T02:28:12Z)
- Augmenting Unsupervised Reinforcement Learning with Self-Reference [63.68018737038331]
Humans possess the ability to draw on past experiences explicitly when learning new tasks.
We propose the Self-Reference (SR) approach, an add-on module explicitly designed to leverage historical information.
Our approach achieves state-of-the-art results in terms of Interquartile Mean (IQM) performance and Optimality Gap reduction on the Unsupervised Reinforcement Learning Benchmark.
arXiv Detail & Related papers (2023-11-16T09:07:34Z)
- Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs [56.526095828316386]
We propose a novel framework for adaptation with self-evaluation to improve the selective prediction performance of large language models (LLMs).
We evaluate our method on a variety of question-answering (QA) datasets and show that it outperforms state-of-the-art selective prediction methods.
arXiv Detail & Related papers (2023-10-18T03:34:59Z)
- PACE-LM: Prompting and Augmentation for Calibrated Confidence Estimation with GPT-4 in Cloud Incident Root Cause Analysis [17.362895895214344]
Large language models (LLMs) are used to help humans identify the root causes of cloud incidents.
We propose to perform confidence estimation for the predictions to help on-call engineers make decisions on whether to adopt the model prediction.
We show that our method produces calibrated confidence estimates for predicted root causes, and we validate the usefulness of the retrieved historical data and the prompting strategy.
arXiv Detail & Related papers (2023-09-11T21:24:00Z)
- ASPEST: Bridging the Gap Between Active Learning and Selective Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
arXiv Detail & Related papers (2023-04-07T23:51:07Z)
- Expected Validation Performance and Estimation of a Random Variable's Maximum [48.83713377993604]
We analyze three statistical estimators for expected validation performance.
We find the unbiased estimator has the highest variance, and the estimator with the smallest variance has the largest bias.
We find that the two biased estimators lead to the fewest incorrect conclusions.
arXiv Detail & Related papers (2021-10-01T18:48:47Z)
- Towards More Fine-grained and Reliable NLP Performance Prediction [85.78131503006193]
We make two contributions to improving performance prediction for NLP tasks.
First, we examine performance predictors for holistic measures of accuracy like F1 or BLEU.
Second, we propose methods to understand the reliability of a performance prediction model from two angles: confidence intervals and calibration.
arXiv Detail & Related papers (2021-02-10T15:23:20Z)
- When Does Uncertainty Matter?: Understanding the Impact of Predictive Uncertainty in ML Assisted Decision Making [68.19284302320146]
We carry out user studies to assess how people with differing levels of expertise respond to different types of predictive uncertainty.
We found that showing posterior predictive distributions led to smaller disagreements with the ML model's predictions.
This suggests that posterior predictive distributions can potentially serve as useful decision aids, though they should be used with caution, taking into account the type of distribution and the expertise of the human.
arXiv Detail & Related papers (2020-11-12T02:23:53Z)
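The last entry's idea of showing decision-makers a posterior predictive distribution rather than a bare point prediction can be sketched with a Beta-Bernoulli model. The model choice and numbers here are illustrative assumptions, not the setup used in that paper.

```python
import random

# Sketch: a Beta-Bernoulli posterior predictive as a decision aid.
# After observing k successes in n trials under a uniform Beta(1, 1) prior,
# the posterior on the success rate is Beta(1 + k, 1 + n - k), and the
# posterior predictive probability of the next success is its mean.

def posterior_predictive(k: int, n: int) -> float:
    """P(next outcome = 1 | k successes in n trials), Beta(1, 1) prior."""
    return (1 + k) / (2 + n)

def credible_interval_95(k: int, n: int, draws: int = 100_000) -> tuple[float, float]:
    """Monte Carlo 95% credible interval for the underlying success rate."""
    samples = sorted(random.betavariate(1 + k, 1 + n - k) for _ in range(draws))
    return samples[int(0.025 * draws)], samples[int(0.975 * draws)]

p = posterior_predictive(k=7, n=10)       # point prediction
lo, hi = credible_interval_95(k=7, n=10)  # uncertainty shown to the user
print(f"predict 1 with p={p:.3f}, 95% interval for rate: ({lo:.3f}, {hi:.3f})")
```

Displaying the interval alongside the point prediction is the kind of "posterior predictive" presentation the study compares against point estimates.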
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.