Related papers: AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy

AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy

URL: http://arxiv.org/abs/2402.07862v2
Date: Thu, 22 Aug 2024 13:57:30 GMT
Title: AI-Augmented Predictions: LLM Assistants Improve Human Forecasting Accuracy
Authors: Philipp Schoenegger, Peter S. Park, Ezra Karger, Sean Trott, Philip E. Tetlock,
Abstract summary: Large language models (LLMs) match and sometimes exceed human performance in many domains. This study explores the potential of LLMs to augment human judgement in a forecasting task.
Score: 3.7865171120254355
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) match and sometimes exceeding human performance in many domains. This study explores the potential of LLMs to augment human judgement in a forecasting task. We evaluate the effect on human forecasters of two LLM assistants: one designed to provide high-quality ("superforecasting") advice, and the other designed to be overconfident and base-rate neglecting, thus providing noisy forecasting advice. We compare participants using these assistants to a control group that received a less advanced model that did not provide numerical predictions or engaged in explicit discussion of predictions. Participants (N = 991) answered a set of six forecasting questions and had the option to consult their assigned LLM assistant throughout. Our preregistered analyses show that interacting with each of our frontier LLM assistants significantly enhances prediction accuracy by between 24 percent and 28 percent compared to the control group. Exploratory analyses showed a pronounced outlier effect in one forecasting item, without which we find that the superforecasting assistant increased accuracy by 41 percent, compared with 29 percent for the noisy assistant. We further examine whether LLM forecasting augmentation disproportionately benefits less skilled forecasters, degrades the wisdom-of-the-crowd by reducing prediction diversity, or varies in effectiveness with question difficulty. Our data do not consistently support these hypotheses. Our results suggest that access to a frontier LLM assistant, even a noisy one, can be a helpful decision aid in cognitively demanding tasks compared to a less powerful model that does not provide specific forecasting advice. However, the effects of outliers suggest that further research into the robustness of this pattern is needed.

Related papers

Consistency Checks for Language Model Forecasters [54.62507816753479]
We measure the performance of forecasters in terms of the consistency of their predictions on different logically-related questions. We build an automated evaluation system that generates a set of base questions, instantiates consistency checks from these questions, elicits predictions of the forecaster, and measures the consistency of the predictions.
arXiv Detail & Related papers (2024-12-24T16:51:35Z)
Hybrid Forecasting of Geopolitical Events [71.73737011120103]
SAGE is a hybrid forecasting system that combines human and machine generated forecasts. The system aggregates human and machine forecasts weighting both for propinquity and based on assessed skill. We show that skilled forecasters who had access to machine-generated forecasts outperformed those who only viewed historical data.
arXiv Detail & Related papers (2024-12-14T22:09:45Z)
Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models [58.58594658683919]
Large multimodal models (LMMs) have shown transformative potential across various research tasks. Our findings indicate LMMs possess advantages in zero-shot learning, interpretability, and handling uncurated 'in-the-wild' inputs. We propose a Chain-of-Thought augmented prompting approach, which effectively mitigates the off-target prediction issue.
arXiv Detail & Related papers (2024-05-24T16:26:56Z)
LLMs-based Few-Shot Disease Predictions using EHR: A Novel Approach Combining Predictive Agent Reasoning and Critical Agent Instruction [38.11497959553319]
We investigate the feasibility of applying Large Language Models to convert structured patient visit data into natural language narratives. We evaluate the zero-shot and few-shot performance of LLMs using various EHR-prediction-oriented prompting strategies. Our results demonstrate that with the proposed approach, LLMs can achieve decent few-shot performance compared to traditional supervised learning methods in EHR-based disease predictions.
arXiv Detail & Related papers (2024-03-19T18:10:13Z)
Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy [1.999925939110439]
We use an ensemble approach consisting of a crowd of twelve large language models (LLMs) We compare the aggregated LLM predictions on 31 binary questions to that of a crowd of human forecasters from a three-month forecasting tournament. We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information.
arXiv Detail & Related papers (2024-02-29T17:27:59Z)
Humans vs Large Language Models: Judgmental Forecasting in an Era of Advanced AI [0.0]
This study investigates the forecasting accuracy of human experts versus Large Language Models (LLMs) in the retail sector. Our analysis centered on the effect of the following factors on forecasters performance: the supporting statistical model (baseline and advanced), whether the product was on promotion, and the nature of external impact.
arXiv Detail & Related papers (2023-12-12T02:28:12Z)
Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs [56.526095828316386]
We propose a novel framework for adaptation with self-evaluation to improve the selective prediction performance of large language models (LLMs) We evaluate our method on a variety of question-answering (QA) datasets and show that it outperforms state-of-the-art selective prediction methods.
arXiv Detail & Related papers (2023-10-18T03:34:59Z)
PACE-LM: Prompting and Augmentation for Calibrated Confidence Estimation with GPT-4 in Cloud Incident Root Cause Analysis [17.362895895214344]
Large language models (LLMs) are used to help humans identify the root causes of cloud incidents. We propose to perform confidence estimation for the predictions to help on-call engineers make decisions on whether to adopt the model prediction. We show that our method is able to produce calibrated confidence estimates for predicted root causes, validate the usefulness of retrieved historical data and the prompting strategy.
arXiv Detail & Related papers (2023-09-11T21:24:00Z)
Prediction-Oriented Bayesian Active Learning [51.426960808684655]
Expected predictive information gain (EPIG) is an acquisition function that measures information gain in the space of predictions rather than parameters. EPIG leads to stronger predictive performance compared with BALD across a range of datasets and models.
arXiv Detail & Related papers (2023-04-17T10:59:57Z)
ASPEST: Bridging the Gap Between Active Learning and Selective Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain. Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples. In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
arXiv Detail & Related papers (2023-04-07T23:51:07Z)
Towards More Fine-grained and Reliable NLP Performance Prediction [85.78131503006193]
We make two contributions to improving performance prediction for NLP tasks. First, we examine performance predictors for holistic measures of accuracy like F1 or BLEU. Second, we propose methods to understand the reliability of a performance prediction model from two angles: confidence intervals and calibration.
arXiv Detail & Related papers (2021-02-10T15:23:20Z)
Towards Auditability for Fairness in Deep Learning [1.052782170493037]
Group fairness metrics can detect when a deep learning model behaves differently for advantaged and disadvantaged groups. We present smooth prediction sensitivity, an efficiently computed measure of individual fairness for deep learning models.
arXiv Detail & Related papers (2020-11-30T21:28:12Z)
When Does Uncertainty Matter?: Understanding the Impact of Predictive Uncertainty in ML Assisted Decision Making [68.19284302320146]
We carry out user studies to assess how people with differing levels of expertise respond to different types of predictive uncertainty. We found that showing posterior predictive distributions led to smaller disagreements with the ML model's predictions. This suggests that posterior predictive distributions can potentially serve as useful decision aids which should be used with caution and take into account the type of distribution and the expertise of the human.
arXiv Detail & Related papers (2020-11-12T02:23:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.