AI Predicts AGI: Leveraging AGI Forecasting and Peer Review to Explore LLMs' Complex Reasoning Capabilities
- URL: http://arxiv.org/abs/2412.09385v1
- Date: Thu, 12 Dec 2024 15:52:41 GMT
- Title: AI Predicts AGI: Leveraging AGI Forecasting and Peer Review to Explore LLMs' Complex Reasoning Capabilities
- Authors: Fabrizio Davide, Pietro Torre, Andrea Gaggioli
- Abstract summary: We tasked 16 state-of-the-art large language models with estimating the likelihood of Artificial General Intelligence (AGI) emerging by 2030.
To assess the quality of these forecasts, we implemented an automated peer review process (LLM-PR).
- Score: 0.3745329282477066
- License:
- Abstract: We tasked 16 state-of-the-art large language models (LLMs) with estimating the likelihood of Artificial General Intelligence (AGI) emerging by 2030. To assess the quality of these forecasts, we implemented an automated peer review process (LLM-PR). The LLMs' estimates varied widely, ranging from 3% (Reka-Core) to 47.6% (GPT-4o), with a median of 12.5%. These estimates closely align with a recent expert survey that projected a 10% likelihood of AGI by 2027, underscoring the relevance of LLMs in forecasting complex, speculative scenarios. The LLM-PR process demonstrated strong reliability, evidenced by a high Intraclass Correlation Coefficient (ICC = 0.79), reflecting notable consistency in scoring across the models. Among the models, Pplx-70b-online emerged as the top performer, while Gemini-1.5-pro-api ranked the lowest. A cross-comparison with external benchmarks, such as LMSYS Chatbot Arena, revealed that LLM rankings remained consistent across different evaluation methods, suggesting that existing benchmarks may not encapsulate some of the skills relevant for AGI prediction. We further explored the use of weighting schemes based on external benchmarks, optimizing the alignment of LLMs' predictions with human expert forecasts. This analysis led to the development of a new, 'AGI benchmark' designed to highlight performance differences in AGI-related tasks. Our findings offer insights into LLMs' capabilities in speculative, interdisciplinary forecasting tasks and emphasize the growing need for innovative evaluation frameworks for assessing AI performance in complex, uncertain real-world scenarios.
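As a rough, illustrative sketch (not the authors' implementation), the snippet below shows how the quantities named in the abstract could be computed: a median over per-model probability estimates, a benchmark-weighted aggregate forecast, and an intraclass correlation coefficient (ICC) over peer-review scores. The model names, placeholder values, and the choice of the ICC(2,1) variant are assumptions made for illustration; only the 3% and 47.6% endpoints are taken from the abstract.

```python
"""Illustrative sketch (not the paper's code): aggregating LLM AGI-probability
forecasts and checking peer-review score reliability with an ICC.
All model names and numbers below are placeholders, not the paper's data."""

import numpy as np


def aggregate_forecasts(forecasts):
    """Summarise per-model probability estimates (values in [0, 1])."""
    p = np.array(list(forecasts.values()), dtype=float)
    return {"min": float(p.min()), "median": float(np.median(p)), "max": float(p.max())}


def weighted_forecast(forecasts, benchmark):
    """Benchmark-weighted average: weight each model's estimate by its
    (normalised) score on an external benchmark, e.g. an arena rating."""
    models = list(forecasts)
    w = np.array([benchmark[m] for m in models], dtype=float)
    w = w / w.sum()
    p = np.array([forecasts[m] for m in models], dtype=float)
    return float(np.dot(w, p))


def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `ratings` has shape (n_targets, k_raters): here, each row is one model's
    forecast report and each column is one peer-reviewing LLM's score."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    ss_total = np.sum((ratings - grand) ** 2)
    ss_err = ss_total - ss_rows - ss_cols
    ms_r = ss_rows / (n - 1)               # between-target mean square
    ms_c = ss_cols / (k - 1)               # between-rater mean square
    ms_e = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return float((ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n))


if __name__ == "__main__":
    # Placeholder forecasts; only the 3% / 47.6% endpoints come from the abstract.
    forecasts = {"reka-core": 0.03, "model-b": 0.10, "model-c": 0.15, "gpt-4o": 0.476}
    arena_scores = {"reka-core": 1150.0, "model-b": 1200.0, "model-c": 1230.0, "gpt-4o": 1280.0}
    print(aggregate_forecasts(forecasts))
    print("benchmark-weighted estimate:", round(weighted_forecast(forecasts, arena_scores), 3))

    rng = np.random.default_rng(0)
    peer_scores = rng.normal(loc=7.0, scale=1.0, size=(4, 5))  # 4 reports x 5 reviewers
    print("ICC(2,1):", round(icc2_1(peer_scores), 2))
```

ICC(2,1) is one common reliability form; the abstract reports ICC = 0.79 without specifying which variant was used, so the formula above should be read as a generic example rather than a reproduction of the paper's analysis.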
Related papers
- Benchmarking Post-Training Quantization in LLMs: Comprehensive Taxonomy, Unified Evaluation, and Comparative Analysis [89.60263788590893]
Post-training quantization (PTQ) has been extensively adopted for the compression of large language models (LLMs).
Existing algorithms focus primarily on performance, overlooking the trade-off among model size, performance, and quantization bitwidth.
arXiv Detail & Related papers (2025-02-18T07:35:35Z)
- A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look [52.114284476700874]
This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed.
We find that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness.
Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits.
arXiv Detail & Related papers (2024-11-13T01:12:35Z)
- ELF-Gym: Evaluating Large Language Models Generated Features for Tabular Prediction [33.03433653251314]
We propose ELF-Gym, a framework for evaluating features generated by Large Language Models (LLMs) for tabular prediction.
We curated a new dataset from historical Kaggle competitions, including 251 "golden" features used by top-performing teams.
We empirically demonstrate that, in the best-case scenario, LLMs can semantically capture approximately 56% of the golden features, but at the more demanding implementation level this overlap drops to 13%.
arXiv Detail & Related papers (2024-10-13T13:59:33Z)
- Efficacy of Large Language Models in Systematic Reviews [0.0]
This study investigates the effectiveness of Large Language Models (LLMs) in interpreting existing literature.
We compiled and hand-coded a database of 88 relevant papers published from March 2020 to May 2024.
We evaluated two current state-of-the-art LLMs, Meta AI's Llama 3 8B and OpenAI's GPT-4o, on the accuracy of their interpretations.
arXiv Detail & Related papers (2024-08-03T00:01:13Z)
- LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks [106.09361690937618]
There is an increasing trend towards evaluating NLP models with LLMs instead of human judgments.
We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations covering a broad range of evaluated properties and types of data.
We evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations.
arXiv Detail & Related papers (2024-06-26T14:56:13Z)
- Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach [64.42462708687921]
Evaluations have revealed that scaling, training types, architectures, and other factors profoundly impact the performance of LLMs.
Our study embarks on a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods.
This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering techniques.
arXiv Detail & Related papers (2024-03-22T14:47:35Z)
- Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy [1.999925939110439]
We use an ensemble approach consisting of a crowd of twelve large language models (LLMs).
We compare the aggregated LLM predictions on 31 binary questions to those of a crowd of human forecasters from a three-month forecasting tournament.
We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information.
arXiv Detail & Related papers (2024-02-29T17:27:59Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs [56.526095828316386]
We propose a novel framework for adaptation with self-evaluation to improve the selective prediction performance of large language models (LLMs).
We evaluate our method on a variety of question-answering (QA) datasets and show that it outperforms state-of-the-art selective prediction methods.
arXiv Detail & Related papers (2023-10-18T03:34:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.