AI Predicts AGI: Leveraging AGI Forecasting and Peer Review to Explore LLMs' Complex Reasoning Capabilities
- URL: http://arxiv.org/abs/2412.09385v2
- Date: Tue, 22 Apr 2025 13:56:32 GMT
- Title: AI Predicts AGI: Leveraging AGI Forecasting and Peer Review to Explore LLMs' Complex Reasoning Capabilities
- Authors: Fabrizio Davide, Pietro Torre, Leonardo Ercolani, Andrea Gaggioli
- Abstract summary: We tasked 16 state-of-the-art large language models with estimating the likelihood of Artificial General Intelligence (AGI) emerging by 2030. To assess the quality of these forecasts, we implemented an automated peer review process (LLM-PR).
- Score: 0.3428444467046466
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We tasked 16 state-of-the-art large language models (LLMs) with estimating the likelihood of Artificial General Intelligence (AGI) emerging by 2030. To assess the quality of these forecasts, we implemented an automated peer review process (LLM-PR). The LLMs' estimates varied widely, ranging from 3% (Reka-Core) to 47.6% (GPT-4o), with a median of 12.5%. These estimates closely align with a recent expert survey that projected a 10% likelihood of AGI by 2027, underscoring the relevance of LLMs in forecasting complex, speculative scenarios. The LLM-PR process demonstrated strong reliability, evidenced by a high Intraclass Correlation Coefficient (ICC = 0.79), reflecting notable consistency in scoring across the models. Among the models, Pplx-70b-online emerged as the top performer, while Gemini-1.5-pro-api ranked the lowest. A cross-comparison with external benchmarks, such as LMSYS Chatbot Arena, revealed that LLM rankings remained consistent across different evaluation methods, suggesting that existing benchmarks may not encapsulate some of the skills relevant for AGI prediction. We further explored the use of weighting schemes based on external benchmarks, optimizing the alignment of LLMs' predictions with human expert forecasts. This analysis led to the development of a new 'AGI benchmark' designed to highlight performance differences in AGI-related tasks. Our findings offer insights into LLMs' capabilities in speculative, interdisciplinary forecasting tasks and emphasize the growing need for innovative evaluation frameworks for assessing AI performance in complex, uncertain real-world scenarios.
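To make the reported statistics concrete, the sketch below (not the authors' code) shows how a median forecast and an intraclass correlation coefficient could be computed from a matrix of peer-review scores. The specific ICC form (ICC(2,1), two-way random effects, absolute agreement), the reviewer count, and all numeric values other than the extremes and median quoted above are illustrative assumptions, not data from the paper.

```python
# Minimal sketch of the two quantities reported in the abstract: the median of the
# models' AGI-by-2030 probability estimates and an ICC over the LLM-PR score matrix.
# The ICC form and all placeholder numbers below are assumptions for illustration.
import numpy as np

def icc_2_1(scores: np.ndarray) -> float:
    """ICC(2,1) for an (n_targets x k_raters) score matrix (Shrout & Fleiss)."""
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # mean score per forecast being reviewed
    col_means = scores.mean(axis=0)   # mean score per reviewer model
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_total = ((scores - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols
    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = ss_err / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Illustrative forecast values only; the paper reports just the extremes
# (3% for Reka-Core, 47.6% for GPT-4o) and a median of 12.5%.
estimates = np.array([3.0, 5.0, 8.0, 10.0, 12.0, 13.0, 20.0, 47.6])
print("median estimate (%):", np.median(estimates))

# Hypothetical 16-forecasts x 5-reviewers score matrix standing in for LLM-PR output.
rng = np.random.default_rng(0)
review_scores = rng.normal(loc=70, scale=10, size=(16, 5))
print("ICC(2,1):", round(icc_2_1(review_scores), 2))
```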
Related papers
- LLM-based Automated Grading with Human-in-the-Loop [32.14015215819979]
Large language models (LLMs) are increasingly being used for automatic short answer grading (ASAG).
In this work, we explore the potential of LLMs in ASAG tasks by leveraging their interactive capabilities through a human-in-the-loop (HITL) approach.
Our proposed framework, GradeHITL, utilizes the generative properties of LLMs to pose questions to human experts, incorporating their insights to refine grading rubrics dynamically.
arXiv Detail & Related papers (2025-04-07T16:23:07Z)
- BEATS: Bias Evaluation and Assessment Test Suite for Large Language Models [0.0]
We introduce BEATS, a novel framework for evaluating Bias, Ethics, Fairness, and Factuality in Large Language Models (LLMs).
We present a bias benchmark for LLMs that measures performance across 29 distinct metrics.
These metrics span a broad range of characteristics, including demographic, cognitive, and social biases, as well as measures of ethical reasoning, group fairness, and factuality-related misinformation risk.
arXiv Detail & Related papers (2025-03-31T16:56:52Z)
- Forecasting Frontier Language Model Agent Capabilities [0.7499722271664147]
We evaluate six forecasting methods that predict downstream capabilities of Language Models (LMs).
We use "one-step" approaches that predict benchmark scores directly from input metrics such as compute or model release date, and "two-step" approaches that first predict an intermediate metric such as the principal component of cross-benchmark performance (PC-1) or human-evaluated competitive Elo ratings.
Our forecast predicts that by the beginning of 2026, non-specialized LM agents with low capability elicitation will reach a success rate of 54% on SWE-Bench Verified, while state-of-the-art LM agents will reach an 87% success rate.
arXiv Detail & Related papers (2025-02-21T02:34:17Z)
- A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look [52.114284476700874]
This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed.
We find that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness.
Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits.
arXiv Detail & Related papers (2024-11-13T01:12:35Z)
- ELF-Gym: Evaluating Large Language Models Generated Features for Tabular Prediction [33.03433653251314]
We propose ELF-Gym, a framework for evaluating features generated by Large Language Models (LLMs).
We curated a new dataset from historical Kaggle competitions, including 251 "golden" features used by top-performing teams.
We empirically demonstrate that, in the best-case scenario, LLMs can semantically capture approximately 56% of the golden features, but at the more demanding implementation level this overlap drops to 13%.
arXiv Detail & Related papers (2024-10-13T13:59:33Z)
- Efficacy of Large Language Models in Systematic Reviews [0.0]
This study investigates the effectiveness of Large Language Models (LLMs) in interpreting existing literature.
We compiled and hand-coded a database of 88 relevant papers published from March 2020 to May 2024.
We evaluated two current state-of-the-art LLMs, Meta AI's Llama 3 8B and OpenAI's GPT-4o, on the accuracy of their interpretations.
arXiv Detail & Related papers (2024-08-03T00:01:13Z)
- LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks [106.09361690937618]
There is an increasing trend towards evaluating NLP models with LLMs instead of human judgments.
We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations covering a broad range of evaluated properties and types of data.
We evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations.
arXiv Detail & Related papers (2024-06-26T14:56:13Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present MR-Ben, a process-based benchmark that demands meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach [64.42462708687921]
Evaluations have revealed that factors such as scaling, training type, and architecture profoundly impact the performance of LLMs.
Our study embarks on a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods.
This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering techniques.
arXiv Detail & Related papers (2024-03-22T14:47:35Z)
- Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy [1.999925939110439]
We use an ensemble approach consisting of a crowd of twelve large language models (LLMs).
We compare the aggregated LLM predictions on 31 binary questions to those of a crowd of human forecasters from a three-month forecasting tournament.
We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information.
arXiv Detail & Related papers (2024-02-29T17:27:59Z)
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs [56.526095828316386]
We propose a novel framework for adaptation with self-evaluation to improve the selective prediction performance of large language models (LLMs).
We evaluate our method on a variety of question-answering (QA) datasets and show that it outperforms state-of-the-art selective prediction methods.
arXiv Detail & Related papers (2023-10-18T03:34:59Z)