Predicting Empirical AI Research Outcomes with Language Models
- URL: http://arxiv.org/abs/2506.00794v1
- Date: Sun, 01 Jun 2025 02:46:31 GMT
- Title: Predicting Empirical AI Research Outcomes with Language Models
- Authors: Jiaxin Wen, Chenglei Si, Yueh-han Chen, He He, Shi Feng
- Abstract summary: Many promising-looking ideas in AI research fail to deliver, but their validation takes substantial human labor and compute. We build the first benchmark for this task and compare LMs with human experts. We scrape ideas and experimental results from conference papers, yielding 1,585 human-verified idea pairs published after our base model's cut-off date for testing. We develop a system that combines a fine-tuned GPT-4.1 with a paper retrieval agent, and we recruit 25 human experts to compare with. In the NLP domain, our system beats human experts by a large margin (64.4% vs. 48.9%).
- Score: 27.148683265085012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many promising-looking ideas in AI research fail to deliver, but their validation takes substantial human labor and compute. Predicting an idea's chance of success is thus crucial for accelerating empirical AI research, a skill that even expert researchers can only acquire through substantial experience. We build the first benchmark for this task and compare LMs with human experts. Concretely, given two research ideas (e.g., two jailbreaking methods), we aim to predict which will perform better on a set of benchmarks. We scrape ideas and experimental results from conference papers, yielding 1,585 human-verified idea pairs published after our base model's cut-off date for testing, and 6,000 pairs for training. We then develop a system that combines a fine-tuned GPT-4.1 with a paper retrieval agent, and we recruit 25 human experts to compare with. In the NLP domain, our system beats human experts by a large margin (64.4% vs. 48.9%). On the full test set, our system achieves 77% accuracy, while off-the-shelf frontier LMs like o3 perform no better than random guessing, even with the same retrieval augmentation. We verify that our system does not exploit superficial features like idea complexity through extensive human-written and LM-designed robustness tests. Finally, we evaluate our system on unpublished novel ideas, including ideas generated by an AI ideation agent. Our system achieves 63.6% accuracy, demonstrating its potential as a reward model for improving idea generation models. Altogether, our results outline a promising new direction for LMs to accelerate empirical AI research.
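The task described in the abstract (given two research ideas, predict which performs better) is scored as pairwise accuracy against the ground-truth winner, with 50% corresponding to random guessing. A minimal sketch of that metric, using hypothetical idea names and made-up predictor scores (the paper's actual system is a fine-tuned GPT-4.1 with retrieval, not shown here):

```python
def pairwise_accuracy(pairs, score):
    """pairs: iterable of (idea_a, idea_b, winner) tuples, where winner
    is 'a' or 'b' for the idea that actually performed better.
    score: callable mapping an idea to a real-valued success estimate."""
    correct = 0
    for idea_a, idea_b, winner in pairs:
        predicted = "a" if score(idea_a) >= score(idea_b) else "b"
        correct += predicted == winner
    return correct / len(pairs)

# Toy example: hypothetical predictor scores for four made-up ideas.
toy_scores = {"idea A1": 0.8, "idea A2": 0.3, "idea B1": 0.4, "idea B2": 0.6}
toy_pairs = [
    ("idea A1", "idea A2", "a"),  # predictor ranks A1 higher: correct
    ("idea B1", "idea B2", "a"),  # predictor ranks B2 higher: wrong
]
acc = pairwise_accuracy(toy_pairs, toy_scores.get)  # 0.5, i.e. chance level
```

Under this framing, the paper's headline numbers (77% on the full test set, 64.4% for experts' NLP-domain comparison) are this same fraction computed over the 1,585 human-verified pairs.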
Related papers
- GUIDE: Towards Scalable Advising for Research Ideas [9.819083407389524]
We develop a system to provide high-quality, well-reasoned feedback to refine proposed hypotheses and experimental designs. Our system achieves an acceptance rate exceeding 90% on the ICLR 2025 test set.
arXiv Detail & Related papers (2025-07-09T17:59:21Z) - SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity's Last Exam? [51.112225746095746]
We introduce X-Master, a tool-augmented reasoning agent designed to emulate human researchers. X-Masters sets a new state-of-the-art record on Humanity's Last Exam with a score of 32.1%.
arXiv Detail & Related papers (2025-07-07T17:50:52Z) - The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas [90.26363107905344]
A good idea should not simply appear to be novel; it should also result in better research after being executed. To test whether AI-generated ideas lead to better research outcomes, we conduct an execution study. Comparing the review scores of the same ideas before and after execution, the scores of the LLM-generated ideas decrease significantly more than those of expert-written ideas.
arXiv Detail & Related papers (2025-06-25T19:47:23Z) - How Well Can AI Build SD Models? [0.0]
We introduce two metrics for evaluating AI-generated causal maps: technical correctness (causal translation) and adherence to instructions (conformance). We tested 11 different LLMs on their ability to perform causal translation as well as conform to user instructions.
arXiv Detail & Related papers (2025-03-19T14:48:47Z) - RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts [4.06186944042499]
We introduce RE-Bench, which consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 human experts. We find that the best AI agents achieve a score 4x higher than human experts when both are given a total time budget of 2 hours per environment. Humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2x the score of the top AI agent when both are given 32 total hours (across different attempts).
arXiv Detail & Related papers (2024-11-22T18:30:46Z) - SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories [55.161075901665946]
Super aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories.
Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges, and 602 automatically generated problems for larger-scale development.
We show that state-of-the-art approaches struggle to solve these problems, with the best model (GPT-4o) solving only 16.3% of the end-to-end set and 46.1% of the scenarios.
arXiv Detail & Related papers (2024-09-11T17:37:48Z) - Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers [90.26363107905344]
Large language models (LLMs) have sparked optimism about their potential to accelerate scientific discovery.
No evaluations have shown that LLM systems can take the very first step of producing novel, expert-level ideas.
arXiv Detail & Related papers (2024-09-06T08:25:03Z) - Tree Search for Language Model Agents [69.43007235771383]
We propose an inference-time search algorithm for LM agents to perform exploration and multi-step planning in interactive web environments.
Our approach is a form of best-first tree search that operates within the actual environment space.
It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks.
arXiv Detail & Related papers (2024-07-01T17:07:55Z) - Interesting Scientific Idea Generation using Knowledge Graphs and LLMs: Evaluations with 100 Research Group Leaders [0.6906005491572401]
We introduce SciMuse, which uses 58 million research papers and a large language model to generate research ideas. We conduct a large-scale evaluation in which over 100 research group leaders ranked more than 4,400 personalized ideas based on their interest. This data allows us to predict research interest using (1) supervised neural networks trained on human evaluations, and (2) unsupervised zero-shot ranking with large language models.
arXiv Detail & Related papers (2024-05-27T11:00:51Z) - ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models [56.08917291606421]
ResearchAgent is an AI-based system for the ideation and operationalization of novel work. It automatically defines novel problems, proposes methods, and designs experiments, while iteratively refining them. We experimentally validate ResearchAgent on scientific publications across multiple disciplines.
arXiv Detail & Related papers (2024-04-11T13:36:29Z) - Unveiling the Sentinels: Assessing AI Performance in Cybersecurity Peer Review [4.081120388114928]
In the field of cybersecurity, the practice of double-blind peer review is the de facto standard.
This paper touches on the holy grail of peer reviewing and aims to shed light on the performance of AI in reviewing for academic security conferences.
We investigate the predictability of reviewing outcomes by comparing the results obtained from human reviewers and machine-learning models.
arXiv Detail & Related papers (2023-09-11T13:51:40Z) - Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable: even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the scores produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.