SciCode: A Research Coding Benchmark Curated by Scientists
- URL: http://arxiv.org/abs/2407.13168v1
- Date: Thu, 18 Jul 2024 05:15:24 GMT
- Title: SciCode: A Research Coding Benchmark Curated by Scientists
- Authors: Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, Shengzhu Yin, Minhui Zhu, Kilian Lieret, Yanxin Lu, Genglin Liu, Yufeng Du, Tianhua Tao, Ofir Press, Jamie Callan, Eliu Huerta, Hao Peng
- Abstract summary: Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations.
We created a scientist-curated coding benchmark, SciCode, which includes problems in mathematics, physics, chemistry, biology, and materials science.
Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting.
- Score: 37.900374175754465
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations. We address this issue by examining LMs' capabilities to generate code for solving real scientific research problems. Incorporating input from scientists and AI researchers in 16 diverse natural science sub-fields, including mathematics, physics, chemistry, biology, and materials science, we created a scientist-curated coding benchmark, SciCode. The problems in SciCode naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems. It offers optional descriptions specifying useful scientific background information and scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting. We believe that SciCode both demonstrates contemporary LMs' progress towards becoming helpful scientific assistants and sheds light on the development and evaluation of scientific AI in the future.
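To make the evaluation protocol concrete, below is a minimal sketch of how a per-subproblem harness could score model-generated code against scientist-annotated test cases. The schema and names here (Subproblem, evaluate_subproblem, the toy oscillator task) are hypothetical illustrations, not SciCode's actual API; the one property taken from the paper is that a subproblem counts as solved only if every annotated test passes.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Subproblem:
    """One SciCode-style subproblem (hypothetical schema, not the real API)."""
    prompt: str                                     # natural-language task description
    generated_code: str                             # code produced by the LM under test
    test_cases: list = field(default_factory=list)  # (args, expected) pairs
    entry_point: str = "solve"                      # function name the tests call

def evaluate_subproblem(sub: Subproblem, rel_tol: float = 1e-6) -> bool:
    """Return True only if every scientist-annotated test case passes."""
    namespace: dict = {}
    try:
        exec(sub.generated_code, namespace)         # load the model's code
        fn = namespace[sub.entry_point]
    except Exception:
        return False                                # failed to load or lacks the entry point
    for args, expected in sub.test_cases:
        try:
            result = fn(*args)
        except Exception:
            return False                            # runtime error counts as failure
        if not math.isclose(result, expected, rel_tol=rel_tol):
            return False                            # wrong numerical answer
    return True

# Toy subproblem: n-th quantum harmonic-oscillator energy, E_n = n + 1/2 (hbar = omega = 1).
sub = Subproblem(
    prompt="Return the n-th harmonic-oscillator energy with hbar = omega = 1.",
    generated_code="def solve(n):\n    return n + 0.5",
    test_cases=[((0,), 0.5), ((3,), 3.5)],
)
print(evaluate_subproblem(sub))  # True: both test cases pass
```

Under SciCode's most realistic setting a main problem is solved only when all of its decomposed subproblems pass, which is why even the best model tested, Claude3.5-Sonnet, reaches just 4.6%.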
Related papers
- Scaling Laws in Scientific Discovery with AI and Robot Scientists [72.3420699173245]
An autonomous generalist scientist (AGS) concept combines agentic AI and embodied robotics to automate the entire research lifecycle.
AGS aims to significantly reduce the time and resources needed for scientific discovery.
As these autonomous systems become increasingly integrated into the research process, we hypothesize that scientific discovery might adhere to new scaling laws.
arXiv Detail & Related papers (2025-03-28T14:00:27Z) - CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning [7.41837850475371]
We introduce CURIE, a benchmark to measure the potential of Large Language Models (LLMs) in scientific problem-solving.
This benchmark introduces ten challenging tasks with a total of 580 problems and solution pairs curated by experts in six disciplines.
We evaluate a range of closed and open LLMs on the CURIE tasks, which require domain expertise, comprehension of long in-context information, and multi-step reasoning.
arXiv Detail & Related papers (2025-03-14T17:53:03Z) - Evaluating Sakana's AI Scientist for Autonomous Research: Wishful Thinking or an Emerging Reality Towards 'Artificial Research Intelligence' (ARI)? [19.524056927240498]
Sakana recently introduced the 'AI Scientist', claiming that it conducts research autonomously, i.e., implying that it achieves what we term Artificial Research Intelligence (ARI).
Our evaluation of the AI Scientist reveals critical shortcomings.
arXiv Detail & Related papers (2025-02-20T06:22:03Z) - Exploring Code Comprehension in Scientific Programming: Preliminary Insights from Research Scientists [6.2329239454115415]
This study surveys 57 research scientists from various disciplines to explore their programming backgrounds, their practices, and the challenges they face regarding code readability.
Scientists mainly use Python and R, relying on documentation for readability.
Our findings show low adoption of code quality tools and a trend towards utilizing large language models to improve code quality.
arXiv Detail & Related papers (2025-01-17T08:47:29Z) - Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System [62.832818186789545]
Virtual Scientists (VirSci) is a multi-agent system designed to mimic the teamwork inherent in scientific research.
VirSci organizes a team of agents to collaboratively generate, evaluate, and refine research ideas.
We show that this multi-agent approach outperforms the state-of-the-art method in producing novel scientific ideas.
arXiv Detail & Related papers (2024-10-12T07:16:22Z) - Artificial intelligence for science: The easy and hard problems [1.8722948221596285]
We study the cognitive science of scientists to understand how humans solve the hard problem.
We use the results to design new computational agents that automatically infer and update their scientific paradigms.
arXiv Detail & Related papers (2024-08-24T18:22:06Z) - The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery [14.465756130099091]
This paper presents the first comprehensive framework for fully automatic scientific discovery.
We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, and describes its findings.
In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community.
arXiv Detail & Related papers (2024-08-12T16:58:11Z) - SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z) - DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents [49.74065769505137]
We introduce DISCOVERYWORLD, the first virtual environment for developing and benchmarking an agent's ability to perform complete cycles of novel scientific discovery.
It includes 120 different challenge tasks spanning eight topics, each with three levels of difficulty and several parametric variations.
We find that strong baseline agents that perform well in previously published environments struggle on most DISCOVERYWORLD tasks.
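For a sense of what a complete discovery cycle demands of an agent, here is a deliberately tiny, hypothetical observe-act-measure loop in the gym style such environments typically expose; none of these names come from DISCOVERYWORLD's actual API.

```python
import random

class ToyDiscoveryEnv:
    """A stand-in environment (hypothetical; not the DISCOVERYWORLD API).

    The agent must locate a hidden 'signal' cell by moving and measuring,
    mimicking the observe -> hypothesize -> test loop of discovery tasks.
    """
    def __init__(self, size: int = 10, seed: int = 0):
        rng = random.Random(seed)
        self.size = size
        self.signal = rng.randrange(size)   # hidden ground truth
        self.pos = 0

    def step(self, action: str):
        """Actions: 'left', 'right', 'measure'. Returns (observation, reward, done)."""
        if action == "left":
            self.pos = max(0, self.pos - 1)
        elif action == "right":
            self.pos = min(self.size - 1, self.pos + 1)
        elif action == "measure":
            found = self.pos == self.signal
            return ("signal!" if found else "noise", 1.0 if found else 0.0, found)
        return (f"at cell {self.pos}", 0.0, False)

env = ToyDiscoveryEnv()
obs, done, steps = "start", False, 0
while not done and steps < 50:
    # A trivial baseline policy: sweep right, measuring each cell along the way.
    action = "measure" if steps % 2 == 0 else "right"
    obs, reward, done = env.step(action)
    steps += 1
print(f"solved={done} in {steps} steps, last obs: {obs}")
```

Real DISCOVERYWORLD tasks replace the hidden cell with full hypothesize-experiment-conclude cycles, which is where the strong baselines above break down.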
arXiv Detail & Related papers (2024-06-10T20:08:44Z) - "Turing Tests" For An AI Scientist [0.0]
This paper proposes a "Turing test for an AI scientist" to assess whether an AI agent can conduct scientific research independently.
We propose seven benchmark tests that evaluate an AI agent's ability to make groundbreaking discoveries in various scientific domains.
arXiv Detail & Related papers (2024-05-22T05:14:27Z) - A Review of Neuroscience-Inspired Machine Learning [58.72729525961739]
Bio-plausible credit assignment is compatible with practically any learning condition and is energy-efficient.
In this paper, we survey several vital algorithms that model bio-plausible rules of credit assignment in artificial neural networks.
We conclude by discussing the future challenges that will need to be addressed in order to make such algorithms more useful in practical applications.
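As one concrete member of the family this survey covers, the sketch below implements a plain Hebbian update: credit is assigned from locally available pre- and post-synaptic activity rather than from a backpropagated global error. It is an illustrative toy, not a specific algorithm from the paper.

```python
import numpy as np

def hebbian_update(W, pre, post, lr=0.01):
    """Hebbian rule: Delta W = lr * outer(post, pre). Weights grow where pre-
    and post-synaptic activity coincide; only local activity is used, with
    no global error signal as in backpropagation."""
    return W + lr * np.outer(post, pre)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 5))    # 5 inputs -> 3 output units
x = rng.normal(size=5)                    # pre-synaptic activity
y = np.tanh(W @ x)                        # post-synaptic activity
W = hebbian_update(W, pre=x, post=y)
print(W.shape)                            # (3, 5): weights nudged toward correlated activity
```

Pure Hebbian growth is unstable on its own, which is why practical rules in this literature add normalization (e.g., Oja's rule) or a third, modulatory factor.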
arXiv Detail & Related papers (2024-02-16T18:05:09Z) - SciInstruct: a Self-Reflective Instruction Annotated Dataset for Training Scientific Language Models [57.96527452844273]
We introduce SciInstruct, a suite of scientific instructions for training scientific language models capable of college-level scientific reasoning.
We curated a diverse and high-quality dataset encompassing physics, chemistry, math, and formal proofs.
To verify the effectiveness of SciInstruct, we fine-tuned different language models with SciInstruct, i.e., ChatGLM3 (6B and 32B), Llama3-8B-Instruct, and Mistral-7B: MetaMath.
arXiv Detail & Related papers (2024-01-15T20:22:21Z) - SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models [70.5763210869525]
We introduce SciBench, an expansive benchmark suite for Large Language Models (LLMs).
SciBench contains a dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains.
The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%.
arXiv Detail & Related papers (2023-07-20T07:01:57Z)