SciCode: A Research Coding Benchmark Curated by Scientists
- URL: http://arxiv.org/abs/2407.13168v1
- Date: Thu, 18 Jul 2024 05:15:24 GMT
- Title: SciCode: A Research Coding Benchmark Curated by Scientists
- Authors: Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, Shengzhu Yin, Minhui Zhu, Kilian Lieret, Yanxin Lu, Genglin Liu, Yufeng Du, Tianhua Tao, Ofir Press, Jamie Callan, Eliu Huerta, Hao Peng
- Abstract summary: Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations.
We created a scientist-curated coding benchmark, SciCode, which includes problems in mathematics, physics, chemistry, biology, and materials science.
Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting.
- Score: 37.900374175754465
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations. We address this issue by examining LMs' capabilities to generate code for solving real scientific research problems. Incorporating input from scientists and AI researchers in 16 diverse natural science sub-fields, including mathematics, physics, chemistry, biology, and materials science, we created a scientist-curated coding benchmark, SciCode. The problems in SciCode naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems. It offers optional descriptions specifying useful scientific background information and scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting. We believe that SciCode both demonstrates contemporary LMs' progress towards becoming helpful scientific assistants and sheds light on the development and evaluation of scientific AI in the future.
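The abstract describes an evaluation design in which each main problem decomposes into subproblems, each checked against scientist-annotated test cases, and a main problem counts as solved only when all of its subproblems pass. A minimal sketch of that scoring logic is below. This is a hypothetical illustration, not the official SciCode harness; the function names, the test-case dictionary format, and the all-subproblems-must-pass rule for main problems are assumptions made for the example.

```python
# Hypothetical sketch of subproblem-level evaluation against gold test cases.
# Not the official SciCode harness; names and data shapes are illustrative.

def evaluate_subproblem(candidate_code: str, test_cases: list) -> bool:
    """Run model-generated code and check it against annotated test cases.

    Each test case is assumed to be a dict naming the target function,
    its arguments, and the expected result.
    """
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # execute the generated solution
    except Exception:
        return False  # code that fails to run scores zero
    for case in test_cases:
        fn = namespace.get(case["function"])
        if fn is None:
            return False  # required function was never defined
        try:
            if fn(*case["args"]) != case["expected"]:
                return False
        except Exception:
            return False
    return True


def evaluate_main_problem(subproblem_results: list) -> bool:
    # Assumed scoring rule: a main problem is solved only if
    # every one of its subproblems passes its tests.
    return all(subproblem_results)
```

Under this all-or-nothing rule, a model that solves most but not all subproblems of a main problem still receives no credit for it, which is consistent with the low 4.6% main-problem solve rate reported for the best model.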
Related papers
- Artificial intelligence for science: The easy and hard problems [1.8722948221596285]
We study the cognitive science of scientists to understand how humans solve the hard problem.
We use the results to design new computational agents that automatically infer and update their scientific paradigms.
arXiv Detail & Related papers (2024-08-24T18:22:06Z)
- The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery [14.465756130099091]
This paper presents the first comprehensive framework for fully automated scientific discovery.
We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, and describes its findings.
In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community.
arXiv Detail & Related papers (2024-08-12T16:58:11Z)
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z)
- DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents [49.74065769505137]
We introduce DISCOVERYWORLD, the first virtual environment for developing and benchmarking an agent's ability to perform complete cycles of novel scientific discovery.
It includes 120 different challenge tasks spanning eight topics, each with three levels of difficulty and several parametric variations.
We find that strong baseline agents that perform well in prior published environments struggle on most DISCOVERYWORLD tasks.
arXiv Detail & Related papers (2024-06-10T20:08:44Z)
- "Turing Tests" For An AI Scientist [0.0]
This paper proposes a "Turing test for an AI scientist" to assess whether an AI agent can conduct scientific research independently.
We propose seven benchmark tests that evaluate an AI agent's ability to make groundbreaking discoveries in various scientific domains.
arXiv Detail & Related papers (2024-05-22T05:14:27Z)
- SciQAG: A Framework for Auto-Generated Science Question Answering Dataset with Fine-grained Evaluation [11.129800893611646]
SciQAG is a framework for automatically generating high-quality science question-answer pairs from a large corpus of scientific literature using large language models (LLMs).
We construct a large-scale, high-quality, open-ended science QA dataset containing 188,042 QA pairs extracted from 22,743 scientific papers across 24 scientific domains.
We also introduce SciQAG-24D, a new benchmark task designed to evaluate the science question-answering ability of LLMs.
arXiv Detail & Related papers (2024-05-16T09:42:37Z)
- A Review of Neuroscience-Inspired Machine Learning [58.72729525961739]
Bio-plausible credit assignment is compatible with practically any learning condition and is energy-efficient.
In this paper, we survey several vital algorithms that model bio-plausible rules of credit assignment in artificial neural networks.
We conclude by discussing the future challenges that will need to be addressed in order to make such algorithms more useful in practical applications.
arXiv Detail & Related papers (2024-02-16T18:05:09Z)
- SciGLM: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning [60.14510984576027]
SciGLM is a suite of scientific language models able to conduct college-level scientific reasoning.
We apply a self-reflective instruction annotation framework to generate step-by-step reasoning for unlabelled scientific questions.
We fine-tuned the ChatGLM family of language models with SciInstruct, enhancing their scientific and mathematical reasoning capabilities.
arXiv Detail & Related papers (2024-01-15T20:22:21Z)
- SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models [70.5763210869525]
We introduce SciBench, an expansive benchmark suite for evaluating Large Language Models (LLMs).
SciBench contains a dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains.
The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%.
arXiv Detail & Related papers (2023-07-20T07:01:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.