SciCode: A Research Coding Benchmark Curated by Scientists
- URL: http://arxiv.org/abs/2407.13168v1
- Date: Thu, 18 Jul 2024 05:15:24 GMT
- Title: SciCode: A Research Coding Benchmark Curated by Scientists
- Authors: Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, Cunwei Fan, Xuefei Guo, Roland Haas, Pan Ji, Kittithat Krongchon, Yao Li, Shengyan Liu, Di Luo, Yutao Ma, Hao Tong, Kha Trinh, Chenyu Tian, Zihan Wang, Bohao Wu, Yanyu Xiong, Shengzhu Yin, Minhui Zhu, Kilian Lieret, Yanxin Lu, Genglin Liu, Yufeng Du, Tianhua Tao, Ofir Press, Jamie Callan, Eliu Huerta, Hao Peng
- Abstract summary: Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations.
We created a scientist-curated coding benchmark, SciCode, which includes problems in mathematics, physics, chemistry, biology, and materials science.
Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting.
- Score: 37.900374175754465
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations. We address this issue by examining LMs' capabilities to generate code for solving real scientific research problems. Incorporating input from scientists and AI researchers in 16 diverse natural science sub-fields, including mathematics, physics, chemistry, biology, and materials science, we created a scientist-curated coding benchmark, SciCode. The problems in SciCode naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems. It offers optional descriptions specifying useful scientific background information, as well as scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting. We believe that SciCode both demonstrates contemporary LMs' progress towards becoming helpful scientific assistants and sheds light on the development and evaluation of scientific AI in the future.
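The abstract describes a hierarchical benchmark layout: 80 main problems decomposed into 338 subproblems, each carrying optional scientific background, a scientist-annotated gold-standard solution, and test cases. As a rough illustration only, the Python sketch below shows one way such records and an all-or-nothing main-problem pass rate could be represented. The names (SubProblem, MainProblem, main_problem_pass_rate) and the scoring rule assumed here (a main problem counts as solved only when every one of its subproblems passes its tests) are illustrative assumptions, not SciCode's published schema or evaluation harness.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class SubProblem:
    """One step of a main problem: prompt, optional scientific background,
    a scientist-annotated gold solution, and test cases for evaluation.
    (Illustrative structure, not SciCode's actual schema.)"""
    prompt: str
    background: Optional[str] = None   # optional context; may be withheld in harder settings
    gold_solution: str = ""
    tests: List[Callable[[str], bool]] = field(default_factory=list)

@dataclass
class MainProblem:
    """A research problem that naturally factorizes into ordered subproblems."""
    title: str
    subproblems: List[SubProblem] = field(default_factory=list)

def subproblem_passes(generated_code: str, sub: SubProblem) -> bool:
    """A subproblem counts as solved only if every test accepts the generated code."""
    return all(test(generated_code) for test in sub.tests)

def main_problem_pass_rate(problems: List[MainProblem],
                           generations: Dict[str, List[str]]) -> float:
    """Fraction of main problems whose subproblems ALL pass their tests,
    i.e. an all-or-nothing notion of solving a main problem (assumed scoring)."""
    if not problems:
        return 0.0
    solved = sum(
        1 for prob in problems
        if all(subproblem_passes(code, sub)
               for code, sub in zip(generations[prob.title], prob.subproblems))
    )
    return solved / len(problems)
```

Under such all-or-nothing scoring, a model can pass many individual subproblems while solving few complete main problems, which is consistent with the low headline figure (4.6% for the best model in the most realistic setting) reported in the abstract.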
Related papers
- Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation [58.064940977804596]
A plethora of new AI models and tools has been proposed, promising to empower researchers and academics worldwide to conduct their research more effectively and efficiently.
Ethical concerns regarding shortcomings of these tools and potential for misuse take a particularly prominent place in our discussion.
arXiv Detail & Related papers (2025-02-07T18:26:45Z) - Exploring Code Comprehension in Scientific Programming: Preliminary Insights from Research Scientists [6.2329239454115415]
This study surveys 57 research scientists from various disciplines to explore their programming backgrounds, practices, and the challenges they face regarding code readability.
Scientists mainly use Python and R, relying on documentation for readability.
Our findings show low adoption of code quality tools and a trend towards utilizing large language models to improve code quality.
arXiv Detail & Related papers (2025-01-17T08:47:29Z) - Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System [62.832818186789545]
Virtual Scientists (VirSci) is a multi-agent system designed to mimic the teamwork inherent in scientific research.
VirSci organizes a team of agents to collaboratively generate, evaluate, and refine research ideas.
We show that this multi-agent approach outperforms the state-of-the-art method in producing novel scientific ideas.
arXiv Detail & Related papers (2024-10-12T07:16:22Z) - Artificial intelligence for science: The easy and hard problems [1.8722948221596285]
We study the cognitive science of scientists to understand how humans solve the hard problem.
We use the results to design new computational agents that automatically infer and update their scientific paradigms.
arXiv Detail & Related papers (2024-08-24T18:22:06Z) - DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents [49.74065769505137]
We introduce DISCOVERYWORLD, the first virtual environment for developing and benchmarking an agent's ability to perform complete cycles of novel scientific discovery.
It includes 120 different challenge tasks spanning eight topics, each with three levels of difficulty and several parametric variations.
We find that strong baseline agents that perform well in prior published environments struggle on most DISCOVERYWORLD tasks.
arXiv Detail & Related papers (2024-06-10T20:08:44Z) - "Turing Tests" For An AI Scientist [0.0]
This paper proposes a "Turing test for an AI scientist" to assess whether an AI agent can conduct scientific research independently.
We propose seven benchmark tests that evaluate an AI agent's ability to make groundbreaking discoveries in various scientific domains.
arXiv Detail & Related papers (2024-05-22T05:14:27Z) - A Review of Neuroscience-Inspired Machine Learning [58.72729525961739]
Bio-plausible credit assignment is compatible with practically any learning condition and is energy-efficient.
In this paper, we survey several vital algorithms that model bio-plausible rules of credit assignment in artificial neural networks.
We conclude by discussing the future challenges that will need to be addressed in order to make such algorithms more useful in practical applications.
arXiv Detail & Related papers (2024-02-16T18:05:09Z) - SciInstruct: a Self-Reflective Instruction Annotated Dataset for Training Scientific Language Models [57.96527452844273]
We introduce SciInstruct, a suite of scientific instructions for training scientific language models capable of college-level scientific reasoning.
We curated a diverse and high-quality dataset encompassing physics, chemistry, math, and formal proofs.
To verify the effectiveness of SciInstruct, we fine-tuned different language models with SciInstruct, i.e., ChatGLM3 (6B and 32B), Llama3-8B-Instruct, and Mistral-7B: MetaMath.
arXiv Detail & Related papers (2024-01-15T20:22:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.