SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers
- URL: http://arxiv.org/abs/2504.00255v1
- Date: Mon, 31 Mar 2025 22:02:24 GMT
- Title: SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers
- Authors: Yanzheng Xiang, Hanqi Yan, Shuyin Ouyang, Lin Gui, Yulan He
- Abstract summary: This study evaluates large language models (LLMs) in generating code from algorithm descriptions in recent NLP papers. To facilitate rigorous evaluation, we introduce SciReplicate-Bench, a benchmark of 100 tasks from 36 NLP papers published in 2024. Building on SciReplicate-Bench, we propose Sci-Reproducer, a multi-agent framework consisting of a Paper Agent that interprets algorithmic concepts from literature and a Code Agent that retrieves dependencies from repositories and implements solutions.
- Score: 16.80818230868491
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study evaluates large language models (LLMs) in generating code from algorithm descriptions in recent NLP papers. The task requires two key competencies: (1) algorithm comprehension: synthesizing information from papers and academic literature to understand implementation logic, and (2) coding expertise: identifying dependencies and correctly implementing necessary APIs. To facilitate rigorous evaluation, we introduce SciReplicate-Bench, a benchmark of 100 tasks from 36 NLP papers published in 2024, featuring detailed annotations and comprehensive test cases. Building on SciReplicate-Bench, we propose Sci-Reproducer, a multi-agent framework consisting of a Paper Agent that interprets algorithmic concepts from literature and a Code Agent that retrieves dependencies from repositories and implements solutions. To assess algorithm understanding, we introduce reasoning graph accuracy, which quantifies similarity between generated and reference reasoning graphs derived from code comments and structure. For evaluating implementation quality, we employ execution accuracy, CodeBLEU, and repository dependency/API recall metrics. In our experiments, we evaluate various powerful Non-Reasoning LLMs and Reasoning LLMs as foundational models. The best-performing LLM using Sci-Reproducer achieves only 39% execution accuracy, highlighting the benchmark's difficulty. Our analysis identifies missing or inconsistent algorithm descriptions as key barriers to successful reproduction. We will open-source our benchmark and code at https://github.com/xyzCS/SciReplicate-Bench.
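The abstract lists repository dependency/API recall among its implementation-quality metrics. As a minimal sketch of what such a recall score could look like, the snippet below compares the dotted call names extracted from a reference and a generated implementation; the AST-based extraction heuristic and the helper names `called_names` and `api_recall` are illustrative assumptions, not the benchmark's actual implementation.

```python
# Illustrative sketch (not the paper's exact metric): estimate "API recall" as
# the fraction of call names in a reference implementation that also appear in
# the generated code. The extraction heuristic here is an assumption.
import ast

def called_names(source: str) -> set[str]:
    """Collect dotted call names (e.g. 'torch.stack') from Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            parts = []
            target = node.func
            while isinstance(target, ast.Attribute):
                parts.append(target.attr)
                target = target.value
            if isinstance(target, ast.Name):
                parts.append(target.id)
            if parts:
                names.add(".".join(reversed(parts)))
    return names

def api_recall(reference: str, generated: str) -> float:
    ref, gen = called_names(reference), called_names(generated)
    return len(ref & gen) / len(ref) if ref else 1.0

reference_code = "import torch\ndef pool(h):\n    return torch.mean(torch.stack(h), dim=0)\n"
generated_code = "import torch\ndef pool(h):\n    return torch.stack(h).mean(0)\n"
print(api_recall(reference_code, generated_code))  # 0.5: 'torch.stack' recovered, 'torch.mean' missed
```

A higher score indicates that the generated code reuses more of the calls found in the reference implementation; the benchmark's real metric may identify repository dependencies and APIs differently.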
Related papers
- Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning [57.09163579304332]
We introduce PaperCoder, a framework that transforms machine learning papers into functional code repositories.
PaperCoder operates in three stages; in the planning stage, it designs the system architecture with diagrams, identifies file dependencies, and generates configuration files.
We then evaluate PaperCoder on generating code implementations from machine learning papers using both model-based and human evaluations.
arXiv Detail & Related papers (2025-04-24T01:57:01Z)
- PaperBench: Evaluating AI's Ability to Replicate AI Research [3.4567792239799133]
PaperBench is a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch. PaperBench contains 8,316 individually gradable tasks.
arXiv Detail & Related papers (2025-04-02T15:55:24Z)
- Rubric Is All You Need: Enhancing LLM-based Code Evaluation With Question-Specific Rubrics [1.3707925738322797]
We focus on LLM-based code evaluation and attempt to fill in the existing gaps. We propose novel multi-agent approaches using question-specific rubrics tailored to the problem statement. Our comprehensive analysis demonstrates that question-specific rubrics significantly enhance logical assessment of code in educational settings.
arXiv Detail & Related papers (2025-03-31T11:59:43Z)
- Code Summarization Beyond Function Level [0.213063058314067]
This study investigated the effectiveness of code summarization models beyond the function level. The fine-tuned state-of-the-art CodeT5+ base model excelled in code summarization. Repository-level summarization exhibited promising potential but requires significant computational resources.
arXiv Detail & Related papers (2025-02-23T20:31:21Z)
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
- On the Design and Analysis of LLM-Based Algorithms [74.7126776018275]
Large language models (LLMs) are used as sub-routines in algorithms.
LLMs have achieved remarkable empirical success.
Our proposed framework holds promise for advancing LLM-based algorithms.
arXiv Detail & Related papers (2024-07-20T07:39:07Z) - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback; a minimal sketch of such a loop appears after this list.
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification is a key element of machine learning applications.<n>We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines.<n>We conduct a large-scale empirical investigation of UQ and normalization techniques across eleven tasks, identifying the most effective approaches.
arXiv Detail & Related papers (2024-06-21T20:06:31Z) - A Gold Standard Dataset for the Reviewer Assignment Problem [117.59690218507565]
"Similarity score" is a numerical estimate of the expertise of a reviewer in reviewing a paper.
Our dataset consists of 477 self-reported expertise scores provided by 58 researchers.
For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases.
arXiv Detail & Related papers (2023-03-23T16:15:03Z)
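The entry above on code generated by LLMs describes a training-free iterative method in which the model critiques and corrects its own code using compiler feedback. The sketch below outlines one way such a loop could be wired up; the `llm` callable, the prompt wording, and the fixed repair budget are placeholder assumptions rather than that paper's actual procedure.

```python
# Illustrative sketch of a training-free self-critique loop: generate code,
# try to execute it, and feed the error back to the model for correction.
# The `llm` callable and prompts are placeholders, not the paper's method.
import traceback
from typing import Callable

def self_critique_loop(task: str, llm: Callable[[str], str], max_rounds: int = 3) -> str:
    """Generate code, execute it, and feed error feedback back to the model."""
    code = llm(f"Write Python code for the task:\n{task}")
    for _ in range(max_rounds):
        try:
            exec(compile(code, "<candidate>", "exec"), {})  # crude check: it must at least run
            return code  # no error raised: accept the candidate
        except Exception:
            feedback = traceback.format_exc(limit=1)
            code = llm(
                f"The following code failed with this error:\n{feedback}\n"
                f"Code:\n{code}\nReturn a corrected version."
            )
    return code  # best effort after the repair budget is exhausted

# Toy stand-in for a real model: first returns buggy code, then a fix.
_responses = iter(["print(undefined_name)", "print('fixed')"])
print(self_critique_loop("print a greeting", lambda prompt: next(_responses)))
```

In practice, the feedback signal would come from the project's real compiler or test suite rather than a bare `exec`, and the categorized bug types mentioned in that abstract would be folded into the critique prompt.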