Related papers: RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback

URL: http://arxiv.org/abs/2510.06186v1
Date: Tue, 07 Oct 2025 17:45:35 GMT
Title: RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback
Authors: Chunyu Miao, Henry Peng Zou, Yangning Li, Yankai Chen, Yibo Wang, Fangxin Wang, Yifan Li, Wooseong Yang, Bowei He, Xinni Zhang, Dianzhi Yu, Hanchen Yang, Hoang H Nguyen, Yue Zhou, Jie Yang, Jizhou Guo, Wenzhe Fan, Chin-Yuan Yeh, Panpan Meng, Liancheng Fang, Jinhu Qi, Wei-Chieh Huang, Zhengyao Gu, Yuwei Han, Langzhou He, Yuyao Yang, Xue Liu, Irwin King, Philip S. Yu
Abstract summary: We present RECODE-H, a benchmark of 102 tasks from research papers and repositories. It includes structured instructions, unit tests, and a five-level feedback hierarchy to reflect realistic researcher-agent collaboration. We also present ReCodeAgent, a framework that integrates feedback into iterative code generation.
Score: 76.28414843494073
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) show promise in supporting scientific research implementation, yet their ability to generate correct and executable code remains limited. Existing works largely adopt one-shot settings, ignoring the iterative and feedback-driven nature of realistic scientific research development workflows. To address this gap, we present RECODE-H, a benchmark of 102 tasks from research papers and repositories that evaluates LLM agents through multi-turn interactions with LLM-simulated human feedback. It includes structured instructions, unit tests, and a five-level feedback hierarchy to reflect realistic researcher-agent collaboration. We further present ReCodeAgent, a framework that integrates feedback into iterative code generation. Experiments with leading LLMs, including GPT-5, Claude-Sonnet-4, DeepSeek-V3.1, and Gemini 2.5, show substantial performance gains with richer feedback, while also highlighting ongoing challenges in the generation of complex research code. RECODE-H establishes a foundation for developing adaptive, feedback-driven LLM agents in scientific research implementation.

Related papers

ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences [19.81372090301296]
ReplicatorBench is an end-to-end benchmark for evaluating AI agents in research replication across three stages. We develop ReplicatorAgent, an agentic framework equipped with necessary tools, like web search and iterative interaction with sandboxed environments. We evaluate ReplicatorAgent across four underlying large language models (LLMs), as well as different design choices of programming language and levels of code access.
arXiv Detail & Related papers (2026-02-11T20:42:10Z)
Evaluating Novelty in AI-Generated Research Plans Using Multi-Workflow LLM Pipelines [1.3986052226424095]
This paper investigates whether agentic systems employing iterative reasoning, evolutionary search, and decomposition can generate more novel and feasible research plans. We benchmark five reasoning architectures: Reflection-based iterative refinement, Sakana AI v2 evolutionary algorithms, Google Co-Scientist multi-agent framework, GPT Deep Research, and Gemini3 Pro multimodal long-context pipeline. Results reveal varied performance across research domains, with high-performing workflows maintaining feasibility without sacrificing creativity.
arXiv Detail & Related papers (2025-12-24T12:41:31Z)
LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research [32.35279830326718]
Large language model (LLM) agents have demonstrated remarkable potential in advancing scientific discovery. However, their capability in reproducing code from research papers, especially in the NLP domain, remains underexplored. We present LMR-BENCH, a benchmark designed to evaluate the capability of LLM agents on code reproduction from Language Modeling Research.
arXiv Detail & Related papers (2025-06-19T07:04:16Z)
MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research [70.72318131988102]
MLR-Bench is a comprehensive benchmark for evaluating AI agents on open-ended machine learning research. MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2) MLR-Judge, an automated evaluation framework combining LLM-based reviewers with carefully designed review rubrics to assess research quality; and (3) MLR-Agent, a modular agent scaffold capable of completing research tasks through four stages: idea generation, proposal formulation, experimentation, and paper writing.
arXiv Detail & Related papers (2025-05-26T13:18:37Z)
Enhancing Code Generation via Bidirectional Comment-Level Mutual Grounding [6.867043179943195]
Large Language Models (LLMs) have demonstrated unprecedented capability in code generation. Recent studies have shown that developers often struggle with inspecting and fixing incorrect code generated by LLMs. Inspired by the mutual grounding theory in communication, we propose an interactive approach that leverages code comments as a medium for developers and LLMs to establish a shared understanding.
arXiv Detail & Related papers (2025-05-12T17:20:30Z)
FutureGen: A RAG-based Approach to Generate the Future Work of Scientific Article [6.95264395009701]
The Future Work section of a scientific article outlines potential research directions by identifying gaps and limitations of a current study. In this study, we generate future work suggestions from a scientific article. We experimented with various Large Language Models (LLMs) integrated into Retrieval-Augmented Generation (RAG). Our results demonstrate that the RAG-based approach using GPT-4o mini, combined with an LLM feedback mechanism, outperforms other methods based on both qualitative and quantitative evaluations.
arXiv Detail & Related papers (2025-03-20T06:14:02Z)
ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning [53.817538122688944]
We introduce Reinforced Meta-thinking Agents (ReMA) to elicit meta-thinking behaviors from Reasoning of Large Language Models (LLMs). ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent for detailed executions. Empirical results from single-turn experiments demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks.
arXiv Detail & Related papers (2025-03-12T16:05:31Z)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions. We develop a taxonomy of bugs for incorrect code that includes three categories and 12 sub-categories, and analyze the root cause for common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
HumanEvo: An Evolution-aware Benchmark for More Realistic Evaluation of Repository-level Code Generation [36.1669124651617]
We conduct an empirical study to understand Large Language Models' code generation performance within settings that reflect the evolving nature of software development. We use an evolution-aware repository-level code generation dataset, namely HumanEvo, equipped with an automated execution-based evaluation tool. We find that previous evolution-ignored evaluation methods result in inflated performance of LLMs, with performance overestimations ranging from 10.0% to 61.1%.
arXiv Detail & Related papers (2024-06-11T03:19:18Z)
ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models [56.08917291606421]
ResearchAgent is an AI-based system for ideation and operationalization of novel work. ResearchAgent automatically defines novel problems, proposes methods and designs experiments, while iteratively refining them. We experimentally validate our ResearchAgent on scientific publications across multiple disciplines.
arXiv Detail & Related papers (2024-04-11T13:36:29Z)
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components. CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks. FGO optimizes the model only on executed code, masking the unexecuted code segments to provide Fine-Grained Optimization. Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z)

This list is automatically generated from the titles and abstracts of the papers on this site.