RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback
- URL: http://arxiv.org/abs/2510.06186v1
- Date: Tue, 07 Oct 2025 17:45:35 GMT
- Title: RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback
- Authors: Chunyu Miao, Henry Peng Zou, Yangning Li, Yankai Chen, Yibo Wang, Fangxin Wang, Yifan Li, Wooseong Yang, Bowei He, Xinni Zhang, Dianzhi Yu, Hanchen Yang, Hoang H Nguyen, Yue Zhou, Jie Yang, Jizhou Guo, Wenzhe Fan, Chin-Yuan Yeh, Panpan Meng, Liancheng Fang, Jinhu Qi, Wei-Chieh Huang, Zhengyao Gu, Yuwei Han, Langzhou He, Yuyao Yang, Xue Liu, Irwin King, Philip S. Yu
- Abstract summary: We present RECODE-H, a benchmark of 102 tasks from research papers and repositories. It includes structured instructions, unit tests, and a five-level feedback hierarchy to reflect realistic researcher-agent collaboration. We also present ReCodeAgent, a framework that integrates feedback into iterative code generation.
- Score: 76.28414843494073
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) show promise in supporting scientific research implementation, yet their ability to generate correct and executable code remains limited. Existing works largely adopt one-shot settings, ignoring the iterative and feedback-driven nature of realistic scientific research development workflows. To address this gap, we present RECODE-H, a benchmark of 102 tasks from research papers and repositories that evaluates LLM agents through multi-turn interactions with LLM-simulated human feedback. It includes structured instructions, unit tests, and a five-level feedback hierarchy to reflect realistic researcher-agent collaboration. We further present ReCodeAgent, a framework that integrates feedback into iterative code generation. Experiments with leading LLMs, including GPT-5, Claude-Sonnet-4, DeepSeek-V3.1, and Gemini 2.5, show substantial performance gains with richer feedback, while also highlighting ongoing challenges in the generation of complex research code. RECODE-H establishes a foundation for developing adaptive, feedback-driven LLM agents in scientific research implementation.
Related papers
- ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences [19.81372090301296]
ReplicatorBench is an end-to-end benchmark for evaluating AI agents in research replication across three stages. We develop ReplicatorAgent, an agentic framework equipped with necessary tools, like web search and iterative interaction with sandboxed environments. We evaluate ReplicatorAgent across four underlying large language models (LLMs), as well as different design choices of programming language and levels of code access.
arXiv Detail & Related papers (2026-02-11T20:42:10Z)
- Evaluating Novelty in AI-Generated Research Plans Using Multi-Workflow LLM Pipelines [1.3986052226424095]
This paper investigates whether agentic systems employing iterative reasoning, evolutionary search, and decomposition can generate more novel and feasible research plans. We benchmark five reasoning architectures: Reflection-based iterative refinement, Sakana AI v2 evolutionary algorithms, Google Co-Scientist multi-agent framework, GPT Deep Research, and Gemini 3 Pro multimodal long-context pipeline. Results reveal varied performance across research domains, with high-performing workflows maintaining feasibility without sacrificing creativity.
arXiv Detail & Related papers (2025-12-24T12:41:31Z)
- LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research [32.35279830326718]
Large language model (LLM) agents have demonstrated remarkable potential in advancing scientific discovery. However, their capability in reproducing code from research papers, especially in the NLP domain, remains underexplored. We present LMR-BENCH, a benchmark designed to evaluate the capability of LLM agents on code reproduction from Language Modeling Research.
arXiv Detail & Related papers (2025-06-19T07:04:16Z)
- MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research [70.72318131988102]
MLR-Bench is a comprehensive benchmark for evaluating AI agents on open-ended machine learning research. MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2) MLR-Judge, an automated evaluation framework combining LLM-based reviewers with carefully designed review rubrics to assess research quality; and (3) MLR-Agent, a modular agent scaffold capable of completing research tasks through four stages: idea generation, proposal formulation, experimentation, and paper writing.
arXiv Detail & Related papers (2025-05-26T13:18:37Z)
- Enhancing Code Generation via Bidirectional Comment-Level Mutual Grounding [6.867043179943195]
Large Language Models (LLMs) have demonstrated unprecedented capability in code generation. Recent studies have shown that developers often struggle with inspecting and fixing incorrect code generated by LLMs. Inspired by the mutual grounding theory in communication, we propose an interactive approach that leverages code comments as a medium for developers and LLMs to establish a shared understanding.
arXiv Detail & Related papers (2025-05-12T17:20:30Z)
- FutureGen: A RAG-based Approach to Generate the Future Work of Scientific Article [6.95264395009701]
The Future Work section of a scientific article outlines potential research directions by identifying gaps and limitations of a current study. In this study, we generate future work suggestions from a scientific article. We experimented with various Large Language Models (LLMs) integrated into Retrieval-Augmented Generation (RAG). Our results demonstrate that the RAG-based approach using GPT-4o mini, combined with an LLM feedback mechanism, outperforms other methods based on both qualitative and quantitative evaluations.
arXiv Detail & Related papers (2025-03-20T06:14:02Z)
- ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning [53.817538122688944]
We introduce Reinforced Meta-thinking Agents (ReMA) to elicit meta-thinking behaviors from Reasoning of Large Language Models (LLMs). ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent for detailed executions. Empirical results from single-turn experiments demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks.
arXiv Detail & Related papers (2025-03-12T16:05:31Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect code that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- HumanEvo: An Evolution-aware Benchmark for More Realistic Evaluation of Repository-level Code Generation [36.1669124651617]
We conduct an empirical study to understand Large Language Models' code generation performance within settings that reflect the evolving nature of software development. We use an evolution-aware repository-level code generation dataset, namely HumanEvo, equipped with an automated execution-based evaluation tool. We find that previous evolution-ignored evaluation methods result in inflated performance of LLMs, with performance overestimations ranging from 10.0% to 61.1%.
arXiv Detail & Related papers (2024-06-11T03:19:18Z)
- ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models [56.08917291606421]
ResearchAgent is an AI-based system for ideation and operationalization of novel work. ResearchAgent automatically defines novel problems, proposes methods and designs experiments, while iteratively refining them. We experimentally validate our ResearchAgent on scientific publications across multiple disciplines.
arXiv Detail & Related papers (2024-04-11T13:36:29Z)
- StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components: CCCS and FGO.
CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks.
FGO optimizes the model only on executed code, masking the unexecuted code segments to provide Fine-Grained Optimization.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.