A Hierarchical and Evolvable Benchmark for Fine-Grained Code Instruction Following with Multi-Turn Feedback
- URL: http://arxiv.org/abs/2507.00699v1
- Date: Tue, 01 Jul 2025 11:51:40 GMT
- Title: A Hierarchical and Evolvable Benchmark for Fine-Grained Code Instruction Following with Multi-Turn Feedback
- Authors: Guoliang Duan, Mingwei Liu, Yanlin Wang, Chong Wang, Xin Peng, Zibin Zheng
- Abstract summary: Large language models (LLMs) have advanced significantly in code generation, yet their ability to follow complex programming instructions with layered and diverse constraints remains underexplored. We introduce MultiCodeIF, a comprehensive benchmark designed to evaluate instruction-following in code generation across multiple dimensions. We synthesize and evolve 2,021 code tasks sourced from 14 programming languages, supporting multi-turn evaluation through feedback-driven task variants.
- Score: 30.446511584123492
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have advanced significantly in code generation, yet their ability to follow complex programming instructions with layered and diverse constraints remains underexplored. Existing benchmarks often prioritize functional correctness, overlooking the nuanced requirements found in real-world development. We introduce MultiCodeIF, a comprehensive benchmark designed to evaluate instruction-following in code generation across multiple dimensions: constraint type, hierarchical levels, and iterative refinement. Built upon a structured taxonomy of 9 categories and 27 constraint types, MultiCodeIF enables granular assessment of both functional and non-functional instruction adherence. Using an automated pipeline, ConstraGen, we synthesize and evolve 2,021 code tasks sourced from 14 programming languages, supporting multi-turn evaluation through feedback-driven task variants. Empirical evaluation of six state-of-the-art LLMs uncovers substantial performance disparities. The top-performing model, Claude-3-7-Sonnet, achieves 63.0% average constraint satisfaction, while smaller models like Qwen3-1.7B fall to 44.8%. Models perform well on explicit constraints, but struggle with implicit or abstract constraints. Tasks with multiple hierarchical constraints significantly reduce model success rates, from 54.5% in single-level to just 18.8% in multi-level scenarios. However, structured feedback enables progressive improvement: average constraint satisfaction rises from 63.0% to 83.4% over four iterative refinement rounds. MultiCodeIF provides a scalable, constraint-aware, and feedback-sensitive framework to benchmark LLMs under realistic code generation scenarios, bridging the gap between synthetic evaluations and real-world instruction complexity. The full benchmark dataset, evaluation pipeline, and source code are available at https://github.com/SYSUSELab/MultiCodeIF.
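To make the benchmark's mechanics concrete, the sketch below shows one way a constraint-aware, multi-turn evaluation loop could look. It is a minimal illustration, not the actual ConstraGen pipeline: the `Constraint` encoding, the `generate` callable, and the feedback format are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    """One fine-grained requirement, e.g. 'use snake_case function names'."""
    name: str
    check: Callable[[str], bool]  # True if the generated code satisfies it

def violations(code: str, constraints: list[Constraint]) -> list[Constraint]:
    """Return the constraints the generated code fails to satisfy."""
    return [c for c in constraints if not c.check(code)]

def refine_loop(generate, task: str, constraints: list[Constraint],
                rounds: int = 4) -> str:
    """Multi-turn evaluation: re-prompt with structured feedback on violations."""
    prompt = task
    code = ""
    for turn in range(rounds):
        code = generate(prompt)
        failed = violations(code, constraints)
        satisfied = 1 - len(failed) / len(constraints)
        print(f"turn {turn}: {satisfied:.0%} of constraints satisfied")
        if not failed:
            break
        feedback = "; ".join(c.name for c in failed)
        prompt = f"{task}\nYour previous attempt violated: {feedback}. Fix it."
    return code

# Toy run with a stub 'model' that complies on its second attempt.
attempts = iter(["def F(): pass", "def f(): pass"])
refine_loop(lambda p: next(attempts), "Write a function f()",
            [Constraint("function name must be lowercase",
                        lambda c: "def f(" in c)])
```

The per-round satisfaction rate mirrors the paper's headline metric (average constraint satisfaction), and the feedback turn mirrors its finding that structured feedback lifts satisfaction from 63.0% to 83.4% over four refinement rounds.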
Related papers
- LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models [59.0256377330646]
Lens is a benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios. The dataset intrinsically supports evaluating MLLMs on image-invariable prompts, from basic perception to compositional reasoning. We evaluate 15+ frontier MLLMs such as Qwen2.5-VL-72B, InternVL3-78B, and GPT-4o, plus two reasoning models, QVQ-72B-preview and Kimi-VL.
arXiv Detail & Related papers (2025-05-21T15:06:59Z)
A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models [48.361839372110246]
We develop an automated instruction generation pipeline that performs constraint expansion, conflict detection, and instruction rewriting. We evaluate 19 large language models and uncover substantial variation in performance across constraint forms. In-depth analysis indicates that these gains stem primarily from modifications to the parameters of the model's attention modules.
arXiv Detail & Related papers (2025-05-12T14:16:55Z)
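Of the three pipeline stages, conflict detection is the easiest to picture: under a deliberately simplified encoding where each constraint pins one property to one value, a conflict is two constraints pinning the same property differently. The encoding below is an invented toy, not the paper's representation.

```python
from itertools import combinations

# Hypothetical flat encoding: (property, required_value) pairs.
constraints = [
    ("language", "python"),
    ("max_lines", 30),
    ("language", "java"),  # clashes with the first constraint
]

def find_conflicts(cs):
    """Return pairs that pin the same property to different values."""
    return [(a, b) for a, b in combinations(cs, 2)
            if a[0] == b[0] and a[1] != b[1]]

for a, b in find_conflicts(constraints):
    print(f"conflict on {a[0]!r}: {a[1]!r} vs {b[1]!r}")
```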
CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation [22.74831630054096]
We introduce CodeFlowBench, the first benchmark designed to comprehensively evaluate LLMs' ability to perform codeflow. CodeFlowBench comprises 5,258 problems from Codeforces and is continuously updated via an automated pipeline. Extensive experiments on 16 popular LLMs reveal significant performance degradation in multi-turn scenarios.
arXiv Detail & Related papers (2025-04-30T15:45:28Z)
SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints [59.645885492637845]
SOPBench is an evaluation pipeline that transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions. We evaluate 18 leading models, and results show the task is challenging even for top-tier models.
arXiv Detail & Related papers (2025-03-11T17:53:02Z)
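The directed-graph idea can be sketched in a few lines: functions become nodes, edges encode prerequisite steps, and an agent's call sequence is valid only if every call's prerequisites were already made. The refund workflow below is invented for illustration and is not SOPBench's actual schema.

```python
# Hypothetical SOP: an agent may call a step only after its prerequisites.
DEPENDENCIES = {
    "verify_identity": [],
    "check_order":     ["verify_identity"],
    "issue_refund":    ["verify_identity", "check_order"],
}

def check_trace(call_sequence):
    """Validate an agent's proposed call sequence against the SOP graph."""
    done = set()
    for step in call_sequence:
        missing = [d for d in DEPENDENCIES[step] if d not in done]
        if missing:
            return f"violation: called {step} before {missing}"
        done.add(step)
    return "SOP followed"

print(check_trace(["verify_identity", "check_order", "issue_refund"]))
print(check_trace(["issue_refund"]))  # flags the skipped prerequisites
```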
EquiBench: Benchmarking Large Language Models' Understanding of Program Semantics via Equivalence Checking [55.81461218284736]
EquiBench is a new benchmark for evaluating large language models (LLMs) via equivalence checking: deciding whether two programs produce identical outputs for all possible inputs. We evaluate 19 state-of-the-art LLMs and find that the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline.
arXiv Detail & Related papers (2025-02-18T02:54:25Z)
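Exact equivalence over all inputs is undecidable in general, which is what makes this a probing task for LLMs; a standard practical proxy is differential testing on sampled inputs. The sketch below shows that proxy (not EquiBench's own methodology), with two invented example programs.

```python
import random

def prog_a(xs):          # sort ascending, then reverse
    return sorted(xs)[::-1]

def prog_b(xs):          # sort descending directly
    return sorted(xs, reverse=True)

def differential_test(f, g, trials=1000, seed=0):
    """Sample random inputs; any mismatch proves inequivalence.
    Agreement on samples only *suggests* equivalence -- it is no proof."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-10, 10) for _ in range(rng.randint(0, 8))]
        if f(xs) != g(xs):
            return False, xs  # counterexample in hand
    return True, None

print(differential_test(prog_a, prog_b))  # (True, None) on these samples
```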
CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases. The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z)
Intention is All You Need: Refining Your Code from Your Intention [19.827036493004435]
This paper proposes an intention-based code refinement technique that enhances the conventional comment-to-code process. The approach consists of two key phases, Intention Extraction and Intention Guided Revision Generation, and achieves 79% accuracy in intention extraction and up to 66% in code refinement generation.
arXiv Detail & Related papers (2025-02-12T07:26:13Z)
CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models [106.11371409170818]
Large language models (LLMs) can act as agents with capabilities to self-refine and improve generated code autonomously.
We propose CodeTree, a framework for LLM agents to efficiently explore the search space in different stages of the code generation process.
Specifically, we adopt a unified tree structure to explicitly explore different coding strategies, generate corresponding coding solutions, and subsequently refine the solutions.
arXiv Detail & Related papers (2024-11-07T00:09:54Z)
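Read as a search problem, the tree structure amounts to best-first exploration: a critic scores each candidate, and the highest-scoring frontier node is expanded next. The skeleton below is a generic best-first loop under that reading; the toy `expand` and `score` functions stand in for CodeTree's generator and critic agents and are not the paper's interface.

```python
import heapq
from itertools import count

def best_first_search(root, expand, score, budget=20):
    """Expand the highest-scored node first; return the best node seen."""
    tie = count()  # tie-breaker so heapq never compares nodes directly
    frontier = [(-score(root), next(tie), root)]
    best = root
    while frontier and budget > 0:
        _, _, node = heapq.heappop(frontier)
        if score(node) > score(best):
            best = node
        for child in expand(node):
            heapq.heappush(frontier, (-score(child), next(tie), child))
        budget -= 1
    return best

# Toy stand-in: 'solutions' are ints, children refine the parent,
# and the 'critic' prefers values close to 42.
print(best_first_search(root=0,
                        expand=lambda n: [n + 7, n + 11],
                        score=lambda n: -abs(42 - n)))
```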
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
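The critique-and-correct loop can be pictured with Python's built-in `compile` standing in for the compiler and stubs in place of the model; the `generate` and `revise` callables below are assumptions, not the paper's interface.

```python
def compiler_feedback(code: str):
    """Return an error string if the code fails to compile, else None."""
    try:
        compile(code, "<generated>", "exec")
        return None
    except SyntaxError as e:
        return f"line {e.lineno}: {e.msg}"

def critique_and_fix(generate, revise, task: str, max_rounds: int = 3) -> str:
    """Training-free loop: regenerate code with its own errors as feedback."""
    code = generate(task)
    for _ in range(max_rounds):
        error = compiler_feedback(code)
        if error is None:
            return code
        code = revise(task, code, error)  # the model sees its own error
    return code

# Stub model: the first draft has a syntax error, the revision fixes it.
drafts = iter(["def add(a, b) return a + b",
               "def add(a, b):\n    return a + b"])
fixed = critique_and_fix(lambda t: next(drafts),
                         lambda t, c, e: next(drafts), "add two numbers")
print(compiler_feedback(fixed))  # None: the revised draft compiles
```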
Divide-and-Conquer Meets Consensus: Unleashing the Power of Functions in Code Generation [25.344800819245858]
FunCoder is a code generation framework incorporating the divide-and-conquer strategy with functional consensus.
FunCoder outperforms state-of-the-art methods by +9.8% on average in HumanEval, MBPP, xCodeEval and MATH with GPT-3.5 and GPT-4.
arXiv Detail & Related papers (2024-05-30T14:31:33Z)
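Functional consensus, keeping the candidate whose behavior most peers agree with, reduces to pairwise output comparison on shared inputs. The sketch below is one reading of that idea rather than FunCoder's implementation; the candidates and test inputs are toys.

```python
def functional_consensus(candidates, inputs):
    """Pick the candidate that agrees with the most peers on all inputs."""
    def agreement(f):
        return sum(all(f(x) == g(x) for x in inputs)
                   for g in candidates if g is not f)
    return max(candidates, key=agreement)

# Three 'generated' implementations of absolute value; one is buggy.
cands = [
    lambda x: abs(x),
    lambda x: -x if x < 0 else x,
    lambda x: x,                 # wrong for negative inputs
]
best = functional_consensus(cands, inputs=[-2, -1, 0, 3])
print(best(-2))  # 2: the majority behavior wins
```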
MapCoder: Multi-Agent Code Generation for Competitive Problem Solving [3.3856216159724983]
We introduce a new approach to code generation tasks leveraging multi-agent prompting.
Our framework, MapCoder, consists of four LLM agents specifically designed to emulate the stages of program synthesis.
Our method consistently delivers superior performance across various programming languages.
arXiv Detail & Related papers (2024-05-18T22:10:15Z)
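The four agents are typically framed as retrieval, planning, coding, and debugging stages; the pipeline shape can be sketched as below. Every 'agent' here is a stub enriching a shared context dict, and the stage behaviors and field names are placeholders, not MapCoder's prompts or data model.

```python
# Hypothetical staged pipeline; each function stands in for an LLM agent.
def retrieval_agent(problem):
    return {"problem": problem, "exemplars": "similar solved problems"}

def planning_agent(ctx):
    return {**ctx, "plan": "solution outline derived from the exemplars"}

def coding_agent(ctx):
    return {**ctx, "code": "def solve(inputs): ..."}

def debugging_agent(ctx):
    return {**ctx, "status": "ran sample tests, repaired failures"}

def mapcoder_style_pipeline(problem):
    """Chain the stage agents, each enriching a shared context."""
    ctx = retrieval_agent(problem)
    for agent in (planning_agent, coding_agent, debugging_agent):
        ctx = agent(ctx)
    return ctx

print(mapcoder_style_pipeline("pair-sum problem")["status"])
```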