Related papers: CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation

CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation

URL: http://arxiv.org/abs/2502.19166v2
Date: Wed, 05 Mar 2025 11:09:06 GMT
Title: CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation
Authors: Kaiwen Yan, Hongcheng Guo, Xuanqing Shi, Jingyi Xu, Yaonan Gu, Zhoujun Li,
Abstract summary: We introduce CodeIF, the first benchmark designed to assess the abilities of Large Language Models (LLMs) to adhere to task-oriented instructions within code generation scenarios.<n>We conduct extensive experiments with LLMs, analyzing their strengths and limitations in meeting the demands of these tasks.
Score: 24.090719826360342
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the rapid advancement of Large Language Models (LLMs), the demand for robust instruction-following capabilities in code generation tasks has grown significantly. Code generation not only facilitates faster prototyping and automated testing, but also augments developer efficiency through improved maintainability and reusability of code. In this paper, we introduce CodeIF, the first benchmark specifically designed to assess the abilities of LLMs to adhere to task-oriented instructions within diverse code generation scenarios. CodeIF encompasses a broad range of tasks, including function synthesis, error debugging, algorithmic refactoring, and code explanation, thereby providing a comprehensive suite to evaluate model performance across varying complexity levels and programming domains. We conduct extensive experiments with LLMs, analyzing their strengths and limitations in meeting the demands of these tasks. The experimental results offer valuable insights into how well current models align with human instructions, as well as the extent to which they can generate consistent, maintainable, and contextually relevant code. Our findings not only underscore the critical role that instruction-following LLMs can play in modern software development, but also illuminate pathways for future research aimed at enhancing their adaptability, reliability, and overall effectiveness in automated code generation.

Related papers

IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs.<n>The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks [56.34018316319873]
We propose MERA Code, a benchmark for evaluating code for the latest code generation LLMs in Russian.<n>This benchmark includes 11 evaluation tasks that span 8 programming languages.<n>We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages.
arXiv Detail & Related papers (2025-07-16T14:31:33Z)
ClarifyCoder: Clarification-Aware Fine-Tuning for Programmatic Problem Solving [3.683434365857386]
We present ClarifyCoder, a novel framework with synthetic data generation and instruction-tuning. We argue that the fundamental ability to recognize and query ambiguous requirements should be intrinsic to the models themselves. Our approach consists of two main components: (1) a data synthesis technique that augments existing programming datasets with scenarios requiring clarification to generate clarification-aware training data, and (2) a fine-tuning strategy that teaches models to prioritize seeking clarification over immediate code generation when faced with incomplete or ambiguous requirements.
arXiv Detail & Related papers (2025-04-23T00:34:39Z)
LLM Benchmarking with LLaMA2: Evaluating Code Development Performance Across Multiple Programming Languages [0.1906498126334485]
This paper evaluates the capabilities of the Llama 2-70B model in automating scientific applications written in programming languages.<n>We assess the model's capacity to generate code, documentation, and unit tests, as well as its ability to translate existing code between programming languages.<n>Our results indicate that while Llama 2-70B frequently generates syntactically correct and functional code for simpler numerical tasks, it encounters substantial difficulties with more complex, parallelized, or distributed computations.
arXiv Detail & Related papers (2025-03-24T23:46:14Z)
CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation [10.438717413104062]
We introduce bench, a benchmark for evaluating Large Language Models' instruction-following capabilities. bench incorporates nine types of verifiable instructions aligned with the real-world software development requirements. We evaluate nine prominent LLMs using bench, and the experimental results reveal a significant disparity between their basic programming capability and instruction-following capability.
arXiv Detail & Related papers (2025-03-05T09:47:02Z)
Large Language Models for Code Generation: A Comprehensive Survey of Challenges, Techniques, Evaluation, and Applications [0.9105696129628794]
Large Language Models (LLMs) have demonstrated their remarkable capabilities in numerous fields. This survey focuses on how LLMs empower users, regardless of their technical background, to use human languages to automatically generate executable code.
arXiv Detail & Related papers (2025-03-03T07:17:30Z)
Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs [53.00384299879513]
In large language models (LLMs), code and reasoning reinforce each other.<n>Code provides verifiable execution paths, enforces logical decomposition, and enables runtime validation.<n>We identify key challenges and propose future research directions to strengthen this synergy.
arXiv Detail & Related papers (2025-02-26T18:55:42Z)
SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors [0.0]
Large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks, such as code understanding and code generation.<n>However, an equally important yet underexplored question is whether LLMs can serve as general-purpose surrogate code executors.<n>This study provides empirical insights into the feasibility of using LLMs as surrogate code executors.
arXiv Detail & Related papers (2025-02-16T15:38:19Z)
Resource-Efficient & Effective Code Summarization [3.512140256677132]
GreenAI techniques, such as QLoRA, offer a promising path for dealing with large models' sustainability.<n>Our study evaluates two state-of-the-art CLMs across two programming languages: Python and Java.<n>Results show that QLoRA enables efficient fine-tuning of CLMs for code summarization.
arXiv Detail & Related papers (2025-02-05T21:06:30Z)
Guided Code Generation with LLMs: A Multi-Agent Framework for Complex Code Tasks [1.9198713957364215]
Large Language Models (LLMs) have shown remarkable capabilities in code generation tasks.<n>They face significant limitations in handling complex, long-context programming challenges.<n>This paper introduces a novel agentic framework for guided code generation''
arXiv Detail & Related papers (2025-01-11T19:21:53Z)
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [76.59316249991657]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems. While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited. We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z)
RGD: Multi-LLM Based Agent Debugger via Refinement and Generation Guidance [0.6062751776009752]
Large Language Models (LLMs) have shown incredible potential in code generation tasks. LLMs can generate code based on task descriptions, but accuracy remains limited. We introduce a novel architecture of LLM-based agents for code generation and automatic debug: Refinement and Guidance debugger (RGD) RGD decomposes the code generation task into multiple steps, ensuring a clearer workflow and enabling iterative code refinement based on self-reflection and feedback.
arXiv Detail & Related papers (2024-10-02T05:07:02Z)
A Survey on Evaluating Large Language Models in Code Generation Tasks [30.256255254277914]
This paper provides a comprehensive review of the current methods and metrics used to evaluate the performance of Large Language Models (LLMs) in code generation tasks. With the rapid growth in demand for automated software development, LLMs have demonstrated significant potential in the field of code generation.
arXiv Detail & Related papers (2024-08-29T12:56:06Z)
Case2Code: Scalable Synthetic Data for Code Generation [105.89741089673575]
Large Language Models (LLMs) have shown outstanding breakthroughs in code generation. Recent work improves code LLMs by training on synthetic data generated by some powerful LLMs. We propose a textbfCase2Code task by exploiting the expressiveness and correctness of programs.
arXiv Detail & Related papers (2024-07-17T11:35:00Z)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions. We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components. CCCS addresses the exploration challenge by breaking the long sequences code generation task into a Curriculum of Code Completion Subtasks. FGO only optimize the model by masking the unexecuted code segments to provide Fine-Grained Optimization. Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z)
If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code) Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.