ProjectEval: A Benchmark for Programming Agents Automated Evaluation on Project-Level Code Generation
- URL: http://arxiv.org/abs/2503.07010v1
- Date: Mon, 10 Mar 2025 07:47:27 GMT
- Title: ProjectEval: A Benchmark for Programming Agents Automated Evaluation on Project-Level Code Generation
- Authors: Kaiyuan Liu, Youcheng Pan, Jing Li, Daojing He, Yang Xiang, Yexing Du, Tianrun Gao,
- Abstract summary: We introduce ProjectEval, a new benchmark for LLM agents project-level code generation's automated evaluation by simulating user interaction.<n>ProjectEval can evaluate the generated projects by user interaction simulation for execution, and by code similarity through existing objective indicators.<n>We find that systematic engineering project code, overall understanding of the project and comprehensive analysis capability are the keys for LLM agents to achieve practical projects.
- Score: 10.748303323995986
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, LLM agents have made rapid progress in improving their programming capabilities. However, existing benchmarks lack the ability to automatically evaluate from users' perspective, and also lack the explainability of the results of LLM agents' code generation capabilities. Thus, we introduce ProjectEval, a new benchmark for LLM agents project-level code generation's automated evaluation by simulating user interaction. ProjectEval is constructed by LLM with human reviewing. It has three different level inputs of natural languages or code skeletons. ProjectEval can evaluate the generated projects by user interaction simulation for execution, and by code similarity through existing objective indicators. Through ProjectEval, we find that systematic engineering project code, overall understanding of the project and comprehensive analysis capability are the keys for LLM agents to achieve practical projects. Our findings and benchmark provide valuable insights for developing more effective programming agents that can be deployed in future real-world production.
Related papers
- Assessing LLMs for Front-end Software Architecture Knowledge [0.0]
Large Language Models (LLMs) have demonstrated significant promise in automating software development tasks.<n>This study investigates the capabilities of an LLM in understanding, reproducing, and generating structures within the VIPER architecture.<n> Experimental results, using ChatGPT 4 Turbo 2024-04-09, reveal that the LLM excelled in higher-order tasks like evaluating and creating, but faced challenges with lower-order tasks requiring precise retrieval of architectural details.
arXiv Detail & Related papers (2025-02-26T19:33:35Z) - FVEval: Understanding Language Model Capabilities in Formal Verification of Digital Hardware [4.480157114854711]
We present FVEval, the first comprehensive benchmark for characterizing large language models (LLMs) performance in tasks pertaining to formal verification (FV)
The benchmark consists of three sub-tasks that measure LLM capabilities at different levels.
We present both collections of expert-written verification collateral and methodologies to scalably generate synthetic examples aligned with FV.
arXiv Detail & Related papers (2024-10-15T21:48:57Z) - A Survey on Evaluating Large Language Models in Code Generation Tasks [30.256255254277914]
This paper provides a comprehensive review of the current methods and metrics used to evaluate the performance of Large Language Models (LLMs) in code generation tasks.
With the rapid growth in demand for automated software development, LLMs have demonstrated significant potential in the field of code generation.
arXiv Detail & Related papers (2024-08-29T12:56:06Z) - CIBench: Evaluating Your LLMs with a Code Interpreter Plugin [68.95137938214862]
We propose an interactive evaluation framework, named CIBench, to comprehensively assess LLMs' ability to utilize code interpreters for data science tasks.
The evaluation dataset is constructed using an LLM-human cooperative approach and simulates an authentic workflow by leveraging consecutive and interactive IPython sessions.
We conduct extensive experiments to analyze the ability of 24 LLMs on CIBench and provide valuable insights for future LLMs in code interpreter utilization.
arXiv Detail & Related papers (2024-07-15T07:43:55Z) - Towards more realistic evaluation of LLM-based code generation: an experimental study and beyond [36.1669124651617]
We conduct an empirical study to understand Large Language Models' code generation performance within settings that reflect the evolving nature of software development.
We find that previous evolving-ignored evaluation approaches lead to inflated performance of the LLMs, ranging from 10.0% to 61.1%.
arXiv Detail & Related papers (2024-06-11T03:19:18Z) - RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z) - Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming [12.355284125578342]
Large Language Models (LLMs) have become a focal point in modern software development.
LLMs offer the potential to significantly augment developer productivity by serving as intelligent, chat-driven programming assistants.
However, each system requires the LLM to be honed to its set of workspaces to ensure the best performance.
arXiv Detail & Related papers (2024-02-22T03:51:34Z) - CodePori: Large-Scale System for Autonomous Software Development Using Multi-Agent Technology [4.2990995991059275]
Large Language Models (LLMs) and Generative Pre-trained Transformers (GPTs) have transformed the field of Software Engineering.
We introduce CodePori, a novel system designed to automate code generation for large and complex software projects.
Results: CodePori is able to generate running code for large-scale projects, aligned with the typical software development process.
arXiv Detail & Related papers (2024-02-02T13:42:50Z) - Can Large Language Models be Trusted for Evaluation? Scalable
Meta-Evaluation of LLMs as Evaluators via Agent Debate [74.06294042304415]
We propose ScaleEval, an agent-debate-assisted meta-evaluation framework.
We release the code for our framework, which is publicly available on GitHub.
arXiv Detail & Related papers (2024-01-30T07:03:32Z) - TaskBench: Benchmarking Large Language Models for Task Automation [82.2932794189585]
We introduce TaskBench, a framework to evaluate the capability of large language models (LLMs) in task automation.
Specifically, task decomposition, tool selection, and parameter prediction are assessed.
Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation.
arXiv Detail & Related papers (2023-11-30T18:02:44Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - AgentBench: Evaluating LLMs as Agents [88.45506148281379]
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks.
We present AgentBench, a benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.