Related papers: Automatic Generation of Benchmarks and Reliable LLM Judgment for Code Tasks

Automatic Generation of Benchmarks and Reliable LLM Judgment for Code Tasks

URL: http://arxiv.org/abs/2410.21071v1
Date: Mon, 28 Oct 2024 14:34:36 GMT
Title: Automatic Generation of Benchmarks and Reliable LLM Judgment for Code Tasks
Authors: Eitan Farchi, Shmulik Froimovich, Rami Katan, Orna Raz,
Abstract summary: This work introduces a methodology to generate and evaluate LaaJ implementations, utilizing an automatically generated benchmark. The benchmark is used both to develop and validate the LaaJs and to validate and test the LLM code related solution using the LaaJs. Our approach enables the creation of high quality code task solutions.
Score: 0.8274693573069442
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: LLMs can be used in a variety of code related tasks such as translating from one programming language to another, implementing natural language requirements and code summarization. Artifacts generated by state of the art LLM technology are expected to be useful in the sense that a user will be able to use the LLM generated artifact after a small number of easy modifications. Quantifying this vague notion is challenging and it is thus hard to determine the quality of code related LLM solutions. We refer to evaluation of LLM solutions using LLM judgment as "LLM as a Judge", or LaaJ for short. In this work we introduce a methodology to generate and evaluate LaaJ implementations, utilizing an automatically generated benchmark. The purpose of the benchmark is two fold, namely, it is used both to develop and validate the LaaJs and to validate and test the LLM code related solution using the LaaJs. To that end, we developed an automated benchmark generation engine, which generates code in multiple programming languages for multiple code related tasks and which serves as the input for LaaJ evaluation. We utilize a graph representation, G, of the potential code related generations. The graph vertices are generated artifacts and edges represent possible generations, e.g., the generation of a Java program from its natural language requirements. Utilizing a chain of LLM agents and G we generate code related artifacts. Using cycles in G we formulate expectations on the generated artifacts. Taking advantage of these formulated expectations enables the development and testing of reliable LLM judgement for usefulness of the artifacts generated by the solution. Our approach enables the creation of high quality code task solutions.

Related papers

On LLM-Assisted Generation of Smart Contracts from Business Processes [0.08192907805418582]
Large language models (LLMs) have changed the reality of how software is produced.<n>We present an exploratory study to investigate the use of LLMs for generating smart contract code from business process descriptions.<n>Our results show that LLM performance falls short of the perfect reliability required for smart contract development.
arXiv Detail & Related papers (2025-07-30T20:39:45Z)
IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs.<n>The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization [54.965787768076254]
Large Language Models have been recently exploited as judges for complex natural language processing tasks, such as Q&A.<n>We study the effectiveness of LLMs-as-a-judge for two code-related tasks, namely code generation and code summarization.
arXiv Detail & Related papers (2025-07-22T13:40:26Z)
CodeRAG: Supportive Code Retrieval on Bigraph for Real-World Code Generation [69.684886175768]
Large language models (LLMs) have shown promising performance in automated code generation. In this paper, we propose CodeRAG, a retrieval-augmented code generation framework. Experiments show that CodeRAG achieves significant improvements compared to no RAG scenarios.
arXiv Detail & Related papers (2025-04-14T09:51:23Z)
CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings [32.72039589832989]
Large language models (LLMs) have revolutionized code generation, automating programming with remarkable efficiency. These advancements challenge programming skills, ethics, and assessment integrity, making the detection of LLM-generated code essential for maintaining accountability and standards. We propose a framework capable of distinguishing between human- and LLM-written code across multiple programming languages, code generators, and domains.
arXiv Detail & Related papers (2025-03-17T21:41:37Z)
Large Language Models for Code Generation: A Comprehensive Survey of Challenges, Techniques, Evaluation, and Applications [0.9105696129628794]
Large Language Models (LLMs) have demonstrated their remarkable capabilities in numerous fields. This survey focuses on how LLMs empower users, regardless of their technical background, to use human languages to automatically generate executable code.
arXiv Detail & Related papers (2025-03-03T07:17:30Z)
Pragmatic Reasoning improves LLM Code Generation [35.78260347663757]
We propose CodeRSA, a novel code candidate reranking mechanism built upon the Rational Speech Act (RSA) framework. We evaluate CodeRSA using one of the latest Large Language Models on a popular code generation dataset.
arXiv Detail & Related papers (2025-02-20T12:44:26Z)
Evaluation of Code LLMs on Geospatial Code Generation [1.6834474847800562]
Large Language Models (LLMs) can generate Python code for data science and machine learning applications. Here, we show how we constructed an evaluation benchmark for code generation models, based on a selection of geospatial tasks. Our dataset will hopefully contribute to the development new models capable of solving geospatial coding tasks with high accuracy.
arXiv Detail & Related papers (2024-10-06T20:34:03Z)
Combining LLM Code Generation with Formal Specifications and Reactive Program Synthesis [0.7580487359358722]
Large Language Models (LLMs) struggle with accuracy and are unsuitable for high-risk applications. We introduce a solution that divides the code generation into two parts; one to be handled by an LLM and one to be handled by formal methods-based program synthesis.
arXiv Detail & Related papers (2024-09-18T15:59:06Z)
Beyond Code Generation: Assessing Code LLM Maturity with Postconditions [9.521621889147362]
We propose a code Large Language Model maturity model based on the postcondition generation problem. We augment the EvalPlus dataset to a postcondition testing benchmark, and evaluate several open-sourced models.
arXiv Detail & Related papers (2024-07-19T08:34:30Z)
CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation. We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks. We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z)
A Survey on Large Language Models for Code Generation [9.555952109820392]
Large Language Models (LLMs) have garnered remarkable advancements across diverse code-related tasks. This survey aims to bridge the gap between academia and practical development by providing a comprehensive and up-to-date literature review.
arXiv Detail & Related papers (2024-06-01T17:48:15Z)
CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities. We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution. We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z)
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components. CCCS addresses the exploration challenge by breaking the long sequences code generation task into a Curriculum of Code Completion Subtasks. FGO only optimize the model by masking the unexecuted code segments to provide Fine-Grained Optimization. Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z)
Mutation-based Consistency Testing for Evaluating the Code Understanding Capability of LLMs [5.549095839198671]
Large Language Models (LLMs) have shown remarkable capabilities in processing both natural and programming languages. We propose a novel method to assess the code understanding performance of LLMs, particularly focusing on subtle differences between code and its descriptions. We apply different types of code mutations, such as operator replacement and statement deletion, to generate inconsistent code-description pairs. We conduct a case study on the two popular LLMs, GPT-3.5 and GPT-4, using the state-of-the-art code generation benchmark, HumanEval-X.
arXiv Detail & Related papers (2024-01-11T14:27:43Z)
If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code) Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
LLatrieval: LLM-Verified Retrieval for Verifiable Generation [67.93134176912477]
Verifiable generation aims to let the large language model (LLM) generate text with supporting documents. We propose LLatrieval (Large Language Model Verified Retrieval), where the LLM updates the retrieval result until it verifies that the retrieved documents can sufficiently support answering the question. Experiments show that LLatrieval significantly outperforms extensive baselines and achieves state-of-the-art results.
arXiv Detail & Related papers (2023-11-14T01:38:02Z)
LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results. Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results. LEVER consistently improves over the base code LLMs(4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
arXiv Detail & Related papers (2023-02-16T18:23:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.