Automatic Generation of Benchmarks and Reliable LLM Judgment for Code Tasks
- URL: http://arxiv.org/abs/2410.21071v1
- Date: Mon, 28 Oct 2024 14:34:36 GMT
- Title: Automatic Generation of Benchmarks and Reliable LLM Judgment for Code Tasks
- Authors: Eitan Farchi, Shmulik Froimovich, Rami Katan, Orna Raz,
- Abstract summary: This work introduces a methodology to generate and evaluate LaaJ implementations, utilizing an automatically generated benchmark.
The benchmark is used both to develop and validate the LaaJs and to validate and test the LLM code related solution using the LaaJs.
Our approach enables the creation of high quality code task solutions.
- Score: 0.8274693573069442
- License:
- Abstract: LLMs can be used in a variety of code related tasks such as translating from one programming language to another, implementing natural language requirements and code summarization. Artifacts generated by state of the art LLM technology are expected to be useful in the sense that a user will be able to use the LLM generated artifact after a small number of easy modifications. Quantifying this vague notion is challenging and it is thus hard to determine the quality of code related LLM solutions. We refer to evaluation of LLM solutions using LLM judgment as "LLM as a Judge", or LaaJ for short. In this work we introduce a methodology to generate and evaluate LaaJ implementations, utilizing an automatically generated benchmark. The purpose of the benchmark is two fold, namely, it is used both to develop and validate the LaaJs and to validate and test the LLM code related solution using the LaaJs. To that end, we developed an automated benchmark generation engine, which generates code in multiple programming languages for multiple code related tasks and which serves as the input for LaaJ evaluation. We utilize a graph representation, G, of the potential code related generations. The graph vertices are generated artifacts and edges represent possible generations, e.g., the generation of a Java program from its natural language requirements. Utilizing a chain of LLM agents and G we generate code related artifacts. Using cycles in G we formulate expectations on the generated artifacts. Taking advantage of these formulated expectations enables the development and testing of reliable LLM judgement for usefulness of the artifacts generated by the solution. Our approach enables the creation of high quality code task solutions.
Related papers
- Evaluation of Code LLMs on Geospatial Code Generation [1.6834474847800562]
Large Language Models (LLMs) can generate Python code for data science and machine learning applications.
Here, we show how we constructed an evaluation benchmark for code generation models, based on a selection of geospatial tasks.
Our dataset will hopefully contribute to the development new models capable of solving geospatial coding tasks with high accuracy.
arXiv Detail & Related papers (2024-10-06T20:34:03Z) - Combining LLM Code Generation with Formal Specifications and Reactive Program Synthesis [0.7580487359358722]
Large Language Models (LLMs) struggle with accuracy and are unsuitable for high-risk applications.
We introduce a solution that divides the code generation into two parts; one to be handled by an LLM and one to be handled by formal methods-based program synthesis.
arXiv Detail & Related papers (2024-09-18T15:59:06Z) - Beyond Code Generation: Assessing Code LLM Maturity with Postconditions [9.521621889147362]
We propose a code Large Language Model maturity model based on the postcondition generation problem.
We augment the EvalPlus dataset to a postcondition testing benchmark, and evaluate several open-sourced models.
arXiv Detail & Related papers (2024-07-19T08:34:30Z) - CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z) - A Survey on Large Language Models for Code Generation [9.555952109820392]
Large Language Models (LLMs) have garnered remarkable advancements across diverse code-related tasks.
This survey aims to bridge the gap between academia and practical development by providing a comprehensive and up-to-date literature review.
arXiv Detail & Related papers (2024-06-01T17:48:15Z) - CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z) - StepCoder: Improve Code Generation with Reinforcement Learning from
Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking the long sequences code generation task into a Curriculum of Code Completion Subtasks.
FGO only optimize the model by masking the unexecuted code segments to provide Fine-Grained Optimization.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z) - Mutation-based Consistency Testing for Evaluating the Code Understanding
Capability of LLMs [5.549095839198671]
Large Language Models (LLMs) have shown remarkable capabilities in processing both natural and programming languages.
We propose a novel method to assess the code understanding performance of LLMs, particularly focusing on subtle differences between code and its descriptions.
We apply different types of code mutations, such as operator replacement and statement deletion, to generate inconsistent code-description pairs.
We conduct a case study on the two popular LLMs, GPT-3.5 and GPT-4, using the state-of-the-art code generation benchmark, HumanEval-X.
arXiv Detail & Related papers (2024-01-11T14:27:43Z) - If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code
Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code)
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z) - LLatrieval: LLM-Verified Retrieval for Verifiable Generation [67.93134176912477]
Verifiable generation aims to let the large language model (LLM) generate text with supporting documents.
We propose LLatrieval (Large Language Model Verified Retrieval), where the LLM updates the retrieval result until it verifies that the retrieved documents can sufficiently support answering the question.
Experiments show that LLatrieval significantly outperforms extensive baselines and achieves state-of-the-art results.
arXiv Detail & Related papers (2023-11-14T01:38:02Z) - LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results.
Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results.
LEVER consistently improves over the base code LLMs(4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
arXiv Detail & Related papers (2023-02-16T18:23:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.