Related papers: AutoTest: Evolutionary Code Solution Selection with Test Cases

AutoTest: Evolutionary Code Solution Selection with Test Cases

URL: http://arxiv.org/abs/2408.12125v1
Date: Thu, 22 Aug 2024 04:38:41 GMT
Title: AutoTest: Evolutionary Code Solution Selection with Test Cases
Authors: Zhihua Duan, Jialin Wang,
Abstract summary: This study proposes AutoTest, a novel technique that combines automated test case generation with code solution execution. The HumanEval dataset consists of 164 programming problems, and AutoTest achieves approximately a 10% improvement over the baseline method in terms of pass@1 score.
Score: 1.4582633500696451
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the development of code generation techniques, selecting the correct code solution from multiple candidate solutions has become a crucial task. This study proposes AutoTest, a novel technique that combines automated test case generation with code solution execution to optimize the selection process using an evolutionary genetic algorithm. Firstly, AutoTest utilizes large pre-trained language models such as codegen-16B, code-davinci-002, and incoder-6B to provide code solutions and their corresponding test cases. Then, by executing the code solutions and evaluating their performance on the test cases, a consensus set is formed. Fine-grained ranking is achieved through the selection, mutation, and crossover mechanisms based on the evolutionary genetic algorithm, with the adjustment of alpha and beta parameters. Finally, the best code solution is chosen. AutoTest demonstrates significant performance improvements on the HumanEval benchmark test. The HumanEval dataset consists of 164 programming problems, and AutoTest achieves approximately a 10% improvement over the baseline method in terms of pass@1 score.

Related papers

Sample, Don't Search: Rethinking Test-Time Alignment for Language Models [55.2480439325792]
We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access.
arXiv Detail & Related papers (2025-04-04T00:41:40Z)
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding [49.56049319037421]
KodCode is a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data. It comprises question-solution-test triplets that are systematically validated via a self-verification procedure. This pipeline yields a large-scale, robust and diverse coding dataset.
arXiv Detail & Related papers (2025-03-04T19:17:36Z)
Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks. However, improvement is plateauing due to the exhaustion of readily available high-quality data. We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z)
GenX: Mastering Code and Test Generation with Execution Feedback [7.225594526057816]
We propose a novel approach that concurrently trains a code generation model and a test generation model. We introduce two strategies for test and code data augmentation and a new scoring function for code and test ranking. The results demonstrate that our models, when iteratively trained with an increasing number of test cases and code solutions, outperform those trained on the original dataset.
arXiv Detail & Related papers (2024-12-18T03:18:21Z)
CodeDPO: Aligning Code Models with Self Generated and Verified Source Code [52.70310361822519]
We propose CodeDPO, a framework that integrates preference learning into code generation to improve two key code preference factors: code correctness and efficiency. CodeDPO employs a novel dataset construction method, utilizing a self-generation-and-validation mechanism that simultaneously generates and evaluates code and test cases.
arXiv Detail & Related papers (2024-10-08T01:36:15Z)
TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark [24.14654309612826]
TestGenEval comprises 68,647 tests from 1,210 code and test file pairs across 11 well-maintained Python repositories. It covers initial tests authoring, test suite completion, and code coverage improvements. We evaluate several popular models, with sizes ranging from 7B to 405B parameters.
arXiv Detail & Related papers (2024-10-01T14:47:05Z)
B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests [16.19318541132026]
We show that within a Bayesian framework, the optimal selection strategy can be defined based on the posterior probability of the observed passing states between solutions and tests. We propose an efficient approach for approximating this optimal (yet uncomputable) strategy, where the approximation error is bounded by the correctness of prior knowledge.
arXiv Detail & Related papers (2024-09-13T10:22:08Z)
SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents [10.730852617039451]
We investigate the capability of LLM-based Code Agents to formalize user issues into test cases. We propose a novel benchmark based on popular GitHub repositories, containing real-world issues, ground-truth bug-fixes, and golden tests. We find that LLMs generally perform surprisingly well at generating relevant test cases, with Code Agents designed for code repair exceeding the performance of systems designed for test generation.
arXiv Detail & Related papers (2024-06-18T14:54:37Z)
V-STaR: Training Verifiers for Self-Taught Reasoners [71.53113558733227]
V-STaR trains a verifier using DPO that judges correctness of model-generated solutions. Running V-STaR for multiple iterations results in progressively better reasoners and verifiers.
arXiv Detail & Related papers (2024-02-09T15:02:56Z)
Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation. We propose Self- Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
Test-Time Training with Masked Autoencoders [54.983147122777574]
Test-time training adapts to a new test distribution on the fly by optimizing a model for each test input using self-supervision. In this paper, we use masked autoencoders for this one-sample learning problem.
arXiv Detail & Related papers (2022-09-15T17:59:34Z)
CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases. CodeT executes the code solutions using the generated test cases, and then chooses the best solution. We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
arXiv Detail & Related papers (2022-07-21T10:18:37Z)
Unit Test Case Generation with Transformers and Focal Context [10.220204860586582]
AthenaTest aims to generate unit test cases by learning from real-world focal methods and developer-written test cases. We introduce Methods2Test, the largest publicly available supervised parallel corpus of unit test case methods and corresponding focal methods in Java. We evaluate AthenaTest on five defects4j projects, generating 25K passing test cases covering 43.7% of the focal methods with only 30 attempts.
arXiv Detail & Related papers (2020-09-11T18:57:36Z)
Bloom Origami Assays: Practical Group Testing [90.2899558237778]
Group testing is a well-studied problem with several appealing solutions. Recent biological studies impose practical constraints for COVID-19 that are incompatible with traditional methods. We develop a new method combining Bloom filters with belief propagation to scale to larger values of n (more than 100) with good empirical results.
arXiv Detail & Related papers (2020-07-21T19:31:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.