B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests
- URL: http://arxiv.org/abs/2409.08692v1
- Date: Fri, 13 Sep 2024 10:22:08 GMT
- Title: B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests
- Authors: Mouxiang Chen, Zhongxin Liu, He Tao, Yusu Hong, David Lo, Xin Xia, Jianling Sun
- Abstract summary: We show that within a Bayesian framework, the optimal selection strategy can be defined based on the posterior probability of the observed passing states between solutions and tests.
We propose an efficient approach for approximating this optimal (yet uncomputable) strategy, where the approximation error is bounded by the correctness of prior knowledge.
- Score: 16.19318541132026
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Selecting the best code solution from multiple generated ones is an essential task in code generation, which can be achieved by using some reliable validators (e.g., developer-written test cases) for assistance. Since reliable test cases are not always available and can be expensive to build in practice, researchers propose to automatically generate test cases to assess code solutions. However, when both code solutions and test cases are plausible and not reliable, selecting the best solution becomes challenging. Although some heuristic strategies have been proposed to tackle this problem, they lack a strong theoretical guarantee and it is still an open question whether an optimal selection strategy exists. Our work contributes in two ways. First, we show that within a Bayesian framework, the optimal selection strategy can be defined based on the posterior probability of the observed passing states between solutions and tests. The problem of identifying the best solution is then framed as an integer programming problem. Second, we propose an efficient approach for approximating this optimal (yet uncomputable) strategy, where the approximation error is bounded by the correctness of prior knowledge. We then incorporate effective prior knowledge tailored to code generation tasks. Both theoretical and empirical studies confirm that existing heuristics are limited in selecting the best solutions with plausible test cases. Our proposed approximated optimal strategy B4 significantly surpasses existing heuristics in selecting code solutions generated by large language models (LLMs) with LLM-generated tests, achieving a relative performance improvement of up to 50% over the strongest heuristic and 246% over random selection in the most challenging scenarios. Our code is publicly available at https://github.com/ZJU-CTAG/B4.
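For intuition, here is a minimal, hypothetical sketch of the selection setting in Python. It is not the B4 algorithm (see the repository above for that): solutions are grouped by their pass vector over the generated tests, and each candidate consensus set is scored by a naive independence prior. `run`, `P_SOLUTION`, and `P_TEST` are illustrative assumptions, not names or values from the paper.

```python
import math
from collections import defaultdict

# Hypothetical prior beliefs, not values from the paper: the assumed
# probability that a generated solution (resp. test) is correct.
P_SOLUTION = 0.6
P_TEST = 0.7

def select_best(solutions, tests, run):
    """Select one solution when both solutions and tests are plausible.

    A minimal Bayesian-flavoured sketch of the setting, not B4 itself.
    `run(s, t)` is a hypothetical callable returning True iff solution
    s passes test t.
    """
    # Group solutions by their pass vector: solutions that pass exactly
    # the same tests form one candidate "consensus set".
    clusters = defaultdict(list)
    for s in solutions:
        passed = frozenset(t for t in tests if run(s, t))
        clusters[passed].append(s)

    def log_score(passed, members):
        # Naive independence prior over the hypothesis "these members
        # and these passed tests are the correct ones"; the observed
        # pass matrix enters only through the clustering step.
        n_s, n_t = len(members), len(passed)
        return (n_s * math.log(P_SOLUTION)
                + (len(solutions) - n_s) * math.log(1 - P_SOLUTION)
                + n_t * math.log(P_TEST)
                + (len(tests) - n_t) * math.log(1 - P_TEST))

    _, best = max(clusters.items(), key=lambda kv: log_score(*kv))
    return best[0]
```

The paper's contribution is to make this kind of prior-driven scoring principled: the optimal strategy maximizes the posterior probability of the observed passing states, and B4 approximates it with an error bounded by the correctness of the prior knowledge.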
Related papers
- Learning Multiple Initial Solutions to Optimization Problems [52.9380464408756]
Sequentially solving similar optimization problems under strict runtime constraints is essential for many applications.
We propose learning to predict multiple diverse initial solutions given parameters that define the problem instance.
We find significant and consistent improvement with our method across all evaluation settings and demonstrate that it efficiently scales with the number of initial solutions required.
arXiv Detail & Related papers (2024-11-04T15:17:19Z)
- CodeDPO: Aligning Code Models with Self Generated and Verified Source Code [52.70310361822519]
We propose CodeDPO, a framework that integrates preference learning into code generation to improve two key code preference factors: code correctness and efficiency.
CodeDPO employs a novel dataset construction method, utilizing a self-generation-and-validation mechanism that simultaneously generates and evaluates code and test cases.
arXiv Detail & Related papers (2024-10-08T01:36:15Z)
- AutoTest: Evolutionary Code Solution Selection with Test Cases [1.4582633500696451]
This study proposes AutoTest, a novel technique that combines automated test case generation with code solution execution.
The HumanEval dataset consists of 164 programming problems, and AutoTest achieves approximately a 10% improvement over the baseline method in terms of pass@1 score.
arXiv Detail & Related papers (2024-08-22T04:38:41Z)
- Top Pass: Improve Code Generation by Pass@k-Maximized Code Ranking [11.109866941442641]
Top Pass is a code ranking approach that identifies potential correct solutions from a large number of candidates.
This enables the user to find the correct solution within as few tries as possible.
arXiv Detail & Related papers (2024-08-11T07:53:51Z)
- Dual-Directed Algorithm Design for Efficient Pure Exploration [11.492736493413103]
We consider pure-exploration problems in the context of sequential adaptive experiments with a finite set of alternative options.
We derive a sufficient condition for optimality in terms of a notion of strong convergence to the optimal allocation of samples.
Our algorithm is optimal for $\epsilon$-best-arm identification and thresholding bandit problems.
arXiv Detail & Related papers (2023-10-30T07:29:17Z)
- Functional Overlap Reranking for Neural Code Generation [6.665515707408405]
We introduce SRank, a novel reranking strategy for selecting the best solutions from code generation.
By quantifying the functional overlap between solution clusters, our approach provides a better ranking strategy for code solutions.
Empirical results show that our method achieves remarkable results on the pass@1 score.
arXiv Detail & Related papers (2023-10-16T22:20:31Z)
- Coverage Goal Selector for Combining Multiple Criteria in Search-Based Unit Test Generation [26.121557667962556]
Unit testing is critical to ensuring the correctness of the individual units of a program.
Search-based software testing (SBST) is an automated approach to generating test cases.
arXiv Detail & Related papers (2023-09-14T08:35:03Z)
- CodeT: Code Generation with Generated Tests [49.622590050797236]
We explore the use of pre-trained language models to automatically generate test cases.
CodeT executes the code solutions using the generated test cases, and then chooses the best solution via dual execution agreement (sketched after this list).
We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks.
arXiv Detail & Related papers (2022-07-21T10:18:37Z)
- Learning Proximal Operators to Discover Multiple Optima [66.98045013486794]
We present an end-to-end method to learn the proximal operator across a family of non-convex problems.
We show that for weakly-convex objectives and under mild conditions, the method converges globally.
arXiv Detail & Related papers (2022-01-28T05:53:28Z)
- Model Selection in Batch Policy Optimization [88.52887493684078]
We study the problem of model selection in batch policy optimization.
We identify three sources of error that any model selection algorithm should optimally trade-off in order to be competitive.
arXiv Detail & Related papers (2021-12-23T02:31:50Z)
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
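As referenced in the CodeT entry above, the dual-execution-agreement idea can be sketched in a few lines. This is a hypothetical illustration, not CodeT's actual interface (`run(s, t)` is a placeholder returning True iff solution s passes generated test t): it reuses the same pass-vector grouping as the earlier sketch, but ranks each consensus set simply by the number of agreeing solutions multiplied by the number of generated tests they pass.

```python
from collections import defaultdict

def codet_select(solutions, tests, run):
    """Rank solutions by CodeT-style dual execution agreement."""
    # Solutions that pass exactly the same generated tests agree with
    # each other and form one consensus set.
    clusters = defaultdict(list)
    for s in solutions:
        passed = frozenset(t for t in tests if run(s, t))
        clusters[passed].append(s)
    # Dual agreement: a consensus set is scored by (number of agreeing
    # solutions) x (number of generated tests those solutions pass).
    _, members = max(clusters.items(),
                     key=lambda kv: len(kv[1]) * len(kv[0]))
    return members[0]
```

Compared with the Bayesian-flavoured score in the first sketch, this product rule is a heuristic; the B4 paper argues such heuristics are limited when both solutions and tests are unreliable.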
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.