Dynamic Scaling of Unit Tests for Code Reward Modeling
- URL: http://arxiv.org/abs/2501.01054v1
- Date: Thu, 02 Jan 2025 04:33:31 GMT
- Title: Dynamic Scaling of Unit Tests for Code Reward Modeling
- Authors: Zeyao Ma, Xiaokang Zhang, Jing Zhang, Jifan Yu, Sijia Luo, Jie Tang
- Abstract summary: Current large language models (LLMs) often struggle to produce accurate responses on the first attempt for complex reasoning tasks like code generation.
We propose CodeRM-8B, a lightweight yet effective unit test generator that enables efficient and high-quality unit test scaling.
- Score: 27.349232888627558
- Abstract: Current large language models (LLMs) often struggle to produce accurate responses on the first attempt for complex reasoning tasks like code generation. Prior research tackles this challenge by generating multiple candidate solutions and validating them with LLM-generated unit tests. The execution results of the unit tests serve as reward signals to identify correct solutions. However, because LLMs confidently make mistakes, these unit tests are not fully reliable, which diminishes the quality of the reward signal. Motivated by the observation that scaling the number of solutions improves LLM performance, we explore the impact of scaling unit tests to enhance reward signal quality. Our preliminary experiments reveal a positive correlation between the number of unit tests and reward signal quality, with greater benefits observed on more challenging problems. Based on these insights, we propose CodeRM-8B, a lightweight yet effective unit test generator that enables efficient and high-quality unit test scaling. Additionally, we implement a dynamic scaling mechanism that adapts the number of unit tests to problem difficulty, further improving efficiency. Experimental results show that our approach significantly improves performance across various models on three benchmarks (e.g., gains of 18.43% for Llama3-8B and 3.42% for GPT-4o-mini on HumanEval Plus).
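As a rough illustration of the reward-modeling idea (not the authors' released CodeRM-8B implementation), the sketch below ranks candidate programs by the fraction of generated unit tests they pass and allocates a larger test budget to problems estimated to be harder. The `run_tests` executor and the pass-rate-based difficulty proxy are assumptions for illustration.

```python
# Minimal sketch (not the authors' implementation) of using unit-test
# execution results as a reward signal to pick among candidate solutions.
# `run_tests` is a hypothetical sandboxed executor supplied by the caller.
from typing import Callable, List


def select_best_solution(
    candidates: List[str],                               # candidate programs from an LLM
    unit_tests: List[str],                               # LLM-generated unit tests
    run_tests: Callable[[str, List[str]], List[bool]],   # returns pass/fail per test
) -> str:
    """Rank candidate programs by the fraction of unit tests they pass (the reward)."""
    def reward(program: str) -> float:
        results = run_tests(program, unit_tests)
        return sum(results) / max(len(results), 1)

    return max(candidates, key=reward)


def dynamic_test_budget(estimated_pass_rate: float,
                        min_tests: int = 10, max_tests: int = 100) -> int:
    """Allocate more unit tests to harder problems (lower estimated pass rate)."""
    difficulty = 1.0 - estimated_pass_rate                # crude difficulty proxy
    return int(min_tests + difficulty * (max_tests - min_tests))
```

In this setup, spending the unit-test budget where the estimated pass rate is low mirrors the paper's observation that scaling tests helps most on harder problems.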
Related papers
- Unit Testing Past vs. Present: Examining LLMs' Impact on Defect Detection and Efficiency [2.4936576553283283]
The integration of Large Language Models (LLMs) into software engineering has shown potential to enhance productivity.
This paper investigates whether LLM support improves defect detection effectiveness during unit testing.
arXiv Detail & Related papers (2025-02-13T22:27:55Z)
- A Large-scale Empirical Study on Fine-tuning Large Language Models for Unit Testing [8.22619177301814]
Large Language Models (LLMs) have shown potential in various unit testing tasks.
We present a large-scale empirical study on fine-tuning LLMs for unit testing.
arXiv Detail & Related papers (2024-12-21T13:28:11Z)
- Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation [12.503002900186997]
Large Language Models (LLMs) have gained popularity for automated test case generation.
Because LLMs are trained on vast amounts of open-source code, they often generate test cases that do not adhere to best practices.
We propose Reinforcement Learning from Static Quality Metrics (RLSQM) to generate high-quality unit tests based on static analysis-based quality metrics.
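For illustration only, here is a minimal sketch of a static-analysis-based reward for a generated unit test, in the spirit of RLSQM; the specific metrics and weights are assumptions, not the paper's formulation.

```python
# Illustrative static-quality reward for a generated Python test.
# The chosen metrics and weights are assumptions, not RLSQM's exact design.
import ast


def static_quality_reward(test_source: str) -> float:
    """Score a test snippet on simple static quality signals, in [0, 1]."""
    try:
        tree = ast.parse(test_source)
    except SyntaxError:
        return 0.0                                    # unparseable test: no reward

    has_assert = any(isinstance(n, ast.Assert) for n in ast.walk(tree))
    has_print = any(
        isinstance(n, ast.Call) and getattr(n.func, "id", "") == "print"
        for n in ast.walk(tree)
    )

    reward = 0.0
    reward += 0.6 if has_assert else 0.0              # tests should assert something
    reward -= 0.2 if has_print else 0.0               # avoid print-based "checks"
    reward += 0.4 if len(test_source.splitlines()) <= 30 else 0.2  # prefer focused tests
    return max(0.0, min(1.0, reward))
```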
arXiv Detail & Related papers (2024-12-18T20:20:01Z)
- ViUniT: Visual Unit Tests for More Robust Visual Programming [104.55763189099125]
Even when models answer correctly, they produce incorrect programs 33% of the time.
We propose Visual Unit Testing (ViUniT), a framework to improve the reliability of visual programs by automatically generating unit tests.
arXiv Detail & Related papers (2024-12-12T01:36:18Z)
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
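A hedged sketch of the cross-checking idea behind CoT/PoT verification: accept an answer only when the natural-language reasoning and an executed program agree. The extraction and execution helpers below are simplified assumptions, not the paper's method.

```python
# Simplified cross-check between a chain-of-thought (CoT) answer and a
# program-of-thought (PoT) result; helpers are illustrative assumptions.
import re
from typing import Optional


def extract_cot_answer(cot_text: str) -> Optional[float]:
    """Pull the last number out of a chain-of-thought string."""
    numbers = re.findall(r"-?\d+\.?\d*", cot_text)
    return float(numbers[-1]) if numbers else None


def run_pot_program(program: str) -> Optional[float]:
    """Execute a program-of-thought that stores its result in `answer`."""
    scope: dict = {}
    try:
        exec(program, scope)                  # NOTE: sandbox this in practice
        return float(scope.get("answer"))
    except Exception:
        return None


def answers_agree(cot_text: str, program: str, tol: float = 1e-6) -> bool:
    """Collaborative verification: keep the answer only if both paths match."""
    a, b = extract_cot_answer(cot_text), run_pot_program(program)
    return a is not None and b is not None and abs(a - b) < tol
```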
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
- TestART: Improving LLM-based Unit Testing via Co-evolution of Automated Generation and Repair Iteration [7.833381226332574]
Large language models (LLMs) have demonstrated remarkable capabilities in generating unit test cases.
We propose TestART, a novel unit test generation method.
TestART improves LLM-based unit testing via co-evolution of automated generation and repair iteration.
arXiv Detail & Related papers (2024-08-06T10:52:41Z)
- Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems [59.72548591120689]
We introduce a new benchmark, SearchBench, containing 11 unique search problem types.
We show that even the most advanced LLMs fail to solve these problems end-to-end in text.
Instructing LLMs to generate code that solves the problem helps, but only slightly, e.g., GPT-4's performance rises to 11.7%.
arXiv Detail & Related papers (2024-06-18T00:44:58Z)
- On the Worst Prompt Performance of Large Language Models [93.13542053835542]
Performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts.
We introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries.
Experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance.
arXiv Detail & Related papers (2024-06-08T13:40:38Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value during training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
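As a minimal sketch of the per-sample adaptive label smoothing mechanism described above, written in plain PyTorch; the linear mapping from uncertainty to smoothing strength is an assumption, not the paper's exact recipe.

```python
# Per-sample adaptive label smoothing: more uncertain samples get a larger
# smoothing value. The uncertainty-to-smoothing mapping is an assumption.
import torch
import torch.nn.functional as F


def ual_loss(logits: torch.Tensor,           # (batch, num_classes)
             targets: torch.Tensor,          # (batch,) class indices
             uncertainty: torch.Tensor,      # (batch,) values in [0, 1]
             max_smoothing: float = 0.2) -> torch.Tensor:
    num_classes = logits.size(-1)
    smoothing = uncertainty * max_smoothing                # per-sample smoothing
    log_probs = F.log_softmax(logits, dim=-1)

    one_hot = F.one_hot(targets, num_classes).float()
    uniform = torch.full_like(one_hot, 1.0 / num_classes)
    # Blend the hard label with the uniform distribution, per sample.
    soft_targets = (1.0 - smoothing).unsqueeze(-1) * one_hot \
                   + smoothing.unsqueeze(-1) * uniform
    return -(soft_targets * log_probs).sum(dim=-1).mean()
```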
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE [62.13435256279566]
Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks.
However, their large size makes their inference slow and computationally expensive.
We show that instruction tuning with LITE enables intermediate layers to acquire 'good' generation ability without affecting the generation ability of the final layer.
arXiv Detail & Related papers (2023-10-28T04:07:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.