Related papers: DSTC: Direct Preference Learning with Only Self-Generated Tests and Code to Improve Code LMs

DSTC: Direct Preference Learning with Only Self-Generated Tests and Code to Improve Code LMs

URL: http://arxiv.org/abs/2411.13611v2
Date: Tue, 26 Nov 2024 00:45:22 GMT
Title: DSTC: Direct Preference Learning with Only Self-Generated Tests and Code to Improve Code LMs
Authors: Zhihan Liu, Shenao Zhang, Zhaoran Wang,
Abstract summary: We introduce underlinetextbfDirect Preference Learning with Only underlinetextbfSelf-Generated underlinetextbfTests and underlinetextbfCode (DSTC) DSTC uses only self-generated code snippets and tests to construct reliable preference pairs.
Score: 56.24431208419858
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Direct preference learning offers a promising and computation-efficient beyond supervised fine-tuning (SFT) for improving code generation in coding large language models (LMs). However, the scarcity of reliable preference data is a bottleneck for the performance of direct preference learning to improve the coding accuracy of code LMs. In this paper, we introduce \underline{\textbf{D}}irect Preference Learning with Only \underline{\textbf{S}}elf-Generated \underline{\textbf{T}}ests and \underline{\textbf{C}}ode (DSTC), a framework that leverages only self-generated code snippets and tests to construct reliable preference pairs such that direct preference learning can improve LM coding accuracy without external annotations. DSTC combines a minimax selection process and test-code concatenation to improve preference pair quality, reducing the influence of incorrect self-generated tests and enhancing model performance without the need for costly reward models. When applied with direct preference learning methods such as Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO), DSTC yields stable improvements in coding accuracy (pass@1 score) across diverse coding benchmarks, including HumanEval, MBPP, and BigCodeBench, demonstrating both its effectiveness and scalability for models of various sizes. This approach autonomously enhances code generation accuracy across LLMs of varying sizes, reducing reliance on expensive annotated coding datasets.

Related papers

Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks. However, improvement is plateauing due to the exhaustion of readily available high-quality data. We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z)
Direct Preference Optimization Using Sparse Feature-Level Constraints [47.15096507230884]
Feature-level constrained Preference Optimization is a novel method designed to simplify the alignment process while ensuring stability. Our approach enjoys efficiency by using sparse features activated in a well-trained sparse autoencoder and the quality of sequential KL divergence.
arXiv Detail & Related papers (2024-11-12T07:54:13Z)
CodeDPO: Aligning Code Models with Self Generated and Verified Source Code [52.70310361822519]
We propose CodeDPO, a framework that integrates preference learning into code generation to improve two key code preference factors: code correctness and efficiency. CodeDPO employs a novel dataset construction method, utilizing a self-generation-and-validation mechanism that simultaneously generates and evaluates code and test cases.
arXiv Detail & Related papers (2024-10-08T01:36:15Z)
Adversarial Contrastive Decoding: Boosting Safety Alignment of Large Language Models via Opposite Prompt Optimization [34.29833630422768]
Adversarial Contrastive Decoding (ACD) is an optimization-based framework to generate two opposite system prompts for prompt-based contrastive decoding. ACD achieves much better safety performance than previous model training-free decoding methods without sacrificing original generation ability.
arXiv Detail & Related papers (2024-06-24T15:51:30Z)
SAIL: Self-Improving Efficient Online Alignment of Large Language Models [56.59644677997827]
Reinforcement Learning from Human Feedback is a key method for aligning large language models with human preferences. Recent literature has focused on designing online RLHF methods but still lacks a unified conceptual formulation. Our approach significantly improves alignment performance on open-sourced datasets with minimal computational overhead.
arXiv Detail & Related papers (2024-06-21T18:05:35Z)
Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models [54.14602121129874]
We introduce AutoIF, the first scalable and reliable method for automatically generating instruction-following training data. AutoIF transforms the validation of instruction-following data quality into code verification.
arXiv Detail & Related papers (2024-06-19T13:29:53Z)
Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback [70.32795295142648]
Linear alignment is a novel algorithm that aligns language models with human preferences in one single inference step. Experiments on both general and personalized preference datasets demonstrate that linear alignment significantly enhances the performance and efficiency of LLM alignment.
arXiv Detail & Related papers (2024-01-21T10:46:23Z)
Parameter-Efficient Tuning Helps Language Model Alignment [57.27390187540737]
Previous works mainly adopt reinforcement learning (RLHF) and direct preference optimization (DPO) with human feedback for alignment. Controllable generation offers more flexibility with regard to data format. Our approach, alignMEnt with parameter-Efficient Tuning (MEET), improves the quality of control tokens.
arXiv Detail & Related papers (2023-10-01T23:27:14Z)
MBR and QE Finetuning: Training-time Distillation of the Best and Most Expensive Decoding Methods [13.56549575939123]
We propose finetuning and QE finetuning to mitigate the model-perplexity-vs-quality mismatch. We show that even with self-training, these finetuning methods significantly outperform the base model. These findings suggest new ways to leverage monolingual data to achieve improvements in model quality that are on par with, or even exceed, improvements from human-curated data.
arXiv Detail & Related papers (2023-09-19T23:39:07Z)
Cost-Sensitive Self-Training for Optimizing Non-Decomposable Metrics [9.741019160068388]
We introduce the Cost-Sensitive Self-Training (CSST) framework which generalizes the self-training-based methods for optimizing non-decomposable metrics. Our results demonstrate that CSST achieves an improvement over the state-of-the-art in majority of the cases across datasets and objectives.
arXiv Detail & Related papers (2023-04-28T10:31:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.