CodeGrad: Integrating Multi-Step Verification with Gradient-Based LLM Refinement
- URL: http://arxiv.org/abs/2508.10059v2
- Date: Tue, 02 Sep 2025 20:11:20 GMT
- Title: CodeGrad: Integrating Multi-Step Verification with Gradient-Based LLM Refinement
- Authors: Yueke Zhang, Yifan Zhang, Kevin Leach, Yu Huang
- Abstract summary: CodeGrad introduces a principled framework that integrates rigorous verification techniques directly into an iterative generation loop.
It treats code as a differentiable variable, converting structured feedback and mathematical constraints into a textual pseudo-gradient.
We evaluate CodeGrad on the HumanEval, HumanEval+, and LiveCodeBench benchmarks.
- Score: 12.792149709662874
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, they often produce solutions that lack guarantees of correctness, robustness, and efficiency. This limitation is particularly acute in domains requiring strict constraints. CodeGrad introduces a principled framework that integrates rigorous verification techniques directly into an iterative LLM-based generation loop. It uniquely treats code as a differentiable variable, converting structured feedback and mathematical constraints into a textual pseudo-gradient. This gradient guides the model to iteratively refine solutions, ensuring they are not only functional but also robust and mathematically justified. We evaluate CodeGrad on the HumanEval, HumanEval+, and LiveCodeBench benchmarks. Our implementation outperforms strong baselines, achieving an absolute improvement of up to 27% on HumanEval and a 41% relative improvement on the challenging LiveCodeBench V6. CodeGrad generates mathematically justified code that is robust and efficient, paving the way for reliable AI-assisted software development in high-stakes applications.
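To make the abstract's loop concrete, below is a minimal sketch of what a verify-then-refine cycle with a textual pseudo-gradient might look like. This is an assumption-laden illustration, not the paper's implementation: `llm` stands in for any text-completion call, and `run_checks`, `textual_gradient`, and `refine` are hypothetical helper names chosen for this example.

```python
# Hypothetical sketch of a CodeGrad-style refinement loop.
# `llm` is a placeholder for any prompt -> completion function.
from typing import Callable, List

def run_checks(code: str, tests: List[str]) -> List[str]:
    """Run verification checks; collect structured failure messages."""
    failures = []
    for test in tests:
        try:
            scope: dict = {}
            exec(code, scope)   # define the candidate solution
            exec(test, scope)   # assert-style check against it
        except Exception as exc:
            failures.append(f"{test!r} failed: {exc}")
    return failures

def textual_gradient(code: str, failures: List[str],
                     llm: Callable[[str], str]) -> str:
    """Turn verification feedback into a 'gradient': a directed
    critique describing how the code should change."""
    prompt = (
        "Given the code and its verification failures, describe the "
        "minimal changes needed to fix them.\n\n"
        f"Code:\n{code}\n\nFailures:\n" + "\n".join(failures)
    )
    return llm(prompt)

def refine(spec: str, tests: List[str],
           llm: Callable[[str], str], steps: int = 4) -> str:
    """Gradient-descent-style loop: verify, critique, rewrite."""
    code = llm(f"Write a Python function for this spec:\n{spec}")
    for _ in range(steps):
        failures = run_checks(code, tests)
        if not failures:  # all checks pass: treat as converged
            break
        grad = textual_gradient(code, failures, llm)
        code = llm(f"Apply this critique to the code:\n{grad}\n\nCode:\n{code}")
    return code
```

The analogy to gradient descent is that verification failures play the role of a loss signal, the critique acts as the "gradient", and the rewrite is the update step, with convergence declared once all checks pass.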
Related papers
- TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models [26.385183692191873]
Large Language Models (LLMs) are changing the coding paradigm, yet synthesizing algorithmically sophisticated and robust code remains a critical challenge.
We propose Test-driven and cApability-adaptive cuRriculum reinfOrcement fineTuning (TAROT) to address this need.
arXiv Detail & Related papers (2026-02-17T09:29:18Z) - ShortCoder: Knowledge-Augmented Syntax Optimization for Token-Efficient Code Generation [27.9837392531619]
We propose a knowledge-infused framework named ShortCoder to optimize code generation efficiency.
ShortCoder consistently outperforms state-of-the-art methods on HumanEval.
arXiv Detail & Related papers (2026-01-14T18:57:31Z) - Readability-Robust Code Summarization via Meta Curriculum Learning [53.44612630063336]
In the real world, code is often poorly structured or obfuscated, significantly degrading model performance.
We propose RoFTCodeSum, a novel fine-tuning method that enhances the robustness of code summarization against poorly readable code.
arXiv Detail & Related papers (2026-01-09T02:38:24Z) - Beyond Code Similarity: Benchmarking the Plausibility, Efficiency, and Complexity of LLM-Generated Smart Contracts [3.3672086394822762]
LLMs produce code with high semantic similarity to real contracts.
Only 20% to 26% of zero-shot generations behave identically to ground-truth implementations under testing.
Retrieval-Augmented Generation markedly improves performance, boosting functional correctness by up to 45%.
arXiv Detail & Related papers (2025-11-20T10:47:59Z) - Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny [68.00108157244952]
Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes are neither reliable nor scalable.
A promising yet largely uncharted alternative is formal language-based reasoning.
Grounding LLMs in rigorous formal systems where generative models operate in formal language spaces (e.g., Dafny) enables the automatic and mathematically provable verification of their reasoning processes and outcomes.
arXiv Detail & Related papers (2025-07-22T08:13:01Z) - Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality.
We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z) - Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks.
However, improvement is plateauing due to the exhaustion of readily available high-quality data.
We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z) - LLM4EFFI: Leveraging Large Language Models to Enhance Code Efficiency and Correctness [38.399282089600284]
Large Language Models (LLMs) have demonstrated impressive performance in code generation.
LLM4EFFI (Large Language Model for Code Efficiency) is a novel framework that enables LLMs to generate code that balances both efficiency and correctness.
arXiv Detail & Related papers (2025-02-17T07:01:18Z) - ConAIR: Consistency-Augmented Iterative Interaction Framework to Enhance the Reliability of Code Generation [17.68163468068264]
We propose ConAIR, a Consistency-Augmented Iterative Interaction Framework to enhance the reliability of code generation.
We show that with minimal human effort, performance can be significantly enhanced.
arXiv Detail & Related papers (2024-11-23T15:26:24Z) - CodeDPO: Aligning Code Models with Self Generated and Verified Source Code [52.70310361822519]
We propose CodeDPO, a framework that integrates preference learning into code generation to improve two key code preference factors: code correctness and efficiency.
CodeDPO employs a novel dataset construction method, utilizing a self-generation-and-validation mechanism that simultaneously generates and evaluates code and test cases.
arXiv Detail & Related papers (2024-10-08T01:36:15Z) - Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z) - ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness? [12.862825053595934]
ECCO is a benchmark for evaluating program efficiency via two paradigms: natural language (NL) based code generation and history-based code editing.
We find that adding execution information often helps maintain functional correctness, while NL feedback is more effective at improving efficiency.
arXiv Detail & Related papers (2024-07-19T05:47:40Z) - Exploring Data-Efficient Adaptation of Large Language Models for Code Generation [64.5583894165813]
We propose a novel adaptation approach named DEED, which stands for Data-Efficient adaptation with Error-Driven learning for code generation.
Experimental results show that, compared to other mainstream fine-tuning approaches, DEED achieves superior performance with only a small amount of training data.
arXiv Detail & Related papers (2024-02-29T16:09:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.