Enhancing the Robustness of LLM-Generated Code: Empirical Study and Framework
- URL: http://arxiv.org/abs/2503.20197v2
- Date: Tue, 01 Apr 2025 06:06:38 GMT
- Title: Enhancing the Robustness of LLM-Generated Code: Empirical Study and Framework
- Authors: Zike Li, Mingwei Liu, Anji Li, Kaifeng He, Yanlin Wang, Xin Peng, Zibin Zheng
- Abstract summary: RobGen is a framework designed to enhance code robustness without requiring model retraining. RobGen reduces the proportion of less robust model-generated code by 20.0%.
- Score: 25.793118619876513
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ensuring the robustness of code generated by large language models (LLMs) is crucial for real-world reliability. However, existing evaluations predominantly focus on correctness, often neglecting key robustness concerns such as missing input validation and insufficient error handling. In this paper, we present the first empirical study on the robustness of LLM-generated code. We introduce novel robustness metrics and analyze four state-of-the-art code LLMs, revealing that, on average, 43.1% of their generated code is less robust than human-written counterparts. Notably, over 90% of robustness deficiencies stem from missing conditional checks, with 70% of these omissions occurring in the first line of code. Additionally, in 69% of cases where a conditional statement is necessary but absent, the "if" token still ranks third or higher in the model's predicted token probabilities, indicating an implicit recognition of control structures. Building on these findings, we propose RobGen, a framework designed to enhance code robustness without requiring model retraining. RobGen leverages two model-agnostic techniques: RobGen-Adj, which dynamically adjusts token probabilities during decoding to encourage the inclusion of control structures, and RobGen-Ins, which improves generated code by inserting missing conditionals after generation. Experimental results demonstrate that RobGen reduces the proportion of less robust model-generated code by 20.0%, significantly enhancing code reliability across diverse tasks. As a lightweight and adaptable solution, RobGen effectively mitigates robustness challenges in LLM-generated code. All code and data are available at https://github.com/SYSUSELab/RobGen.
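The decoding-time idea described in the abstract (raising the probability of a control-structure token such as "if" when the model already ranks it highly) can be illustrated with a toy sketch. This is not the RobGen-Adj implementation: the function names, the additive `bias`, and the example scores are all hypothetical, and only the "third or higher" rank threshold is taken from the abstract.

```python
import math


def rank_of(token, logits):
    """Return the 1-based rank of `token` when sorted by descending score."""
    ordered = sorted(logits, key=logits.get, reverse=True)
    return ordered.index(token) + 1


def boost_control_token(logits, token="if", max_rank=3, bias=2.0):
    """Return a copy of `logits` with `bias` added to `token`.

    The boost is applied only when `token` already ranks within `max_rank`.
    `bias` is an illustrative assumption, not a value from the paper.
    """
    adjusted = dict(logits)
    if token in adjusted and rank_of(token, adjusted) <= max_rank:
        adjusted[token] += bias
    return adjusted


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits.values())
    exps = {t: math.exp(v - peak) for t, v in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


# Made-up next-token scores at the first line of a function body.
logits = {"return": 2.0, "for": 1.2, "if": 1.0, "x": 0.1}
adjusted = boost_control_token(logits)
best = max(adjusted, key=adjusted.get)
print(rank_of("if", logits), best)  # prints: 3 if
```

In this toy case "if" starts in third place and becomes the top choice after the boost. A token that ranks below the threshold is left untouched, so the adjustment only acts where the model already shows some preference for a conditional.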
Related papers
- ReLoop: Structured Modeling and Behavioral Verification for Reliable LLM-Based Optimization [6.572539312871392]
Large language models (LLMs) can translate natural language into optimization code, but silent failures pose a critical risk.
We introduce ReLoop, addressing silent failures from two complementary directions.
arXiv Detail & Related papers (2026-02-17T20:20:33Z)
- CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning [57.24524263804788]
Code verifiers play a critical role in post-verification for LLM-based code generation.
Existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency.
We show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples.
arXiv Detail & Related papers (2026-01-30T10:33:29Z)
- Improving LLM-Assisted Secure Code Generation through Retrieval-Augmented-Generation and Multi-Tool Feedback [1.1017250479834206]
Large Language Models (LLMs) can generate code but often introduce security vulnerabilities, logical inconsistencies, and compilation errors.
We propose a retrieval-augmented, multi-tool repair workflow in which a single code-generating LLM iteratively refines its outputs.
arXiv Detail & Related papers (2026-01-01T23:34:00Z)
- CodeGrad: Integrating Multi-Step Verification with Gradient-Based LLM Refinement [12.792149709662874]
CodeGrad introduces a principled framework that integrates rigorous verification techniques directly into an iterative generation loop.
It treats code as a differentiable variable, converting structured feedback and mathematical constraints into a textual pseudo-gradient.
We evaluate CodeGrad on the HumanEval, HumanEval+, and LiveCodeBench benchmarks.
arXiv Detail & Related papers (2025-08-12T22:03:54Z)
- LLMxCPG: Context-Aware Vulnerability Detection Through Code Property Graph-Guided Large Language Models [2.891351178680099]
This paper presents a novel framework integrating Code Property Graphs (CPG) with Large Language Models (LLM) for robust vulnerability detection.
Our approach's ability to provide a more concise and accurate representation of code snippets enables the analysis of larger code segments.
Empirical evaluation demonstrates LLMxCPG's effectiveness across verified datasets, achieving 15-40% improvements in F1-score over state-of-the-art baselines.
arXiv Detail & Related papers (2025-07-22T13:36:33Z)
- A Mixture of Linear Corrections Generates Secure Code [20.94236753015922]
Large language models (LLMs) have become proficient at sophisticated code-generation tasks, yet remain ineffective at reliably detecting or avoiding code vulnerabilities.
We find that current LLMs encode precise internal representations that distinguish vulnerable from secure code.
We develop an inference-time steering technique that subtly modulates the model's token-generation probabilities through a mixture of corrections.
arXiv Detail & Related papers (2025-07-13T06:27:33Z)
- Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code [76.80306464249217]
We propose TeaR, which aims at teaching LLMs to reason better.
TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks.
We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning.
arXiv Detail & Related papers (2025-07-10T07:34:05Z)
- QiMeng-CodeV-R1: Reasoning-Enhanced Verilog Generation [51.393569044134445]
Large language models (LLMs) trained via reinforcement learning with verifiable reward (RLVR) have achieved breakthroughs on tasks with explicit, automatable verification.
Extending RLVR to automatically generating hardware description languages (HDLs) like Verilog from natural-language (NL) specifications, however, poses three key challenges.
We introduce CodeV-R1, an RLVR framework for training Verilog generation LLMs.
arXiv Detail & Related papers (2025-05-30T03:51:06Z)
- Enhancing LLM-Based Code Generation with Complexity Metrics: A Feedback-Driven Approach [6.289275189295223]
We investigate the relationship between code complexity and the success of code generated by Large Language Models.
We propose an iterative feedback method, where LLMs are prompted to generate correct code based on complexity metrics from previous failed outputs.
Experimental results show that our approach yields notable improvements, particularly with a smaller LLM.
arXiv Detail & Related papers (2025-05-29T19:06:14Z)
- Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality.
We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z)
- rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset [13.309261291558146]
rStar-Coder is a large-scale, verified dataset of 418K code problems, 580K long-reasoning solutions, and rich test cases of varying difficulty.
On LiveCodeBench, rStar-Coder improves Qwen2.5-7B from 17.4% to an impressive 57.3%, and Qwen2.5-14B from 23.3% to 62.5%, surpassing o3-mini (low) by 3.1%.
arXiv Detail & Related papers (2025-05-27T15:00:57Z)
- CodeCrash: Stress Testing LLM Reasoning under Structural and Semantic Perturbations [36.60702578561009]
Large Language Models (LLMs) have recently demonstrated strong capabilities in code-related tasks, yet their robustness in code comprehension and reasoning remains insufficiently explored.
We present CodeCrash, a comprehensive stress-testing benchmark comprising 1,279 questions from two established datasets.
We systematically evaluate 17 LLMs across input and output prediction tasks using direct and Chain-of-Thought prompting approaches.
arXiv Detail & Related papers (2025-04-19T00:40:28Z)
- Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks.
However, improvement is plateauing due to the exhaustion of readily available high-quality data.
We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z)
- Cracks in The Stack: Hidden Vulnerabilities and Licensing Risks in LLM Pre-Training Datasets [5.0091559832205155]
We propose an automated source code autocuration technique to improve the quality of training data.
We evaluate this method using The Stack v2 dataset, and find that 17% of the code versions in the dataset have newer versions.
We expect our results to inspire process improvements for automated data curation, with the potential to enhance the reliability of outputs generated by AI tools.
arXiv Detail & Related papers (2025-01-05T18:54:25Z)
- On the Adversarial Robustness of Instruction-Tuned Large Language Models for Code [4.286327408435937]
We assess the impact of diverse input challenges on the functionality and correctness of generated code using rigorous metrics and established benchmarks.
Open-source models demonstrate an increased susceptibility to input perturbations, resulting in declines in functional correctness ranging from 12% to 34%.
In contrast, commercial models demonstrate relatively greater resilience, with performance degradation ranging from 3% to 24%.
arXiv Detail & Related papers (2024-11-29T07:00:47Z)
- HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data [60.75578581719921]
Large language models (LLMs) have shown great potential for automatic code generation.
Recent studies highlight that much LLM-generated code contains serious security vulnerabilities.
We introduce HexaCoder, a novel approach to enhance the ability of LLMs to generate secure code.
arXiv Detail & Related papers (2024-09-10T12:01:43Z)
- Understanding Defects in Generated Codes by Language Models [0.669087470775851]
This study categorizes and analyzes 367 identified defects from code snippets generated by Large Language Models.
Error categories indicate key areas where LLMs frequently fail, underscoring the need for targeted improvements.
This paper implemented five prompt engineering techniques, including Scratchpad Prompting, Program of Thoughts Prompting, Chain-of-Thought Prompting, Chain-of-Thought Prompting, and Structured Chain-of-Thought Prompting.
arXiv Detail & Related papers (2024-08-23T21:10:09Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions.
We develop a taxonomy of bugs for incorrect code that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- M2CVD: Enhancing Vulnerability Semantic through Multi-Model Collaboration for Code Vulnerability Detection [52.4455893010468]
Large Language Models (LLMs) have strong capabilities in code comprehension, but fine-tuning costs and semantic alignment issues limit their project-specific optimization.
Code models such as CodeBERT are easy to fine-tune, but they often struggle to learn vulnerability semantics from complex code languages.
This paper introduces the Multi-Model Collaborative Vulnerability Detection approach (M2CVD) to improve the detection accuracy of code models.
arXiv Detail & Related papers (2024-06-10T00:05:49Z)
- Contrastive Decoding Improves Reasoning in Large Language Models [55.16503283583076]
We show that Contrastive Decoding achieves large out-of-the-box improvements over greedy decoding on a variety of reasoning tasks.
We show that Contrastive Decoding leads LLaMA-65B to outperform LLaMA 2, GPT-3.5 and PaLM 2-L on the HellaSwag commonsense reasoning benchmark.
arXiv Detail & Related papers (2023-09-17T00:29:32Z)
- ReCode: Robustness Evaluation of Code Generation Models [90.10436771217243]
We propose ReCode, a comprehensive robustness evaluation benchmark for code generation models.
We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format.
With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt.
arXiv Detail & Related papers (2022-12-20T14:11:31Z)
- Robust Encodings: A Framework for Combating Adversarial Typos [85.70270979772388]
NLP systems are easily fooled by small perturbations of inputs.
Existing procedures to defend against such perturbations provide guaranteed robustness to worst-case attacks.
We introduce robust encodings (RobEn) that confer guaranteed robustness without making compromises on model architecture.
arXiv Detail & Related papers (2020-05-04T01:28:18Z)