STELP: Secure Transpilation and Execution of LLM-Generated Programs
- URL: http://arxiv.org/abs/2601.05467v3
- Date: Thu, 15 Jan 2026 07:51:03 GMT
- Title: STELP: Secure Transpilation and Execution of LLM-Generated Programs
- Authors: Swapnil Shinde, Sahil Wadhwa, Andy Luo, Akshay Gupta, Mohammad Shahed Sorower,
- Abstract summary: Large Language Models (LLMs) can solve software development-related tasks such as code generation.<n>LLMs generated code could be unstable or erroneous and contain vulnerabilities that could lead to widespread system malfunctions.<n>This paper proposes a Secure Transpiler and Executor of LLM-Generated Program (STELP) capable of executing LLM-generated code in a controlled and safe manner.
- Score: 2.986494009382113
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Rapid evolution of Large Language Models (LLMs) has achieved major advances in reasoning, planning, and function-calling capabilities. Multi-agentic collaborative frameworks using such LLMs place them at the center of solving software development-related tasks such as code generation. However, direct use of LLM generated code in production software development systems is problematic. The code could be unstable or erroneous and contain vulnerabilities such as data poisoning, malicious attacks, and hallucinations that could lead to widespread system malfunctions. This prohibits the adoption of LLM generated code in production AI systems where human code reviews and traditional secure testing tools are impractical or untrustworthy. In this paper, we discuss safety and reliability problems with the execution of LLM generated code and propose a Secure Transpiler and Executor of LLM-Generated Program (STELP), capable of executing LLM-generated code in a controlled and safe manner. STELP secures autonomous production AI systems involving code generation, filling the critical void left by the impracticality or limitations of traditional secure testing methodologies and human oversight. This includes applications such as headless code generation-execution and LLMs that produce executable code snippets as an action plan to be executed in real time. We contribute a human-validated dataset of insecure code snippets and benchmark our approach on publicly available datasets for correctness, safety, and latency. Our results demonstrate that our approach outperforms an existing method by a significant margin, particularly in its ability to safely execute risky code snippets. Warning: This paper contains malicious code snippets that should be run with caution.
Related papers
- Secure Code Generation via Online Reinforcement Learning with Vulnerability Reward Model [60.60587869092729]
Large language models (LLMs) are increasingly used in software development, yet their tendency to generate insecure code remains a major barrier to real-world deployment.<n>We propose SecCoderX, an online reinforcement learning framework for functionality-preserving secure code generation.
arXiv Detail & Related papers (2026-02-07T07:42:07Z) - TypePilot: Leveraging the Scala Type System for Secure LLM-generated Code [46.747768845221735]
Large language Models (LLMs) have shown remarkable proficiency in code generation tasks across various programming languages.<n>Their outputs often contain subtle but critical vulnerabilities, posing significant risks when deployed in security-sensitive or mission-critical systems.<n>This paper introduces TypePilot, an agentic AI framework designed to enhance the security and robustness of LLM-generated code.
arXiv Detail & Related papers (2025-10-13T08:44:01Z) - A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code [49.009041488527544]
A.S.E is a repository-level evaluation benchmark for assessing the security of AI-generated code.<n>Current large language models (LLMs) still struggle with secure coding.<n>A larger reasoning budget does not necessarily lead to better code generation.
arXiv Detail & Related papers (2025-08-25T15:11:11Z) - Large Language Model Unlearning for Source Code [65.42425213605114]
PROD is a novel unlearning approach that enables LLMs to forget undesired code content while preserving their code generation capabilities.<n>Our evaluation demonstrates that PROD achieves superior balance between forget quality and model utility compared to existing unlearning approaches.
arXiv Detail & Related papers (2025-06-20T16:27:59Z) - Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality.<n>We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z) - The Hidden Risks of LLM-Generated Web Application Code: A Security-Centric Evaluation of Code Generation Capabilities in Large Language Models [0.769672852567215]
This paper uses predefined security parameters to evaluate the security compliance of LLM-generated code across multiple models.<n>The analysis reveals critical vulnerabilities in authentication mechanisms, session management, input validation and HTTP security headers.<n>Our findings underscore that human expertise is crucial to ensure secure software deployment or review of LLM-generated code.
arXiv Detail & Related papers (2025-04-29T10:23:11Z) - Smoke and Mirrors: Jailbreaking LLM-based Code Generation via Implicit Malicious Prompts [5.718926328180089]
This paper introduces a jailbreaking approach, CodeJailbreaker, designed to uncover safety concerns in code generation.<n> Experiments on the recently-released RMCBench benchmark demonstrate that CodeJailbreaker markedly surpasses the conventional jailbreaking strategy.
arXiv Detail & Related papers (2025-03-23T06:06:12Z) - HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data [60.75578581719921]
Large language models (LLMs) have shown great potential for automatic code generation.
Recent studies highlight that many LLM-generated code contains serious security vulnerabilities.
We introduce HexaCoder, a novel approach to enhance the ability of LLMs to generate secure codes.
arXiv Detail & Related papers (2024-09-10T12:01:43Z) - Can We Trust Large Language Models Generated Code? A Framework for In-Context Learning, Security Patterns, and Code Evaluations Across Diverse LLMs [2.7138982369416866]
Large Language Models (LLMs) have revolutionized automated code generation in software engineering.
However, concerns have arisen regarding the security and quality of the generated code.
Our research aims to tackle these issues by introducing a framework for secure behavioral learning of LLMs.
arXiv Detail & Related papers (2024-06-18T11:29:34Z) - Can LLMs Patch Security Issues? [1.3299507495084417]
Large Language Models (LLMs) have shown impressive proficiency in code generation.
LLMs share a weakness with their human counterparts: producing code that inadvertently has security vulnerabilities.
We propose Feedback-Driven Security Patching (FDSP), where LLMs automatically refine generated, vulnerable code.
arXiv Detail & Related papers (2023-11-13T08:54:37Z) - SALLM: Security Assessment of Generated Code [0.5137309756089941]
This paper describes SALLM, a framework to benchmark Large Language Models' abilities to generate secure code systematically.
The framework has three major components: a novel dataset of security-centric Python prompts, assessment techniques to evaluate the generated code, and novel metrics to evaluate the models' performance from the perspective of secure code generation.
arXiv Detail & Related papers (2023-11-01T22:46:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.