Related papers: Operational Robustness of LLMs on Code Generation

URL: http://arxiv.org/abs/2602.18800v1
Date: Sat, 21 Feb 2026 11:21:13 GMT
Title: Operational Robustness of LLMs on Code Generation
Authors: Debalina Ghosh Paul, Hong Zhu, Ian Bayley
Abstract summary: It is now common practice in software development for large language models (LLMs) to be used to generate program code. This paper is concerned in particular with how sensitive LLMs are to variations in descriptions of the coding tasks. Existing techniques for evaluating this robustness are unsuitable for code generation because the input data space of natural language descriptions is discrete.
Score: 2.9232837969697965
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: It is now common practice in software development for large language models (LLMs) to be used to generate program code. It is desirable to evaluate the robustness of LLMs for this usage. This paper is concerned in particular with how sensitive LLMs are to variations in descriptions of the coding tasks. However, existing techniques for evaluating this robustness are unsuitable for code generation because the input data space of natural language descriptions is discrete. To address this problem, we propose a robustness evaluation method called scenario domain analysis, which aims to find the expected minimal change in the natural language descriptions of coding tasks that would cause the LLMs to produce incorrect outputs. We have formally proved the theoretical properties of the method and also conducted extensive experiments to evaluate the robustness of four state-of-the-art LLMs: Gemini-pro, Codex, Llama2 and Falcon 7B, and have found that we are able to rank these with confidence from best to worst. Moreover, we have also studied how robustness varies in different scenarios, including the variations with the topic of the coding task and with the complexity of its sample solution, and found that robustness is lower for more complex tasks and also lower for more advanced topics, such as multi-threading and data structures.

Related papers

Code Fingerprints: Disentangled Attribution of LLM-Generated Code [7.515488307576106]
We study the problem of model-level code attribution, which aims to determine the source LLM responsible for generated code. We propose the Disentangled Code Attribution Network (DCAN), which separates Source-Agnostic semantic information from Source-Specific stylistic representations. We construct the first large-scale benchmark dataset comprising code generated by four widely used Large Language Models (LLMs) across four programming languages.
arXiv Detail & Related papers (2026-03-04T15:58:36Z)
Task-Awareness Improves LLM Generations and Uncertainty [48.857040212979484]
Bayes-optimal responses consistently outperform standard decoding methods like beam search. Our decision-theoretic framework is applicable to any problem that admits a latent response structure.
arXiv Detail & Related papers (2026-01-29T10:16:23Z)
CodeSimpleQA: Scaling Factuality in Code Large Language Models [55.705748501461294]
We present CodeSimpleQA, a comprehensive benchmark designed to evaluate the factual accuracy of code LLMs in answering code-related questions. We also create CodeSimpleQA-Instruct, a large-scale instruction corpus with 66M samples, and develop a post-training framework combining supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-12-22T14:27:17Z)
Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications [0.6813925418351435]
Large language models (LLMs) have become essential tools in software development, widely used for requirements engineering, code generation and review tasks. In this paper, we uncover a systematic failure of LLMs in evaluating whether code aligns with natural language requirements. Our results reveal that LLMs frequently misclassify correct code implementations as either "not satisfying requirements" or containing potential defects.
arXiv Detail & Related papers (2025-08-17T13:07:26Z)
Is LLM-Generated Code More Maintainable & Reliable than Human-Written Code? [4.893345190925178]
This study compares the internal quality attributes of LLM-generated and human-written code. Our analysis shows that LLM-generated code has fewer bugs and requires less effort to fix them overall.
arXiv Detail & Related papers (2025-08-01T15:17:34Z)
IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs. The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
Guided Code Generation with LLMs: A Multi-Agent Framework for Complex Code Tasks [1.9198713957364215]
Large Language Models (LLMs) have shown remarkable capabilities in code generation tasks. They face significant limitations in handling complex, long-context programming challenges. This paper introduces a novel agentic framework for "guided code generation".
arXiv Detail & Related papers (2025-01-11T19:21:53Z)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [92.62952504133926]
This study evaluated the performance of three leading closed-source LLMs and six popular open-source LLMs on three commonly used benchmarks. We developed a taxonomy of bugs for incorrect codes and analyzed the root cause for common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
A Thorough Examination of Decoding Methods in the Era of LLMs [72.65956436513241]
Decoding methods play an indispensable role in converting language models from next-token predictors into practical task solvers. This paper provides a comprehensive and multifaceted analysis of various decoding methods within the context of large language models. Our findings reveal that decoding method performance is notably task-dependent and influenced by factors such as alignment, model size, and quantization.
arXiv Detail & Related papers (2024-02-10T11:14:53Z)
If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code). Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
Testing LLMs on Code Generation with Varying Levels of Prompt Specificity [0.0]
Large language models (LLMs) have demonstrated unparalleled prowess in mimicking human-like text generation and processing. The potential to transform natural language prompts into executable code promises a major shift in software development practices.
arXiv Detail & Related papers (2023-11-10T23:41:41Z)
Benchmarking and Explaining Large Language Model-based Code Generation: A Causality-Centric Approach [12.214585409361126]
Large language model (LLM)-based code generation is a complex and powerful black-box model. We propose a novel causal graph-based representation of the prompt and the generated code. We illustrate the insights that our framework can provide by studying over 3 popular LLMs with over 12 prompt adjustment strategies.
arXiv Detail & Related papers (2023-10-10T14:56:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.