Human-Written vs. AI-Generated Code: A Large-Scale Study of Defects, Vulnerabilities, and Complexity
- URL: http://arxiv.org/abs/2508.21634v1
- Date: Fri, 29 Aug 2025 13:51:28 GMT
- Title: Human-Written vs. AI-Generated Code: A Large-Scale Study of Defects, Vulnerabilities, and Complexity
- Authors: Domenico Cotroneo, Cristina Improta, Pietro Liguori
- Abstract summary: We present a large-scale comparison of code authored by human developers and three state-of-the-art LLMs, i.e., ChatGPT, DeepSeek-Coder, and Qwen-Coder. Our evaluation spans over 500k code samples in two widely used languages, Python and Java, classifying defects via Orthogonal Defect Classification and security vulnerabilities using the Common Weakness Enumeration. We find that AI-generated code is generally simpler and more repetitive, yet more prone to unused constructs and hardcoded debugging, while human-written code exhibits greater structural complexity and a higher concentration of maintainability issues.
- Score: 4.478789600295493
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As AI code assistants become increasingly integrated into software development workflows, understanding how their code compares to human-written programs is critical for ensuring reliability, maintainability, and security. In this paper, we present a large-scale comparison of code authored by human developers and three state-of-the-art LLMs, i.e., ChatGPT, DeepSeek-Coder, and Qwen-Coder, on multiple dimensions of software quality: code defects, security vulnerabilities, and structural complexity. Our evaluation spans over 500k code samples in two widely used languages, Python and Java, classifying defects via Orthogonal Defect Classification and security vulnerabilities using the Common Weakness Enumeration. We find that AI-generated code is generally simpler and more repetitive, yet more prone to unused constructs and hardcoded debugging, while human-written code exhibits greater structural complexity and a higher concentration of maintainability issues. Notably, AI-generated code also contains more high-risk security vulnerabilities. These findings highlight the distinct defect profiles of AI- and human-authored code and underscore the need for specialized quality assurance practices in AI-assisted programming.
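The authors' pipeline itself is not reproduced in this listing, but a minimal sketch of the kind of per-sample measurement the abstract describes, structural complexity plus security findings for a Python sample, might look as follows. The choice of the radon and bandit packages, the file names, and the thresholds are illustrative assumptions, not the authors' tooling:

```python
# Minimal sketch (assumed tooling, not the authors' pipeline): profile one
# Python sample's structural complexity with radon and collect bandit
# security findings. Assumes `pip install radon bandit`.
import json
import subprocess

from radon.complexity import cc_visit
from radon.metrics import mi_visit


def complexity_profile(source: str) -> dict:
    """Mean cyclomatic complexity and maintainability index of one sample."""
    blocks = cc_visit(source)  # one block per function/method/class
    mean_cc = sum(b.complexity for b in blocks) / len(blocks) if blocks else 0.0
    # second arg: treat multi-line strings as comments when computing MI
    return {"mean_cyclomatic": mean_cc, "maintainability": mi_visit(source, True)}


def security_findings(path: str) -> list:
    """Run bandit on a file and return its JSON issue list (may be empty)."""
    # bandit exits non-zero when issues are found, so no check=True here
    out = subprocess.run(["bandit", "-f", "json", path],
                         capture_output=True, text=True)
    return json.loads(out.stdout).get("results", [])


if __name__ == "__main__":
    with open("sample.py") as f:  # hypothetical sample file
        print(complexity_profile(f.read()))
    for issue in security_findings("sample.py"):
        print(issue["issue_severity"], issue["test_id"], issue["issue_text"])
```

Aggregating such per-sample profiles across the human-authored and LLM-generated corpora would yield the kind of distribution-level comparison the abstract reports.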
Related papers
- AI builds, We Analyze: An Empirical Study of AI-Generated Build Code Quality [0.0]
The rapid adoption of AI coding agents for software development has raised important questions about the quality and maintainability of the code they produce. This data mining challenge focuses on AIDev, the first large-scale, openly available dataset capturing agentic pull requests from real-world GitHub repositories. We identified 364 maintainability and security-related build smells across varying severity levels, indicating that AI-generated build code can introduce quality issues.
arXiv Detail & Related papers (2026-01-23T15:40:28Z) - Code for Machines, Not Just Humans: Quantifying AI-Friendliness with Code Health Metrics [6.108440460022983]
We investigate the concept of 'AI-friendly code' via a dataset of 5,000 Python files from competitive programming. Our findings confirm that human-friendly code is also more compatible with AI tooling. These results suggest that organizations can use CodeHealth to guide where AI interventions are lower risk and where additional human oversight is warranted.
arXiv Detail & Related papers (2026-01-05T15:23:55Z) - AI Code in the Wild: Measuring Security Risks and Ecosystem Shifts of AI-Generated Code in Modern Software [12.708926174194199]
We present the first large-scale empirical study of AI-generated code (AIGCode) in the wild. We build a high-precision detection pipeline and a benchmark to distinguish AIGCode from human-written code. This lets us label commits, files, and functions along a human/AI axis and trace how AIGCode moves through projects and vulnerability life cycles.
arXiv Detail & Related papers (2025-12-21T02:26:29Z) - A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code [49.009041488527544]
A.S.E is a repository-level evaluation benchmark for assessing the security of AI-generated code. Current large language models (LLMs) still struggle with secure coding. A larger reasoning budget does not necessarily lead to better code generation.
arXiv Detail & Related papers (2025-08-25T15:11:11Z) - Decompiling Smart Contracts with a Large Language Model [51.49197239479266]
Despite 78,047,845 smart contracts deployed on Ethereum (as of May 26, 2025), a mere 767,520 (~1%) are open source on Etherscan. This opacity necessitates the automated semantic analysis of on-chain smart contract bytecode. We introduce a pioneering decompilation pipeline that transforms bytecode into human-readable and semantically faithful Solidity code.
arXiv Detail & Related papers (2025-06-24T13:42:59Z) - Bridging LLM-Generated Code and Requirements: Reverse Generation technique and SBC Metric for Developer Insights [0.0]
This paper introduces a novel scoring mechanism called the SBC score. It is based on a reverse generation technique that leverages the natural language generation capabilities of Large Language Models. Unlike direct code analysis, our approach reconstructs system requirements from AI-generated code and compares them with the original specifications.
arXiv Detail & Related papers (2025-02-11T01:12:11Z) - SOK: Exploring Hallucinations and Security Risks in AI-Assisted Software Development with Insights for LLM Deployment [0.0]
Large Language Models (LLMs) such as GitHub Copilot, ChatGPT, Cursor AI, and Codeium AI have revolutionized the coding landscape. This paper provides a comprehensive analysis of the benefits and risks associated with AI-powered coding tools.
arXiv Detail & Related papers (2025-01-31T06:00:27Z) - Comparing Human and LLM Generated Code: The Jury is Still Out! [8.456554883523472]
We compare the effectiveness of large language models (LLMs) and human programmers in producing Python software code. We use various static analysis benchmarks, including Pylint, Radon, Bandit, and test cases. We observe security flaws in code generated by both humans and GPT-4, but GPT-4 code included more severe outliers.
arXiv Detail & Related papers (2025-01-28T11:11:36Z) - RedCode: Risky Code Execution and Generation Benchmark for Code Agents [50.81206098588923]
RedCode is a benchmark for risky code execution and generation.
RedCode-Exec provides challenging prompts that could lead to risky code execution.
RedCode-Gen provides 160 prompts with function signatures and docstrings as input to assess whether code agents will follow instructions to generate harmful code.
arXiv Detail & Related papers (2024-11-12T13:30:06Z) - CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation. CodeIP is a novel multi-bit watermarking technique that inserts additional information to preserve provenance details. Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z) - CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion [117.178835165855]
This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs.
Our studies reveal a new and universal safety vulnerability of these models against code input.
We find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization.
arXiv Detail & Related papers (2024-03-12T17:55:38Z) - LLM-Powered Code Vulnerability Repair with Reinforcement Learning and Semantic Reward [3.729516018513228]
We introduce a multipurpose code vulnerability analysis system, SecRepair, powered by a large language model, CodeGen2.
Inspired by how humans fix code issues, we propose an instruction-based dataset suitable for vulnerability analysis with LLMs.
We identify zero-day and N-day vulnerabilities in 6 Open Source IoT Operating Systems on GitHub.
arXiv Detail & Related papers (2024-01-07T02:46:39Z) - Generation Probabilities Are Not Enough: Uncertainty Highlighting in AI Code Completions [54.55334589363247]
We study whether conveying information about uncertainty enables programmers to more quickly and accurately produce code.
We find that highlighting tokens with the highest predicted likelihood of being edited leads to faster task completion and more targeted edits. (A toy sketch of the raw token-probability baseline this title contrasts against appears after this list.)
arXiv Detail & Related papers (2023-02-14T18:43:34Z) - CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z)
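On the "Generation Probabilities Are Not Enough" entry above: its point is that edit-likelihood highlighting beats raw generation probabilities. A toy sketch of that probability baseline, per-token confidence from a causal LM via the public transformers API, might look like the following; the model choice (gpt2) and the 0.5 threshold are arbitrary assumptions, and this is not the paper's edit-likelihood method:

```python
# Toy baseline (assumed setup, not the paper's method): flag tokens to which
# a causal LM assigns low probability, using HuggingFace transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # arbitrary small model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

code = "def add(a, b):\n    return a + b"
ids = tok(code, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits

# Probability the model assigned to each token that actually follows.
probs = torch.softmax(logits[0, :-1], dim=-1)
next_ids = ids[0, 1:]
token_probs = probs[torch.arange(next_ids.size(0)), next_ids]

for token, p in zip(tok.convert_ids_to_tokens(next_ids.tolist()), token_probs):
    marker = "  <-- uncertain" if p < 0.5 else ""      # arbitrary threshold
    print(f"{token!r}: {p.item():.3f}{marker}")
```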