CodeMirage: A Multi-Lingual Benchmark for Detecting AI-Generated and Paraphrased Source Code from Production-Level LLMs
- URL: http://arxiv.org/abs/2506.11059v1
- Date: Tue, 27 May 2025 03:25:12 GMT
- Title: CodeMirage: A Multi-Lingual Benchmark for Detecting AI-Generated and Paraphrased Source Code from Production-Level LLMs
- Authors: Hanxi Guo, Siyuan Cheng, Kaiyuan Zhang, Guangyu Shen, Xiangyu Zhang
- Abstract summary: Large language models (LLMs) have become integral to modern software development, producing vast amounts of AI-generated source code. Existing benchmarks fall short -- most cover only a limited set of programming languages and rely on less capable generative models. We present CodeMirage, a benchmark that spans ten widely used programming languages and includes both original and paraphrased code samples.
- Score: 15.25980318643715
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) have become integral to modern software development, producing vast amounts of AI-generated source code. While these models boost programming productivity, their misuse introduces critical risks, including code plagiarism, license violations, and the propagation of insecure programs. As a result, robust detection of AI-generated code is essential. To support the development of such detectors, a comprehensive benchmark that reflects real-world conditions is crucial. However, existing benchmarks fall short -- most cover only a limited set of programming languages and rely on less capable generative models. In this paper, we present CodeMirage, a comprehensive benchmark that addresses these limitations through three major advancements: (1) it spans ten widely used programming languages, (2) includes both original and paraphrased code samples, and (3) incorporates outputs from ten state-of-the-art production-level LLMs, including both reasoning and non-reasoning models from six major providers. Using CodeMirage, we evaluate ten representative detectors across four methodological paradigms under four realistic evaluation configurations, reporting results using three complementary metrics. Our analysis reveals nine key findings that uncover the strengths and weaknesses of current detectors, and identify critical challenges for future work. We believe CodeMirage offers a rigorous and practical testbed to advance the development of robust and generalizable AI-generated code detectors.
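As a rough illustration of the evaluation loop such a benchmark implies, here is a minimal Python sketch that scores labeled code samples with a detector and reports precision, recall, and F1. `detector_score` is a hypothetical stand-in for any of the benchmarked detectors; the heuristic inside it is a toy placeholder, not a real detection method.

```python
def detector_score(code: str) -> float:
    """Hypothetical detector stub: returns an estimated P(AI-generated).

    The heuristic below (longer average line length looks "AI-like") is a
    toy placeholder; a real detector would be one of the benchmarked models.
    """
    lines = [ln for ln in code.splitlines() if ln.strip()]
    avg_len = sum(len(ln) for ln in lines) / max(len(lines), 1)
    return min(avg_len / 80.0, 1.0)


def evaluate(samples, threshold=0.5):
    """samples: iterable of (code, label) pairs; label 1 = AI-generated."""
    tp = fp = fn = 0
    for code, label in samples:
        pred = detector_score(code) >= threshold
        if pred and label == 1:
            tp += 1
        elif pred and label == 0:
            fp += 1
        elif not pred and label == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


demo = [("x=1", 0),
        ("def add(a, b):\n    return a + b  # add the two operands", 1)]
print(evaluate(demo, threshold=0.3))
```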
Related papers
- Large Language Models Versus Static Code Analysis Tools: A Systematic Benchmark for Vulnerability Detection [0.0]
Three industry-standard rule-based static code-analysis tools (Sonar, CodeQL, and Snyk Code) and three state-of-the-art large language models hosted on the GitHub Models platform (GPT-4.1, Mistral Large, and DeepSeek V3) were evaluated. Using a curated suite of ten real-world C# projects that embed 63 vulnerabilities, we measure classical accuracy (precision, recall, F-score), analysis latency, granularity, and the developer effort required to vet true positives. We recommend a hybrid pipeline: employ language models early in development for broad, context-aware detection, and ...
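A minimal sketch of what such a hybrid pipeline could look like, assuming hypothetical `Finding` records rather than real Sonar/CodeQL/Snyk or GPT-4.1 output formats: findings flagged by both an LLM pass and a static analyzer are surfaced as confirmed, while LLM-only findings go to a lower-priority review queue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    rule: str


def merge(llm_findings, sast_findings):
    """Treat findings flagged by both sources as confirmed; keep LLM-only
    findings in a lower-priority review queue."""
    confirmed = set(llm_findings) & set(sast_findings)
    review = set(llm_findings) - confirmed
    key = lambda f: (f.file, f.line)
    return sorted(confirmed, key=key), sorted(review, key=key)


llm = [Finding("auth.cs", 42, "sql-injection"), Finding("io.cs", 7, "path-traversal")]
sast = [Finding("auth.cs", 42, "sql-injection")]
confirmed, review = merge(llm, sast)
print("confirmed:", confirmed)
print("needs review:", review)
```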
arXiv Detail & Related papers (2025-08-06T13:48:38Z)
- IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs. The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
- MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks [56.34018316319873]
We propose MERA Code, a benchmark for evaluating the code produced by the latest code-generation LLMs in Russian. The benchmark includes 11 evaluation tasks spanning 8 programming languages. We evaluate open LLMs and frontier API models, analyzing their limitations on practical coding tasks in non-English languages.
arXiv Detail & Related papers (2025-07-16T14:31:33Z)
- Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
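A hedged sketch of the kind of reward signal such a framework could combine, pairing functional correctness with program-analysis feedback. `run_tests` and `lint_issues` are hypothetical stubs, not the paper's implementation, and the weighting is an illustrative choice.

```python
def run_tests(code: str) -> float:
    """Stub: fraction of unit tests passed (a real system runs a sandboxed suite)."""
    return 1.0 if "return" in code else 0.0


def lint_issues(code: str) -> int:
    """Stub: number of analyzer warnings (a real system calls a static analyzer)."""
    return code.count("eval(")  # toy proxy for a dangerous-construct check


def reward(code: str, w_quality: float = 0.3) -> float:
    """Correctness-dominated reward with a capped penalty for quality issues."""
    correctness = run_tests(code)
    penalty = min(lint_issues(code) * 0.1, 1.0)
    return (1 - w_quality) * correctness - w_quality * penalty


print(reward("def double(x):\n    return x * 2\n"))  # 0.7
```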
arXiv Detail & Related papers (2025-05-28T17:57:47Z)
- CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings [32.72039589832989]
Large language models (LLMs) have revolutionized code generation, automating programming with remarkable efficiency. These advancements challenge programming skills, ethics, and assessment integrity, making the detection of LLM-generated code essential for maintaining accountability and standards. We propose a framework capable of distinguishing between human- and LLM-written code across multiple programming languages, code generators, and domains.
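For flavor, a minimal sketch of the supervised-detector paradigm this line of work falls into: language-agnostic character n-gram features with a linear classifier over labeled code. The tiny inline corpus is illustrative only, not the paper's dataset or model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus: terse human style vs. verbose, commented LLM style.
human = ["i=0\nwhile i<n: print(i); i+=1"]
llm = ["for index in range(n):\n    print(index)  # iterate over the range"]
X = human + llm
y = [0] * len(human) + [1] * len(llm)  # 0 = human, 1 = LLM-written

clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),  # language-agnostic
    LogisticRegression(max_iter=1000),
)
clf.fit(X, y)
print(clf.predict_proba(["for i in range(10):\n    print(i)"]))
```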
arXiv Detail & Related papers (2025-03-17T21:41:37Z)
- CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation [20.013757490442064]
We introduce CodeIF, the first benchmark designed to assess the abilities of Large Language Models (LLMs) to adhere to task-oriented instructions. CodeIF encompasses a broad range of tasks, including function synthesis, algorithmic instructions, and code explanation. We conduct extensive experiments with LLMs, analyzing their strengths and limitations in meeting the demands of these tasks.
arXiv Detail & Related papers (2025-02-26T14:19:49Z)
- Bridging LLM-Generated Code and Requirements: Reverse Generation technique and SBC Metric for Developer Insights [0.0]
This paper introduces a novel scoring mechanism called the SBC score. It is based on a reverse generation technique that leverages the natural language generation capabilities of Large Language Models. Unlike direct code analysis, our approach reconstructs system requirements from AI-generated code and compares them with the original specifications.
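A minimal sketch of the reverse-generation idea, assuming a hypothetical `ask_llm` call and a simple lexical similarity in place of the paper's actual semantic comparison.

```python
from difflib import SequenceMatcher


def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call that summarizes code back into requirements."""
    return "Return the factorial of a non-negative integer n."


def sbc_like_score(original_spec: str, code: str) -> float:
    reconstructed = ask_llm(f"State the requirements this code satisfies:\n{code}")
    # Lexical similarity as a crude stand-in for the paper's semantic comparison.
    return SequenceMatcher(None, original_spec.lower(), reconstructed.lower()).ratio()


spec = "Compute the factorial of a non-negative integer n."
code = "def fact(n):\n    return 1 if n <= 1 else n * fact(n - 1)\n"
print(round(sbc_like_score(spec, code), 2))
```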
arXiv Detail & Related papers (2025-02-11T01:12:11Z)
- An Empirical Study on Automatically Detecting AI-Generated Source Code: How Far Are We? [8.0988059417354]
We propose a range of approaches to improve the performance of AI-generated code detection.
Our best model outperforms the state-of-the-art AI-generated code detector (GPTSniffer), achieving an F1 score of 82.55.
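A hedged sketch of the fine-tuned-classifier approach such studies build on: a pretrained code encoder with a binary classification head, trained on human- vs. AI-labeled snippets. The base checkpoint and setup here are assumptions for illustration, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed base checkpoint; the paper's exact model and training setup differ.
tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2  # 0 = human-written, 1 = AI-generated
)

batch = tok(["def add(a, b):\n    return a + b"],
            return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    probs = model(**batch).logits.softmax(dim=-1)
print(probs)  # near-uniform until the head is fine-tuned on labeled code
```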
arXiv Detail & Related papers (2024-11-06T22:48:18Z)
- Harnessing the Power of LLMs in Source Code Vulnerability Detection [0.0]
Software vulnerabilities, caused by unintentional flaws in source code, are a primary root cause of cyberattacks.
We harness Large Language Models' capabilities to analyze source code and detect known vulnerabilities.
arXiv Detail & Related papers (2024-08-07T00:48:49Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs in incorrect code, comprising three categories and 12 sub-categories, and analyze the root causes of common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
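A minimal sketch of such a training-free repair loop: compile the candidate, feed the error back to the model, and retry. `llm_revise` is a hypothetical stand-in for the critique-and-correct call; its toy string replacement only exists to make the demo run.

```python
def llm_revise(code: str, error: str) -> str:
    """Hypothetical critique-and-correct call: a real system prompts the LLM
    with the code, the bug-type taxonomy, and the compiler error message."""
    return code.replace("retrun", "return")  # toy fix for the demo error


def iterative_repair(code: str, max_rounds: int = 3) -> str:
    for _ in range(max_rounds):
        try:
            compile(code, "<candidate>", "exec")  # compiler feedback signal
            return code  # compiles cleanly; stop iterating
        except SyntaxError as err:
            code = llm_revise(code, str(err))
    return code


print(iterative_repair("def double(x):\n    retrun x * 2\n"))
```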
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the original code and its LLM-rewritten variants. Our results demonstrate a significant improvement over existing SOTA synthetic content detectors.
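The underlying intuition is that LLM-generated code changes less under LLM rewriting than human code does, so self-similarity can serve as a zero-shot score. A minimal sketch, assuming a hypothetical `llm_rewrite` call and a simple lexical similarity (the paper's measure may differ):

```python
from difflib import SequenceMatcher


def llm_rewrite(code: str) -> str:
    """Hypothetical stub: a real call asks an LLM to rewrite the code while
    preserving its behavior, and returns the model's rewrite."""
    return code  # placeholder only


def synthetic_score(code: str, n_rewrites: int = 4) -> float:
    """Average self-similarity across rewrites; higher suggests LLM-generated."""
    sims = [SequenceMatcher(None, code, llm_rewrite(code)).ratio()
            for _ in range(n_rewrites)]
    return sum(sims) / len(sims)


print(synthetic_score("def add(a, b):\n    return a + b\n"))
```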
arXiv Detail & Related papers (2024-05-25T08:57:28Z)
- CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation. CodeIP is a novel multi-bit watermarking technique that inserts additional information to preserve provenance details. Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
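A simplified sketch of multi-bit watermarking at decoding time, the general family CodeIP belongs to: partition candidate tokens by a keyed hash and prefer the partition encoding the next payload bit. CodeIP's grammar-guided constraints are omitted here, and all names and candidate lists are illustrative assumptions.

```python
import hashlib


def bit_of(token: str, key: str = "secret") -> int:
    """Assign each token to partition 0 or 1 via a keyed hash."""
    return hashlib.sha256((key + token).encode()).digest()[0] & 1


def pick_token(candidates: list[str], payload_bit: int) -> str:
    """Prefer candidates whose partition matches the bit being embedded."""
    preferred = [t for t in candidates if bit_of(t) == payload_bit]
    return (preferred or candidates)[0]  # fall back if no candidate matches


# Embed the payload bits 1, 0 across two decoding steps with toy candidates.
print(pick_token(["i", "idx", "index"], 1), pick_token(["for", "while"], 0))
```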
arXiv Detail & Related papers (2024-04-24T04:25:04Z)
- A Comparative Study of Code Generation using ChatGPT 3.5 across 10 Programming Languages [0.0]
Large Language Models (LLMs) are advanced Artificial Intelligence (AI) systems that have undergone extensive training.
This research investigates the coding proficiency of ChatGPT 3.5, an LLM released by OpenAI in November 2022.
The model's skill in creating code snippets is evaluated across 10 programming languages and 4 software domains.
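A hedged sketch of what such an evaluation grid might look like: iterate over (language, domain) pairs, prompt the model, and record whether each snippet passes its checks. `generate`, `passes_checks`, and the shortened language/domain lists are hypothetical stand-ins, not the study's actual harness.

```python
from itertools import product

LANGUAGES = ["Python", "Java", "C++", "Go"]          # the study covers 10
DOMAINS = ["algorithms", "web", "data", "systems"]   # the study covers 4


def generate(language: str, domain: str, task: str) -> str:
    """Stub for the model call; a real harness prompts ChatGPT 3.5 here."""
    return f"// {language}/{domain} solution for: {task}"


def passes_checks(snippet: str) -> bool:
    """Stub: a real harness compiles and tests the snippet per language."""
    return bool(snippet)


results = {(lang, dom): passes_checks(generate(lang, dom, "reverse a string"))
           for lang, dom in product(LANGUAGES, DOMAINS)}
rate = sum(results.values()) / len(results)
print(f"pass rate across {len(results)} cells: {rate:.0%}")
```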
arXiv Detail & Related papers (2023-08-08T15:02:32Z)
- CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z)