AICD Bench: A Challenging Benchmark for AI-Generated Code Detection
- URL: http://arxiv.org/abs/2602.02079v1
- Date: Mon, 02 Feb 2026 13:24:14 GMT
- Title: AICD Bench: A Challenging Benchmark for AI-Generated Code Detection
- Authors: Daniil Orel, Dilshod Azizov, Indraneil Paul, Yuxia Wang, Iryna Gurevych, Preslav Nakov
- Abstract summary: AICD Bench is the most comprehensive benchmark for AI-generated code detection. It spans $\emph{2M examples}$, $\emph{77 models}$ across $\emph{11 families}$, and $\emph{9 programming languages}$, including recent reasoning models.
- Score: 91.21422299346199
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly capable of generating functional source code, raising concerns about authorship, accountability, and security. While detecting AI-generated code is critical, existing datasets and benchmarks are narrow, typically limited to binary human-machine classification under in-distribution settings. To bridge this gap, we introduce $\emph{AICD Bench}$, the most comprehensive benchmark for AI-generated code detection. It spans $\emph{2M examples}$, $\emph{77 models}$ across $\emph{11 families}$, and $\emph{9 programming languages}$, including recent reasoning models. Beyond scale, AICD Bench introduces three realistic detection tasks: ($\emph{i}$)~$\emph{Robust Binary Classification}$ under distribution shifts in language and domain, ($\emph{ii}$)~$\emph{Model Family Attribution}$, grouping generators by architectural lineage, and ($\emph{iii}$)~$\emph{Fine-Grained Human-Machine Classification}$ across human, machine, hybrid, and adversarial code. Extensive evaluation on neural and classical detectors shows that performance remains far below practical usability, particularly under distribution shift and for hybrid or adversarial code. We release AICD Bench as a $\emph{unified, challenging evaluation suite}$ to drive the next generation of robust approaches for AI-generated code detection. The data and the code are available at https://huggingface.co/AICD-bench.
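As a rough illustration of how a detection benchmark like this might be consumed, the sketch below loads a hypothetical binary human-vs-machine split from the Hugging Face Hub and scores a simple classical baseline. The repository id ("AICD-bench/aicd-bench"), configuration name, and column names are assumptions made for illustration, not the benchmark's documented layout; consult the linked Hugging Face page for the actual structure.

```python
# Minimal sketch of consuming the benchmark for the binary human-vs-machine task.
# NOTE: the repo id "AICD-bench/aicd-bench", the "binary" config, and the
# "code"/"label" column names are assumptions for illustration, not a documented API.
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

ds = load_dataset("AICD-bench/aicd-bench", "binary")   # hypothetical repo/config
train, test = ds["train"], ds["test"]

# Character n-gram TF-IDF + logistic regression: a common classical baseline.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), max_features=50_000)
X_train = vec.fit_transform(train["code"])
X_test = vec.transform(test["code"])

clf = LogisticRegression(max_iter=1000).fit(X_train, train["label"])
print("F1:", f1_score(test["label"], clf.predict(X_test)))
```

Under the benchmark's distribution-shift protocol, the same baseline would be trained on one language or domain and scored on another, which is where the abstract reports performance dropping well below practical usability.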
Related papers
- AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms [54.99368693313797]
Existing benchmarks test only individual languages/tools, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of $77$ classical algorithms in Dafny, Verus, and Lean.
arXiv Detail & Related papers (2026-02-10T06:58:26Z) - INC: An Indirect Neural Corrector for Auto-Regressive Hybrid PDE Solvers [61.84396402100827]
We propose the Indirect Neural Corrector ($\mathrm{INC}$), which integrates learned corrections into the governing equations. $\mathrm{INC}$ reduces the error amplification on the order of $t^{-1} + L$, where $t$ is the timestep and $L$ the Lipschitz constant. We test $\mathrm{INC}$ in extensive benchmarks, covering numerous differentiable solvers, neural backbones, and test cases ranging from a 1D chaotic system to 3D turbulence.
arXiv Detail & Related papers (2025-11-16T20:14:28Z) - DuoLens: A Framework for Robust Detection of Machine-Generated Multilingual Text and Code [5.38764489657443]
The growing use of Large Language Models (LLMs) for generating multilingual text and source code has only increased the need for machine-generated content detectors that are accurate and efficient across domains. Current detectors, which predominantly rely on zero-shot methods such as Fast DetectGPT or GPTZero, either incur high computational cost or lack sufficient accuracy, often trading one for the other. We propose fine-tuning encoder-only Small Language Models (SLMs), in particular the pre-trained RoBERTa and CodeBERTa models, on specialized datasets of source code and natural language (a minimal fine-tuning sketch appears after the paper list below).
arXiv Detail & Related papers (2025-10-21T00:17:00Z) - MultiAIGCD: A Comprehensive dataset for AI Generated Code Detection Covering Multiple Languages, Models, Prompts, and Scenarios [0.0]
We introduce MultiAIGCD, a dataset for AI-generated code detection covering Python, Java, and Go. Overall, MultiAIGCD consists of 121,271 AI-generated and 32,148 human-written code snippets.
arXiv Detail & Related papers (2025-07-29T11:16:55Z) - $\texttt{Droid}$: A Resource Suite for AI-Generated Code Detection [75.6327970381944]
$\texttt{DroidCollection}$ is an open data suite for training and evaluating machine-generated code detectors. It includes over a million code samples, seven programming languages, outputs from 43 coding models, and three real-world coding domains. We also develop a suite of encoder-only detectors trained using a multi-task objective over $\texttt{DroidCollection}$.
arXiv Detail & Related papers (2025-07-11T12:19:06Z) - BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation [77.55074597806035]
GenBuster-200K is a large-scale, high-quality AI-generated video dataset featuring 200K high-resolution video clips. BusterX is a novel AI-generated video detection and explanation framework leveraging a multimodal large language model (MLLM) and reinforcement learning.
arXiv Detail & Related papers (2025-05-19T02:06:43Z) - Almost AI, Almost Human: The Challenge of Detecting AI-Polished Writing [55.2480439325792]
This study systematically evaluates twelve state-of-the-art AI-text detectors using our AI-Polished-Text Evaluation dataset. Our findings reveal that detectors frequently flag even minimally polished text as AI-generated, struggle to differentiate between degrees of AI involvement, and exhibit biases against older and smaller models.
arXiv Detail & Related papers (2025-02-21T18:45:37Z) - An Empirical Study on Automatically Detecting AI-Generated Source Code: How Far Are We? [8.0988059417354]
We propose a range of approaches to improve the performance of AI-generated code detection.
Our best model outperforms the state-of-the-art AI-generated code detector (GPTSniffer) and achieves an F1 score of 82.55.
arXiv Detail & Related papers (2024-11-06T22:48:18Z) - CGEMs: A Metric Model for Automatic Code Generation using GPT-3 [0.0]
This work aims to validate AI-generated content either with theoretical proofs or with Monte-Carlo simulation methods; here we take the latter approach, testing a statistically significant number of samples. The metrics gathered to support the evaluation of the generated code are: compilation, NL-description-to-logic conversion, number of edits needed, and several commonly used static-code and NLP metrics (a simplified sketch of two of these metrics appears after the paper list).
arXiv Detail & Related papers (2021-08-23T13:28:57Z)
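As referenced in the CGEMs entry above, the sketch below illustrates two of the listed metrics in simplified form: a compilation check and a line-edit count against a reference solution. The helper names and the use of Python's compile() and difflib are illustrative choices, not the paper's implementation.

```python
# Simplified illustration of two CGEMs-style metrics: does the generated code
# compile, and how many line edits separate it from a reference solution?
# These helpers are illustrative, not the metric implementations from the paper.
import difflib

def compiles(source: str) -> bool:
    """Return True if the snippet parses/compiles as a Python module."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:
        return False

def edit_count(generated: str, reference: str) -> int:
    """Count inserted/deleted lines between generated and reference code."""
    diff = difflib.ndiff(generated.splitlines(), reference.splitlines())
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))

sample = "def square(x):\n    return x * x\n"
reference = "def square(x: int) -> int:\n    return x * x\n"
print(compiles(sample), edit_count(sample, reference))
```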
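As referenced in the DuoLens entry above, the sketch below fine-tunes an encoder-only code model as a binary machine-generated-code detector with the Hugging Face Trainer. The checkpoint name, toy dataset, label convention, and hyperparameters are placeholders, not the paper's training recipe.

```python
# Minimal sketch: fine-tuning an encoder-only model as a binary detector of
# machine-generated code. The checkpoint, toy data, label convention, and
# hyperparameters are illustrative placeholders, not the DuoLens configuration.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "huggingface/CodeBERTa-small-v1"  # any RoBERTa-style code encoder
tok = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy examples; in practice this would be a labeled human/machine code corpus.
data = Dataset.from_dict({
    "text": ["def add(a, b):\n    return a + b", "print('hello world')"],
    "label": [1, 0],  # 1 = machine-generated, 0 = human-written (assumed convention)
})
data = data.map(
    lambda ex: tok(ex["text"], truncation=True, padding="max_length", max_length=256),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="detector", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=data,
)
trainer.train()
```

The same fine-tuned encoder could then be scored on held-out code from unseen generators or languages, which is the kind of distribution-shift setting the benchmarks above emphasize.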