$\texttt{Droid}$: A Resource Suite for AI-Generated Code Detection
- URL: http://arxiv.org/abs/2507.10583v3
- Date: Wed, 06 Aug 2025 19:26:28 GMT
- Title: $\texttt{Droid}$: A Resource Suite for AI-Generated Code Detection
- Authors: Daniil Orel, Indraneil Paul, Iryna Gurevych, Preslav Nakov
- Abstract summary: $\textbf{$\texttt{DroidCollection}$}$ is an open data suite for training and evaluating machine-generated code detectors. It includes over a million code samples, seven programming languages, outputs from 43 coding models, and three real-world coding domains. We also develop a suite of encoder-only detectors trained using a multi-task objective over $\texttt{DroidCollection}$.
- Score: 75.6327970381944
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we compile $\textbf{$\texttt{DroidCollection}$}$, the most extensive open data suite for training and evaluating machine-generated code detectors, comprising over a million code samples, seven programming languages, outputs from 43 coding models, and over three real-world coding domains. Alongside fully AI-generated samples, our collection includes human-AI co-authored code, as well as adversarial samples explicitly crafted to evade detection. Subsequently, we develop $\textbf{$\texttt{DroidDetect}$}$, a suite of encoder-only detectors trained using a multi-task objective over $\texttt{DroidCollection}$. Our experiments show that existing detectors' performance fails to generalise to diverse coding domains and programming languages outside of their narrow training data. Additionally, we demonstrate that while most detectors are easily compromised by humanising the output distributions using superficial prompting and alignment approaches, this problem can be easily amended by training on a small amount of adversarial data. Finally, we demonstrate the effectiveness of metric learning and uncertainty-based resampling as means to enhance detector training on possibly noisy distributions.
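As a rough illustration of the encoder-only, multi-task setup described above, the sketch below pairs a shared code encoder with an authorship-classification head and an auxiliary language-identification head. This is a minimal sketch, not the released $\texttt{DroidDetect}$ code: the checkpoint name (microsoft/codebert-base), the label sets, and the loss weighting are illustrative assumptions.

```python
# Minimal sketch of an encoder-only, multi-task AI-generated-code detector.
# NOT the released DroidDetect implementation; checkpoint, label sets, and
# the auxiliary-loss weight below are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiTaskCodeDetector(nn.Module):
    def __init__(self, encoder_name="microsoft/codebert-base",
                 num_authorship_classes=3,   # e.g. human / AI / human-AI co-authored
                 num_languages=7):           # auxiliary task: programming language ID
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.authorship_head = nn.Linear(hidden, num_authorship_classes)
        self.language_head = nn.Linear(hidden, num_languages)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS]-token representation
        return self.authorship_head(cls), self.language_head(cls)

def multitask_loss(auth_logits, lang_logits, auth_labels, lang_labels, aux_weight=0.3):
    # Main objective: authorship classification; auxiliary objective: language ID.
    ce = nn.CrossEntropyLoss()
    return ce(auth_logits, auth_labels) + aux_weight * ce(lang_logits, lang_labels)

# Usage example on a single snippet.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
batch = tokenizer(["def add(a, b):\n    return a + b"], return_tensors="pt",
                  truncation=True, padding=True)
model = MultiTaskCodeDetector()
auth_logits, lang_logits = model(batch["input_ids"], batch["attention_mask"])
```

In a setup like this, the auxiliary head is meant to keep the shared representation discriminative across languages and domains; the metric-learning and uncertainty-based resampling ideas mentioned in the abstract would be additional loss terms or data-selection steps layered on top of this basic objective.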
Related papers
- MultiAIGCD: A Comprehensive dataset for AI Generated Code Detection Covering Multiple Languages, Models, Prompts, and Scenarios [0.0]
We introduce MultiAIGCD, a dataset for AI-generated code detection for Python, Java, and Go. Overall, MultiAIGCD consists of 121,271 AI-generated and 32,148 human-written code snippets.
arXiv Detail & Related papers (2025-07-29T11:16:55Z)
- Your Language Model Can Secretly Write Like Humans: Contrastive Paraphrase Attacks on LLM-Generated Text Detectors [65.27124213266491]
We propose $\textbf{C}$ontrastive $\textbf{P}$araphrase $\textbf{A}$ttack (CoPA), a training-free method that effectively deceives text detectors. CoPA constructs an auxiliary machine-like word distribution as a contrast to the human-like distribution generated by large language models (see the sketch after this list). Our theoretical analysis suggests the superiority of the proposed attack.
arXiv Detail & Related papers (2025-05-21T10:08:39Z)
- On Training a Neural Network to Explain Binaries [43.27448128029069]
In this work, we investigate the possibility of training a deep neural network on the task of binary code understanding.
We build our own dataset derived from a capture of Stack Overflow containing 1.1M entries.
arXiv Detail & Related papers (2024-04-30T15:34:51Z)
- D$^3$: Scaling Up Deepfake Detection by Learning from Discrepancy [29.919663502808575]
Existing literature emphasizes the generalization capability of deepfake detection on unseen generators. This work seeks a step toward a universal deepfake detection system with better generalization and robustness.
arXiv Detail & Related papers (2024-04-06T10:45:02Z)
- Data-Independent Operator: A Training-Free Artifact Representation Extractor for Generalizable Deepfake Detection [105.9932053078449]
In this work, we show that, on the contrary, a small and training-free filter is sufficient to capture more general artifact representations.
Because it is unbiased towards both the training and test sources, we define it as the Data-Independent Operator (DIO) and use it to achieve appealing improvements on unseen sources.
Our detector achieves a remarkable improvement of $13.3\%$, establishing a new state-of-the-art performance.
arXiv Detail & Related papers (2024-03-11T15:22:28Z)
- Assessing AI Detectors in Identifying AI-Generated Code: Implications for Education [8.592066814291819]
We present an empirical study where the LLM is examined for its attempts to bypass detection by AIGC Detectors.
This is achieved by generating code in response to a given question using different variants.
Our results demonstrate that existing AIGC Detectors perform poorly in distinguishing between human-written code and AI-generated code.
arXiv Detail & Related papers (2024-01-08T05:53:52Z)
- Assaying on the Robustness of Zero-Shot Machine-Generated Text Detectors [57.7003399760813]
We explore advanced Large Language Models (LLMs) and their specialized variants, contributing to this field in several ways.
We uncover a significant correlation between topics and detection performance.
These investigations shed light on the adaptability and robustness of these detection methods across diverse topics.
arXiv Detail & Related papers (2023-12-20T10:53:53Z)
- ConDA: Contrastive Domain Adaptation for AI-generated Text Detection [17.8787054992985]
Large language models (LLMs) are increasingly being used for generating text in news articles.
Given the potential malicious nature in which these LLMs can be used to generate disinformation at scale, it is important to build effective detectors for such AI-generated text.
In this work, we tackle this data problem in detecting AI-generated news text and frame it as an unsupervised domain adaptation task.
arXiv Detail & Related papers (2023-09-07T19:51:30Z)
- Large Language Models can be Guided to Evade AI-Generated Text Detection [40.7707919628752]
Large language models (LLMs) have shown remarkable performance in various tasks and have been extensively utilized by the public.
We equip LLMs with prompts, rather than relying on an external paraphraser, to evaluate the vulnerability of these detectors.
We propose a novel Substitution-based In-Context example optimization method (SICO) to automatically construct prompts for evading the detectors.
arXiv Detail & Related papers (2023-05-18T10:03:25Z)
- Can AI-Generated Text be Reliably Detected? [50.95804851595018]
Large Language Models (LLMs) perform impressively well in various applications. The potential for misuse of these models in activities such as plagiarism, generating fake news, and spamming has raised concern about their responsible use. We stress-test the robustness of these AI text detectors in the presence of an attacker.
arXiv Detail & Related papers (2023-03-17T17:53:19Z)
- Self-Supervised Person Detection in 2D Range Data using a Calibrated Camera [83.31666463259849]
We propose a method to automatically generate training labels (called pseudo-labels) for 2D LiDAR-based person detectors.
We show that self-supervised detectors, trained or fine-tuned with pseudo-labels, outperform detectors trained using manual annotations.
Our method is an effective way to improve person detectors during deployment without any additional labeling effort.
arXiv Detail & Related papers (2020-12-16T12:10:04Z)
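For the contrastive paraphrase attack (CoPA) summarised in the list above, the core decoding idea can be sketched as sampling each token from a "human-like" next-token distribution while down-weighting tokens favoured by an auxiliary "machine-like" distribution. The snippet below is a hedged illustration of that idea only, not the authors' implementation; the model name (gpt2), the two prompts, and the contrast weight alpha are placeholder assumptions.

```python
# Rough sketch of contrastive paraphrase decoding in the spirit of CoPA:
# sample from a "human-like" distribution after subtracting a scaled
# "machine-like" distribution. Model, prompts, and alpha are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; an instruction-tuned LLM would be used in practice
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name).eval()

human_prompt = "Rewrite the following text casually, the way a person would:\n"
machine_prompt = "Rewrite the following text in a formal, generic style:\n"

def contrastive_paraphrase(text, max_new_tokens=60, alpha=0.5, top_k=50):
    h_ids = tok(human_prompt + text, return_tensors="pt").input_ids
    m_ids = tok(machine_prompt + text, return_tensors="pt").input_ids
    generated = []
    for _ in range(max_new_tokens):
        with torch.no_grad():
            h_logits = lm(h_ids).logits[:, -1, :]   # "human-like" next-token scores
            m_logits = lm(m_ids).logits[:, -1, :]   # "machine-like" next-token scores
        # Contrast the two distributions, then sample from the top-k of the difference.
        scores = torch.log_softmax(h_logits, -1) - alpha * torch.log_softmax(m_logits, -1)
        topk = torch.topk(scores, top_k, dim=-1)
        probs = torch.softmax(topk.values, dim=-1)
        next_id = topk.indices.gather(-1, torch.multinomial(probs, 1))
        if next_id.item() == tok.eos_token_id:
            break
        generated.append(next_id.item())
        # Extend both contexts with the sampled token.
        h_ids = torch.cat([h_ids, next_id], dim=-1)
        m_ids = torch.cat([m_ids, next_id], dim=-1)
    return tok.decode(generated)
```

Per the summary, the intent of such a contrast is to push the paraphrase's token statistics away from the machine-like patterns that detectors tend to rely on.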
This list is automatically generated from the titles and abstracts of the papers on this site.