AIGCodeSet: A New Annotated Dataset for AI Generated Code Detection
- URL: http://arxiv.org/abs/2412.16594v1
- Date: Sat, 21 Dec 2024 11:53:49 GMT
- Title: AIGCodeSet: A New Annotated Dataset for AI Generated Code Detection
- Authors: Basak Demirok, Mucahid Kutlu
- Abstract summary: AIGCodeSet is a dataset for AI-generated code detection tasks, specifically for the Python programming language.
We generate AI-written code with the CodeLlama 34B, Codestral 22B, and Gemini 1.5 Flash models using three approaches.
Overall, AIGCodeSet consists of 2,828 AI-generated and 4,755 human-written code snippets.
- Abstract: With the rapid advancement of LLMs, they have become widely useful in various fields. While these AI systems can be used for code generation, significantly simplifying and accelerating developers' work, their use by students to complete assignments has raised ethical concerns in education. In this context, determining the author of a given piece of code becomes important. In this study, we introduce AIGCodeSet, a dataset for AI-generated code detection tasks, specifically for the Python programming language. We obtain the problem descriptions and human-written code from the CodeNet dataset. Using the problem descriptions, we generate AI-written code with the CodeLlama 34B, Codestral 22B, and Gemini 1.5 Flash models in three approaches: i) generating code from the problem description alone, ii) generating code using the description along with human-written source code containing runtime errors, and iii) generating code using the problem description and human-written code that resulted in wrong answers. Lastly, we apply a post-processing step to eliminate LLM output irrelevant to the code snippets. Overall, AIGCodeSet consists of 2,828 AI-generated and 4,755 human-written code snippets. We share our code with the research community to support studies on this important topic and provide performance results for baseline AI-generated code detection methods.
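As a rough illustration of the three generation settings and the post-processing step described in the abstract, the sketch below composes prompts and strips non-code text from model output. The prompt wording, the `build_prompt` and `extract_code` helpers, and the regex-based cleanup are assumptions made for illustration and are not taken from the released AIGCodeSet code.

```python
# Minimal sketch of the dataset-construction pipeline described in the abstract.
# Prompt phrasing, helper names, and the post-processing regex are assumptions;
# the authors' released code may differ.
import re

def build_prompt(description, human_code=None, outcome=None):
    """Compose a prompt for one of the three generation approaches."""
    if human_code is None:
        # (i) generate from the problem description alone
        return f"Write a Python solution for the following problem:\n{description}"
    # (ii)/(iii) include a faulty human submission (runtime error or wrong answer)
    return (
        f"Problem:\n{description}\n\n"
        f"This Python submission produced a {outcome}:\n{human_code}\n\n"
        "Provide a corrected Python solution."
    )

def extract_code(llm_output):
    """Post-processing: keep only the code snippet, drop surrounding prose."""
    fenced = re.findall(r"```(?:python)?\n(.*?)```", llm_output, flags=re.DOTALL)
    return fenced[0].strip() if fenced else llm_output.strip()

# Usage (model wrappers assumed to expose a common `generate` call):
# for model in (codellama_34b, codestral_22b, gemini_15_flash):
#     snippet = extract_code(model.generate(build_prompt(desc)))
```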
Related papers
- No Man is an Island: Towards Fully Automatic Programming by Code Search, Code Generation and Program Repair [9.562123938545522]
The proposed tool can integrate various code search, generation, and repair tools, combining these three research areas for the first time.
Preliminary experiments demonstrate the potential of the framework, e.g., helping CodeLlama solve 267 programming problems with an improvement of 62.53%.
arXiv Detail & Related papers (2024-09-05T06:24:29Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM).
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z)
- CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that inserts additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z)
- Bugs in Large Language Models Generated Code: An Empirical Study [12.625305075672456]
Large Language Models (LLMs) for code have gained significant attention recently.
Similar to human-written code, LLM-generated code is prone to bugs.
This paper examines a sample of 333 bugs collected from code generated using three leading LLMs.
arXiv Detail & Related papers (2024-03-13T20:12:01Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- Assessing AI Detectors in Identifying AI-Generated Code: Implications for Education [8.592066814291819]
We present an empirical study in which an LLM is examined for its ability to bypass AIGC detectors.
This is achieved by generating code variants in response to a given question.
Our results demonstrate that existing AIGC Detectors perform poorly in distinguishing between human-written code and AI-generated code.
arXiv Detail & Related papers (2024-01-08T05:53:52Z)
- A^3-CodGen: A Repository-Level Code Generation Framework for Code Reuse with Local-Aware, Global-Aware, and Third-Party-Library-Aware [13.27883339389175]
We propose a novel code generation framework, dubbed A3-CodGen, to harness information within the code repository to generate code with fewer potential logical errors.
Results demonstrate that by adopting the A3-CodGen framework, we successfully extract, fuse, and feed code repository information into the LLM, generating more accurate, efficient, and highly reusable code.
arXiv Detail & Related papers (2023-12-10T05:36:06Z)
- CodeT5+: Open Code Large Language Models for Code Understanding and Generation [72.1638273937025]
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence.
CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks.
We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning.
arXiv Detail & Related papers (2023-05-13T14:23:07Z)
- Chatbots As Fluent Polyglots: Revisiting Breakthrough Code Snippets [0.0]
The research applies AI-driven code assistants to analyze a selection of influential computer code that has shaped modern technology.
The study's original contribution is its examination of half of the most significant code advances of the last 50 years.
arXiv Detail & Related papers (2023-01-05T23:17:17Z)
- COSEA: Convolutional Code Search with Layer-wise Attention [90.35777733464354]
We propose a new deep learning architecture, COSEA, which leverages convolutional neural networks with layer-wise attention to capture the code's intrinsic structural logic.
COSEA can achieve significant improvements over state-of-the-art methods on code search tasks.
arXiv Detail & Related papers (2020-10-19T13:53:38Z)