Related papers: Automatic Detection of LLM-generated Code: A Case Study of Claude 3 Haiku

Automatic Detection of LLM-generated Code: A Case Study of Claude 3 Haiku

URL: http://arxiv.org/abs/2409.01382v1
Date: Mon, 2 Sep 2024 17:25:15 GMT
Title: Automatic Detection of LLM-generated Code: A Case Study of Claude 3 Haiku
Authors: Musfiqur Rahman, SayedHassan Khatoonabadi, Ahmad Abdellatif, Emad Shihab,
Abstract summary: We perform a case study of Claude 3 Haiku (or Claude 3 for brevity) on CodeSearchNet dataset. We extract 22 software metric features, such as Code Lines and Cyclomatic Complexity, for each level of granularity. We analyze code snippets generated by Claude 3 and their human-authored counterparts using the extracted features to understand how unique the code generated by Claude 3 is.
Score: 3.5411188659374213
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Using Large Language Models (LLMs) has gained popularity among software developers for generating source code. However, the use of LLM-generated code can introduce risks of adding suboptimal, defective, and vulnerable code. This makes it necessary to devise methods for the accurate detection of LLM-generated code. Toward this goal, we perform a case study of Claude 3 Haiku (or Claude 3 for brevity) on CodeSearchNet dataset. We divide our analyses into two parts: function-level and class-level. We extract 22 software metric features, such as Code Lines and Cyclomatic Complexity, for each level of granularity. We then analyze code snippets generated by Claude 3 and their human-authored counterparts using the extracted features to understand how unique the code generated by Claude 3 is. In the following step, we use the unique characteristics of Claude 3-generated code to build Machine Learning (ML) models and identify which features of the code snippets make them more detectable by ML models. Our results indicate that Claude 3 tends to generate longer functions, but shorter classes than humans, and this characteristic can be used to detect Claude 3-generated code with ML models with 82% and 66% accuracies for function-level and class-level snippets, respectively.

Related papers

Evaluating and Achieving Controllable Code Completion in Code LLM [89.64782747840225]
We present the first instruction-guided code completion benchmark, Controllable Code Completion Benchmark (C3-Bench)<n>We reveal substantial gaps in instruction-following capabilities between open-source and advanced proprietary models during code completion tasks.<n>The resulting model, Qwen2.5-Coder-C3, achieves state-of-the-art performance on C3-Bench.
arXiv Detail & Related papers (2026-01-22T11:40:04Z)
Multicalibration for LLM-based Code Generation [0.3568466510804538]
Multicalibration can yield distinct improvements over both uncalibrated token likelihoods and baseline calibrations.<n>We make our dataset (consisting of code generations, likelihoods, and correctness labels) available for future research on code LLM calibration.
arXiv Detail & Related papers (2025-12-09T17:04:01Z)
IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs.<n>The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
Evaluating Efficiency and Novelty of LLM-Generated Code for Graph Analysis [0.40964539027092917]
We present the first systematic study of Large Language Models' ability to generate efficient C implementations of graph-analysis routines.<n>Eight state-of-the-art models are benchmarked.<n>Results show that Claude Sonnet 4 Extended achieves the best result in the case of ready-to-use code generation and efficiency.
arXiv Detail & Related papers (2025-07-09T00:46:30Z)
I Know Which LLM Wrote Your Code Last Summer: LLM generated Code Stylometry for Authorship Attribution [0.0580448704422069]
This paper presents the first systematic study of authorship attribution for C programs.<n>We release CodeT5-Authorship, a novel model that uses only the encoder layers from the original CodeT5 encoder-decoder architecture.<n>Our model achieves 97.56% accuracy in distinguishing C programs generated by closely related models.
arXiv Detail & Related papers (2025-06-18T19:49:41Z)
Rethinking End-to-End 2D to 3D Scene Segmentation in Gaussian Splatting [86.15347226865826]
We design a new end-to-end object-aware lifting approach, named Unified-Lift. We augment each Gaussian point with an additional Gaussian-level feature learned using a contrastive loss to encode instance information. We conduct experiments on three benchmarks: LERF-Masked, Replica, and Messy Rooms.
arXiv Detail & Related papers (2025-03-18T08:42:23Z)
Quantizing Large Language Models for Code Generation: A Differentiated Replication [51.85505914274633]
Large Language Models (LLMs) have shown an impressive capability in code generation and, specifically, to automatically implement requirements described in natural language. LLMs pose significant challenges related to their memory (and, consequently, carbon) footprint. New frontier for LLM quantization is 4-bit precision, resulting in an average memory footprint reduction of 70%.
arXiv Detail & Related papers (2025-03-10T09:26:08Z)
Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities. The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z)
Large Language Models for cross-language code clone detection [3.5202378300682162]
Cross-lingual code clone detection has gained traction with the software engineering community. Inspired by the significant advances in machine learning, this paper revisits cross-lingual code clone detection.
arXiv Detail & Related papers (2024-08-08T12:57:14Z)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions. We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
Assessing the Code Clone Detection Capability of Large Language Models [0.0]
The evaluation involves testing the models on a variety of code pairs of different clone types and levels of similarity. Findings indicate that GPT-4 consistently surpasses GPT-3.5 across all clone types.
arXiv Detail & Related papers (2024-07-02T16:20:44Z)
CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification [73.66920648926161]
We introduce the concept of code hallucinations and propose a classification method for code hallucination based on execution verification. We present a dynamic detection algorithm called CodeHalu designed to detect and quantify code hallucinations. We also introduce the CodeHaluEval benchmark, which includes 8,883 samples from 699 tasks, to systematically and quantitatively evaluate code hallucinations.
arXiv Detail & Related papers (2024-04-30T23:56:38Z)
CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation. CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details. Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z)
InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
InfiBench is the first large-scale freeform question-answering (QA) benchmark for code to our knowledge. It comprises 234 carefully selected high-quality Stack Overflow questions that span across 15 programming languages. We conduct a systematic evaluation for over 100 latest code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z)
Large Language Model-Aware In-Context Learning for Code Generation [75.68709482932903]
Large language models (LLMs) have shown impressive in-context learning (ICL) ability in code generation. We propose a novel learning-based selection approach named LAIL (LLM-Aware In-context Learning) for code generation.
arXiv Detail & Related papers (2023-10-15T06:12:58Z)
Towards Understanding the Capability of Large Language Models on Code Clone Detection: A Survey [40.99060616674878]
Large language models (LLMs) possess diverse code-related knowledge, making them versatile for various software engineering challenges. This paper provides the first comprehensive evaluation of LLMs for clone detection, covering different clone types, languages, and prompts. We find advanced LLMs excel in detecting complex semantic clones, surpassing existing methods.
arXiv Detail & Related papers (2023-08-02T14:56:01Z)
Coarse-Tuning Models of Code with Reinforcement Learning Feedback [0.0]
Large Language Models (LLMs) pre-trained on code have emerged as the dominant approach to program synthesis. We propose RLCF, that further trains a pre-trained LLM via reinforcement learning, using feedback from a grounding function that scores the quality of the code.
arXiv Detail & Related papers (2023-05-25T22:09:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.