Zero-Shot Detection of LLM-Generated Code via Approximated Task Conditioning
- URL: http://arxiv.org/abs/2506.06069v1
- Date: Fri, 06 Jun 2025 13:23:37 GMT
- Title: Zero-Shot Detection of LLM-Generated Code via Approximated Task Conditioning
- Authors: Maor Ashkenazi, Ofir Brenner, Tal Furman Shohet, Eran Treister
- Abstract summary: Detecting Large Language Model (LLM)-generated code is a growing challenge with implications for security, intellectual property, and academic integrity. We investigate the role of conditional probability distributions in improving zero-shot LLM-generated code detection. We propose a novel zero-shot detection approach that approximates the original task used to generate a given code snippet.
- Score: 8.571111167616165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting Large Language Model (LLM)-generated code is a growing challenge with implications for security, intellectual property, and academic integrity. We investigate the role of conditional probability distributions in improving zero-shot LLM-generated code detection, when considering both the code and the corresponding task prompt that generated it. Our key insight is that when evaluating the probability distribution of code tokens using an LLM, there is little difference between LLM-generated and human-written code. However, conditioning on the task reveals notable differences. This contrasts with natural language text, where differences exist even in the unconditional distributions. Leveraging this, we propose a novel zero-shot detection approach that approximates the original task used to generate a given code snippet and then evaluates token-level entropy under the approximated task conditioning (ATC). We further provide a mathematical intuition, contextualizing our method relative to previous approaches. ATC requires neither access to the generator LLM nor the original task prompts, making it practical for real-world applications. To the best of our knowledge, it achieves state-of-the-art results across benchmarks and generalizes across programming languages, including Python, C++, and Java. Our findings highlight the importance of task-level conditioning for LLM-generated code detection. The supplementary materials and code are available at https://github.com/maorash/ATC, including the dataset gathering implementation, to foster further research in this area.
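The following is a minimal sketch of the two-step idea the abstract describes, not the authors' implementation (the official code is in the linked repository): first approximate the task prompt from the code, then score the code by its mean token-level entropy conditioned on that approximated task. The model choice and prompt wording here are assumptions.

```python
# Minimal sketch of approximated task conditioning (ATC), following the
# abstract: (1) approximate the task prompt from the code, (2) score the code
# by its mean token-level entropy conditioned on that task. Illustration only,
# not the authors' implementation (see https://github.com/maorash/ATC);
# the model name and prompt wording are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-Coder-1.5B-Instruct"  # assumption: any local causal LM
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL)
lm.eval()

def approximate_task(code: str) -> str:
    """Step 1: ask the LM to reconstruct a plausible task prompt for the code."""
    prompt = f"Write a one-sentence programming task that this code solves:\n{code}\nTask:"
    ids = tok(prompt, return_tensors="pt").input_ids
    out = lm.generate(ids, max_new_tokens=40, do_sample=False)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

def conditional_token_entropy(code: str, task: str) -> float:
    """Step 2: mean entropy of the LM's predictive distributions over the
    code tokens, conditioned on the (approximated) task."""
    ctx = tok(f"Task: {task}\nSolution:\n", return_tensors="pt").input_ids
    code_ids = tok(code, return_tensors="pt", add_special_tokens=False).input_ids
    ids = torch.cat([ctx, code_ids], dim=1)
    with torch.no_grad():
        logits = lm(ids).logits  # (1, seq_len, vocab)
    # Logits at position i predict token i+1, so slice the code positions.
    probs = torch.softmax(logits[0, ctx.shape[1] - 1 : -1], dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(-1)
    return entropy.mean().item()

snippet = "def add(a, b):\n    return a + b\n"
score = conditional_token_entropy(snippet, approximate_task(snippet))
print(f"conditional entropy: {score:.3f}")  # intuitively, lower for LLM-generated code
```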
Related papers
- R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning [14.208804782749793]
We present R1-Code-Interpreter, an extension of text-only Large Language Models (LLMs) trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL). R1-Code-Interpreter autonomously generates multiple code queries during step-by-step reasoning. Unlike prior RL work on narrow domains, we find that Code Interpreter training is significantly harder due to high task diversity and expensive code execution.
arXiv Detail & Related papers (2025-05-27T18:47:33Z)
- Is Compression Really Linear with Code Intelligence? [60.123628177110206]
Format Annealing is a lightweight, transparent training methodology designed to assess the intrinsic capabilities of pre-trained models equitably. Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and bits-per-character (BPC). Our work provides a more nuanced understanding of compression's role in developing code intelligence and contributes a robust evaluation framework in the code domain.
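As a point of reference for the compression measure named above, here is one common way to compute bits-per-character (BPC) from a causal LM's corpus-level negative log-likelihood; the paper's exact protocol may differ.

```python
# Hedged sketch: a standard conversion from corpus cross-entropy (in nats)
# to bits per character (BPC), the measure referenced in the summary.
import math

def bits_per_character(total_nll_nats: float, num_characters: int) -> float:
    """Convert a corpus-level cross-entropy (in nats) to bits per character."""
    return total_nll_nats / (math.log(2) * num_characters)

# e.g. 1.2e6 nats of NLL over a one-million-character code corpus:
print(bits_per_character(1.2e6, 1_000_000))  # ~1.73 BPC
```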
arXiv Detail & Related papers (2025-05-16T16:59:14Z)
- Program Semantic Inequivalence Game with Large Language Models [10.358176296850639]
Large Language Models (LLMs) can achieve strong performance on everyday coding tasks, but they can fail on complex tasks that require non-trivial reasoning about program semantics. In this work, we explore a method to synthetically generate code reasoning training data based on a semantic inequivalence game, SInQ. We prove that this setup enables theoretically unlimited improvement through self-play in the limit of infinite computational resources.
arXiv Detail & Related papers (2025-05-02T20:03:35Z)
- Uncertainty-Guided Chain-of-Thought for Code Generation with LLMs [45.33160999781074]
Chain-of-Thought (CoT) reasoning has been demonstrated as an effective technique for improving the problem-solving capabilities of large language models (LLMs). We introduce UnCert-CoT, an approach designed to enhance code generation by incorporating an uncertainty-aware CoT reasoning mechanism.
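A hedged sketch of the gating idea suggested by the summary: run the cheap direct completion first and fall back to CoT only when the model's uncertainty is high. The helper callables and the threshold are hypothetical stand-ins, not the paper's interface.

```python
# Hedged sketch of uncertainty-gated chain-of-thought in the spirit of
# UnCert-CoT. `generate_direct`, `generate_with_cot`, and the threshold
# are hypothetical stand-ins, not the paper's API.
from typing import Callable

def uncertainty_gated_codegen(
    prompt: str,
    generate_direct: Callable[[str], tuple[str, float]],  # -> (code, mean token entropy)
    generate_with_cot: Callable[[str], str],
    entropy_threshold: float = 2.0,  # assumption: tuned on a dev set
) -> str:
    code, uncertainty = generate_direct(prompt)
    if uncertainty > entropy_threshold:
        # High entropy suggests the model is unsure; spend the extra
        # tokens on step-by-step reasoning before emitting code.
        return generate_with_cot(prompt)
    return code
```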
arXiv Detail & Related papers (2025-03-19T15:40:45Z)
- An Effective Approach to Embedding Source Code by Combining Large Language and Sentence Embedding Models [6.976968804436321]
This paper proposes a novel approach to embedding source code by combining large language and sentence embedding models. To evaluate the performance of our proposed approach, we conducted a series of experiments on three datasets with different programming languages.
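The summary does not specify the fusion strategy; the sketch below shows one plausible reading, concatenating a mean-pooled hidden state from a causal LM with a sentence-embedding vector. Both model names are assumptions.

```python
# Hedged sketch: combine an LLM representation with a sentence embedding by
# concatenation. Model choices are stand-ins; the paper's fusion may differ.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sentence_transformers import SentenceTransformer

code = "def add(a, b):\n    return a + b\n"

# "LLM" side: mean-pool the last hidden states of a small causal LM (stand-in).
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModel.from_pretrained("gpt2")
with torch.no_grad():
    hidden = lm(**tok(code, return_tensors="pt")).last_hidden_state
llm_emb = hidden.mean(dim=1).squeeze(0).numpy()

# Sentence-embedding side.
sent_emb = SentenceTransformer("all-MiniLM-L6-v2").encode(code)

# Simple fusion by concatenation.
combined = np.concatenate([llm_emb, sent_emb])
print(combined.shape)  # 768 (GPT-2) + 384 (MiniLM) = (1152,)
```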
arXiv Detail & Related papers (2024-09-23T01:03:15Z)
- VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM).
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z)
- Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the original code and its LLM-rewritten variants. Our results demonstrate a significant improvement over existing SOTA synthetic content detectors.
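A minimal sketch of the rewriting signal described above: sample several LLM rewrites of the snippet and average a similarity score against the original, on the intuition that LLM-generated code survives rewriting nearly unchanged. `rewrite_with_llm` is a hypothetical stand-in for any chat-model call, and the character-level similarity is a simplification.

```python
# Hedged sketch of the rewriting-similarity detector idea. The rewriter is
# passed in as a callable; the similarity measure is a simple character-level
# ratio, a simplification of whatever measure the paper uses.
from difflib import SequenceMatcher
from statistics import mean
from typing import Callable

def rewrite_similarity_score(
    code: str,
    rewrite_with_llm: Callable[[str], str],  # hypothetical chat-model call
    num_rewrites: int = 4,
) -> float:
    rewrites = [rewrite_with_llm(code) for _ in range(num_rewrites)]
    return mean(SequenceMatcher(None, code, r).ratio() for r in rewrites)

# A score near 1.0 (rewrites ~identical to the input) suggests LLM-generated code.
```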
arXiv Detail & Related papers (2024-05-25T08:57:28Z)
- Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs [10.510325069289324]
We propose a self-refinement method aimed at improving the reliability of code generated by LLMs.
Our approach is based on targeted Verification Questions (VQs) to identify potential bugs within the initial code.
Our method attempts to repair these potential bugs by re-prompting the LLM with the targeted VQs and the initial code.
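A hedged sketch of the re-prompting loop the summary describes; `derive_vqs` and `ask_llm` are hypothetical stand-ins, not the paper's interface.

```python
# Hedged sketch of self-refinement with targeted verification questions (VQs):
# derive questions about suspect spots in the code, then re-prompt the LLM
# with the questions plus the current code.
from typing import Callable

def refine_with_vqs(
    initial_code: str,
    derive_vqs: Callable[[str], list[str]],  # hypothetical VQ generator
    ask_llm: Callable[[str], str],           # hypothetical chat-model call
    rounds: int = 2,
) -> str:
    code = initial_code
    for _ in range(rounds):
        vqs = derive_vqs(code)
        if not vqs:  # nothing suspicious left to verify
            break
        prompt = ("Revise the code, answering each question:\n"
                  + "\n".join(vqs) + "\n" + code)
        code = ask_llm(prompt)
    return code
```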
arXiv Detail & Related papers (2024-05-22T19:02:50Z)
- Is Next Token Prediction Sufficient for GPT? Exploration on Code Logic Comprehension [18.919972400933393]
We propose an advanced pretraining task, "Next Token Prediction+".
Following this pretraining, both Code Llama and StarCoder, the prevalent code domain pretraining models, display significant improvements on our logically equivalent code selection task and the code completion task.
arXiv Detail & Related papers (2024-04-13T03:11:07Z)
- Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs [65.2379940117181]
We introduce code prompting, a chain of prompts that transforms a natural language problem into code.
We find that code prompting yields a substantial performance boost for multiple LLMs.
Our analysis of GPT-3.5 reveals that the code formatting of the input problem is essential for performance improvement.
arXiv Detail & Related papers (2024-01-18T15:32:24Z)
- Zero-Shot Detection of Machine-Generated Codes [83.0342513054389]
This work proposes a training-free approach for the detection of LLM-generated code.
We find that existing training-based or zero-shot text detectors are ineffective in detecting code.
Our method exhibits robustness against revision attacks and generalizes well to Java code.
arXiv Detail & Related papers (2023-10-08T10:08:21Z)
- LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results.
Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results.
LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
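A minimal sketch of LEVER-style reranking under stated assumptions: `execute` and `verifier_prob` stand in for a sandboxed runner and the trained verifier, and the paper additionally mixes in the generator's own likelihood, which is omitted here.

```python
# Hedged sketch of LEVER-style verification: score each sampled program by a
# verifier that sees the natural-language input, the program, and its
# execution result, then keep the top-scoring candidate.
from typing import Callable

def lever_rerank(
    nl_input: str,
    candidate_programs: list[str],
    execute: Callable[[str], str],                     # hypothetical sandboxed runner
    verifier_prob: Callable[[str, str, str], float],   # P(correct | nl, prog, result)
) -> str:
    scored = [
        (verifier_prob(nl_input, prog, execute(prog)), prog)
        for prog in candidate_programs
    ]
    return max(scored)[1]  # program with the highest verifier probability
```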
arXiv Detail & Related papers (2023-02-16T18:23:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.