Fugu-MT Paper Translation (Abstract): Whitespaces Don't Lie: Feature-Driven and Embedding-Based Approaches for Detecting Machine-Generated Code

Paper overview: Whitespaces Don't Lie: Feature-Driven and Embedding-Based Approaches for Detecting Machine-Generated Code

arxiv url: http://arxiv.org/abs/2601.19264v1
Date: Tue, 27 Jan 2026 06:43:51 GMT
Status: Translation complete
Last updated in system: 2026-01-28 15:26:51.211246
Title: Whitespaces Don't Lie: Feature-Driven and Embedding-Based Approaches for Detecting Machine-Generated Code
Authors: Syed Mehedi Hasan Nirob, Shamim Ehsan, Moqsadur Rahman, Summit Haque
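Abstract summary: Large language models (LLMs) have made it remarkably easy to synthesize plausible source code from natural language prompts. This paper investigates the problem of distinguishing human-written from machine-generated code by comparing two complementary approaches.
Reference score (independently computed attention score): 0.2624902795082451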
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have made it remarkably easy to synthesize plausible source code from natural language prompts. While this accelerates software development and supports learning, it also raises new risks for academic integrity, authorship attribution, and responsible AI use. This paper investigates the problem of distinguishing human-written from machine-generated code by comparing two complementary approaches: feature-based detectors built from lightweight, interpretable stylometric and structural properties of code, and embedding-based detectors leveraging pretrained code encoders. Using a recent large-scale benchmark dataset of 600k human-written and AI-generated code samples, we find that feature-based models achieve strong performance (ROC-AUC 0.995, PR-AUC 0.995, F1 0.971), while embedding-based models with CodeBERT embeddings are also very competitive (ROC-AUC 0.994, PR-AUC 0.994, F1 0.965). Analysis shows that features tied to indentation and whitespace provide particularly discriminative cues, whereas embeddings capture deeper semantic patterns and yield slightly higher precision. These findings underscore the trade-offs between interpretability and generalization, offering practical guidance for deploying robust code-origin detection in academic and industrial contexts.
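The abstract reports that features tied to indentation and whitespace are particularly discriminative, but it does not list the features themselves. The following is a minimal sketch of what such lightweight stylometric features might look like. The function name `whitespace_features` and every feature it computes are illustrative assumptions, not the authors' feature set.

```python
from statistics import mean, pstdev


def whitespace_features(source: str) -> dict[str, float]:
    """Compute a few illustrative indentation/whitespace statistics for one code sample.

    Hypothetical feature set: the paper's abstract does not enumerate its features.
    """
    lines = source.splitlines()
    non_blank = [ln for ln in lines if ln.strip()]
    # Width of leading whitespace per non-blank line (a tab counts as one character).
    indents = [len(ln) - len(ln.lstrip(" \t")) for ln in non_blank]
    n_chars = len(source)
    n_non_blank = len(non_blank)
    return {
        # Share of lines that are empty or whitespace-only.
        "blank_line_ratio": (len(lines) - n_non_blank) / len(lines) if lines else 0.0,
        # Average and spread of indentation width.
        "mean_indent": mean(indents) if indents else 0.0,
        "std_indent": pstdev(indents) if indents else 0.0,
        # Share of non-blank lines whose first character is a tab.
        "tab_indented_ratio": (
            sum(ln.startswith("\t") for ln in non_blank) / n_non_blank if n_non_blank else 0.0
        ),
        # Share of non-blank lines that end in trailing whitespace.
        "trailing_ws_ratio": (
            sum(ln != ln.rstrip() for ln in non_blank) / n_non_blank if n_non_blank else 0.0
        ),
        # Share of all characters that are whitespace.
        "whitespace_char_ratio": (
            sum(ch.isspace() for ch in source) / n_chars if n_chars else 0.0
        ),
    }


if __name__ == "__main__":
    sample = "def f(x):\n    return x\n\n\tprint(x) \n"
    for name, value in whitespace_features(sample).items():
        print(f"{name}: {value:.3f}")
```

A vector like this could be passed to any tabular classifier and scored with ROC-AUC, PR-AUC, and F1, the metrics the abstract reports. The abstract does not say which classifier the authors used.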
