Fugu-MT Paper Translation (Abstract): How to Trick Your AI TA: A Systematic Study of Academic Jailbreaking in LLM Code Evaluation

Paper summary: How to Trick Your AI TA: A Systematic Study of Academic Jailbreaking in LLM Code Evaluation

arxiv url: http://arxiv.org/abs/2512.10415v1
Date: Thu, 11 Dec 2025 08:28:33 GMT
Status: Translation complete
Last updated in system: 2025-12-12 16:15:42.278647
Title: How to Trick Your AI TA: A Systematic Study of Academic Jailbreaking in LLM Code Evaluation
Title (reference translation): How to Trick an AI TA: A Systematic Study of Academic Jailbreaking in LLM Code Evaluation
Authors: Devanshu Sahoo, Vasudev Majhi, Arjun Neekhra, Yash Sinha, Murari Mandal, Dhruv Kumar
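Abstract summary: The use of Large Language Models (LLMs) as automatic judges for code evaluation is becoming increasingly common in academic settings. However, their reliability can be undermined by students who use adversarial prompting strategies to induce misgrading and secure undeserved academic advantages. This paper presents the first large-scale study of jailbreaking LLM-based automated code evaluators in an academic context.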
Reference score (independently computed attention score): 7.743292557234699
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The use of Large Language Models (LLMs) as automatic judges for code evaluation is becoming increasingly prevalent in academic environments. However, their reliability can be compromised by students who employ adversarial prompting strategies to induce misgrading and secure undeserved academic advantages. In this paper, we present the first large-scale study of jailbreaking LLM-based automated code evaluators in an academic context. Our contributions are: (i) We systematically adapt 20+ jailbreaking strategies to attack AI code evaluators in the academic context, defining a new class of attacks termed academic jailbreaking. (ii) We release a poisoned dataset of 25K adversarial student submissions, specifically designed for the academic code-evaluation setting, sourced from diverse real-world coursework and paired with rubrics and human-graded references. (iii) To capture the multidimensional impact of academic jailbreaking, we systematically adapt and define three jailbreaking metrics: Jailbreak Success Rate (JSR), Score Inflation, and Harmfulness. (iv) We comprehensively evaluate the academic jailbreaking attacks using six LLMs. We find that these models exhibit significant vulnerability, particularly to persuasive and role-play-based attacks (up to 97% JSR). Our adversarial dataset and benchmark suite lay the groundwork for next-generation robust LLM-based evaluators in academic code assessment.
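Abstract (reference translation): Using Large Language Models (LLMs) as automatic judges for code evaluation is becoming increasingly widespread in academic settings. However, their reliability can be undermined by students who adopt adversarial prompting strategies to induce misgrading and secure undeserved academic advantages. This paper presents the first large-scale study of jailbreaking LLM-based automated code evaluators in an academic context. Our contributions are as follows. (i) We systematically adapt more than 20 jailbreaking strategies to AI code evaluators in the academic context and define a new class of attacks called academic jailbreaking. (ii) We release a poisoned dataset of 25K adversarial student submissions, designed specifically for the academic code-evaluation setting, drawn from diverse real-world coursework and paired with rubrics and human-graded references. (iii) To capture the multidimensional impact of academic jailbreaking, we systematically adapt and define three jailbreaking metrics (Jailbreak Success Rate, Score Inflation, and Harmfulness). (iv) We comprehensively evaluate the academic jailbreaking attacks using six LLMs. These models show significant vulnerability, particularly to persuasive and role-play-based attacks (up to 97% JSR). Our adversarial dataset and benchmark suite lay the groundwork for next-generation robust LLM-based evaluators in academic code assessment.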

Related papers