Fugu-MT 論文翻訳(概要): LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge

論文の概要: LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge

arxiv url: http://arxiv.org/abs/2506.09443v1
Date: Wed, 11 Jun 2025 06:48:57 GMT
ステータス: 翻訳完了
システム内更新日: 2025-06-13 06:35:02.644698
Title: LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge
Title（参考訳）: LLMは信頼できない (Yet?): LLM-as-a-Judgeのロバスト性に関する総合的評価
Authors: Songze Li, Chuokun Xu, Jiaying Wang, Xueluan Gong, Chen Chen, Jirui Zhang, Jun Wang, Kwok-Yan Lam, Shouling Ji,
Abstract要約: 大規模言語モデル(LLM)は、様々なタスクにまたがる顕著な知性を示してきた。これらのシステムは、評価結果を操作できる敵攻撃の影響を受けやすい。 LLMに基づく審査員による既存の評価手法は、しばしば断片的であり、包括的な評価のための統一された枠組みが欠如している。
参考スコア（独自算出の注目度）: 44.6358611761225
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have demonstrated remarkable intelligence across various tasks, which has inspired the development and widespread adoption of LLM-as-a-Judge systems for automated model testing, such as red teaming and benchmarking. However, these systems are susceptible to adversarial attacks that can manipulate evaluation outcomes, raising concerns about their robustness and, consequently, their trustworthiness. Existing evaluation methods adopted by LLM-based judges are often piecemeal and lack a unified framework for comprehensive assessment. Furthermore, prompt template and model selections for improving judge robustness have been rarely explored, and their performance in real-world settings remains largely unverified. To address these gaps, we introduce RobustJudge, a fully automated and scalable framework designed to systematically evaluate the robustness of LLM-as-a-Judge systems. RobustJudge investigates the impact of attack methods and defense strategies (RQ1), explores the influence of prompt template and model selection (RQ2), and assesses the robustness of real-world LLM-as-a-Judge applications (RQ3).Our main findings are: (1) LLM-as-a-Judge systems are still vulnerable to a range of adversarial attacks, including Combined Attack and PAIR, while defense mechanisms such as Re-tokenization and LLM-based Detectors offer improved protection; (2) Robustness is highly sensitive to the choice of prompt template and judge models. Our proposed prompt template optimization method can improve robustness, and JudgeLM-13B demonstrates strong performance as a robust open-source judge; (3) Applying RobustJudge to Alibaba's PAI platform reveals previously unreported vulnerabilities. The source code of RobustJudge is provided at https://github.com/S3IC-Lab/RobustJudge.
Abstract（参考訳）: 大規模言語モデル(LLM)は、様々なタスクにまたがる顕著なインテリジェンスを示しており、レッドチームやベンチマークのような自動モデルテストのためのLLM-as-a-Judgeシステムの開発と普及にインスピレーションを与えている。しかし、これらのシステムは、評価結果を操作し、その堅牢性に対する懸念を提起し、その結果、信頼感を高める敵対的な攻撃に影響を受けやすい。 LLMに基づく審査員による既存の評価手法は、しばしば断片的であり、包括的な評価のための統一された枠組みが欠如している。さらに,判定の堅牢性向上のためのテンプレートとモデル選択の迅速な検討はめったに行われておらず,実環境におけるその性能は未検証のままである。このギャップに対処するために、LLM-as-a-Judgeシステムの堅牢性を体系的に評価するために設計された、完全に自動化されスケーラブルなフレームワークであるRobustJudgeを紹介します。 RobustJudgeは、攻撃方法と防御戦略(RQ1)の影響を調査し、プロンプトテンプレートとモデル選択(RQ2)の影響を調査し、現実世界のLLM-as-a-Judgeアプリケーション(RQ3)の堅牢性を評価する。主な発見は, 1) LLM-as-a-Judgeシステムはまだ, 攻撃とPAIRの組み合わせを含む様々な敵攻撃に対して脆弱であり, 一方, 再起動やLSMベースのディテクターなどの防御機構は, 防御性を向上し, 2) ロバストネスは, プロンプトテンプレートと判定モデルの選択に非常に敏感である。提案したプロンプトテンプレート最適化手法はロバスト性を向上し,JiceLM-13Bはロバストなオープンソースジャッジとして高いパフォーマンスを示す。 RobustJudgeのソースコードはhttps://github.com/S3IC-Lab/RobustJudgeにある。

論文の概要: LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge

関連論文リスト