Fugu-MT Paper Translation (Abstract): Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization

Paper summary: Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization

arxiv url: http://arxiv.org/abs/2604.20726v2
Date: Thu, 23 Apr 2026 08:13:50 GMT
Status: Translation complete
Last updated in system: 2026-04-24 14:40:06.061778
Title: Exploiting LLM-as-a-Judge Disposition on Free Text Legal QA via Prompt Optimization
Authors: Mohamed Hesham Elganayni, Runsheng Chen, Sebastian Nagl, Matthias Grabmair
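Abstract summary: This work examines the role of prompt design and judge selection in LLM-as-a-Judge evaluation of free-text legal question answering. It asks whether automatic task prompt optimization outperforms human-centered design, whether the effectiveness of optimization varies with the judge's feedback style, and whether optimized prompts transfer across judges.
Reference score (independently computed attention score): 9.980463738635718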
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This work explores the role of prompt design and judge selection in LLM-as-a-Judge evaluations of free text legal question answering. We examine whether automatic task prompt optimization improves over human-centered design, whether optimization effectiveness varies by judge feedback style, and whether optimized prompts transfer across judges. We systematically address these questions on the LEXam benchmark by optimizing task prompts using the ProTeGi method with feedback from two judges (Qwen3-32B, DeepSeek-V3) across four task models, and then testing cross-judge transfer. Automatic optimization consistently outperforms the baseline, with lenient judge feedback yielding higher and more consistent gains than strict judge feedback. Prompts optimized with lenient feedback transfer better to strict judges than the reverse direction. Analysis reveals that lenient judges provide permissive feedback, yielding prompts with broader applicability, whereas strict judges produce restrictive feedback, leading to judge-specific overfitting. Our findings demonstrate algorithmically optimizing prompts on training data can outperform human-centered prompt design and that judges' dispositions during optimization shape prompt generalizability.
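The workflow the abstract describes (optimize a task prompt against one judge, then score the result under every judge) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: ProTeGi generates candidate rewrites from LLM-written critiques and selects them with beam search, whereas here the hypothetical `propose_edits` function hard-codes two rewrites, and `toy_task_model`, `lenient_judge`, and `strict_judge` are invented stand-ins for real model calls. Any scores it produces are artifacts of the toy and say nothing about the paper's results.

```python
from typing import Callable, Dict, List, Tuple

Dataset = List[Tuple[str, str]]            # (question, reference answer) pairs
Judge = Callable[[str, str], float]        # (answer, reference) -> score in [0, 1]
TaskModel = Callable[[str, str], str]      # (prompt, question) -> answer

SEED = "Answer the legal question."        # hypothetical baseline prompt


def toy_task_model(prompt: str, question: str) -> str:
    """Hypothetical stand-in for an LLM call: the answer reflects prompt cues."""
    parts = [f"Answer to: {question}"]
    if "cite" in prompt:
        parts.append("[statute cited]")
    if "reason" in prompt:
        parts.append("[reasoning given]")
    return " ".join(parts)


def lenient_judge(answer: str, reference: str) -> float:
    """Invented judge: partial credit for each element present."""
    return 0.5 + 0.25 * ("[statute cited]" in answer) + 0.25 * ("[reasoning given]" in answer)


def strict_judge(answer: str, reference: str) -> float:
    """Invented judge: full credit only when both elements are present."""
    return 1.0 if "[statute cited]" in answer and "[reasoning given]" in answer else 0.0


def propose_edits(prompt: str) -> List[str]:
    """Stand-in for feedback-driven rewrites (an LLM would produce these)."""
    return [prompt + " cite sources.", prompt + " reason step by step."]


def mean_score(prompt: str, model: TaskModel, judge: Judge, data: Dataset) -> float:
    """Average judge score of the model's answers under a given prompt."""
    return sum(judge(model(prompt, q), ref) for q, ref in data) / len(data)


def optimize_prompt(seed: str, model: TaskModel, judge: Judge, train: Dataset,
                    rounds: int = 3, beam: int = 2) -> str:
    """Beam search: expand each kept prompt, keep the top `beam` by judge score."""
    frontier = [seed]
    for _ in range(rounds):
        pool = set(frontier)
        for p in frontier:
            pool.update(propose_edits(p))
        # Sort by descending score, breaking ties by text for determinism.
        frontier = sorted(pool, key=lambda p: (-mean_score(p, model, judge, train), p))[:beam]
    return frontier[0]


def transfer_matrix(seed: str, judges: Dict[str, Judge], model: TaskModel,
                    train: Dataset, test: Dataset) -> Dict[str, Dict[str, float]]:
    """Rows: judge used for optimization. Columns: judge used for evaluation."""
    matrix = {}
    for opt_name, opt_judge in judges.items():
        best = optimize_prompt(seed, model, opt_judge, train)
        matrix[opt_name] = {name: mean_score(best, model, j, test)
                            for name, j in judges.items()}
    return matrix


if __name__ == "__main__":
    judges = {"lenient": lenient_judge, "strict": strict_judge}
    train = [("Q1", "R1"), ("Q2", "R2")]
    held_out = [("Q3", "R3")]
    print(transfer_matrix(SEED, judges, toy_task_model, train, held_out))
```

The off-diagonal cells of the returned matrix are the cross-judge transfer scores; with real judges, comparing them against the diagonal is one way to check for the judge-specific overfitting the abstract mentions.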
