Fugu-MT Paper Translation (Abstract): On the Shelf Life of Fine-Tuned LLM Judges: Future Proofing, Backward Compatibility, and Question Generalization

Paper overview: On the Shelf Life of Fine-Tuned LLM Judges: Future Proofing, Backward Compatibility, and Question Generalization

arxiv url: http://arxiv.org/abs/2509.23542v1
Date: Sun, 28 Sep 2025 00:43:52 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.280567
Title: On the Shelf Life of Fine-Tuned LLM Judges: Future Proofing, Backward Compatibility, and Question Generalization
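Title (reference translation): The Shelf Life of Fine-Tuned LLM Judges: Future Proofing, Backward Compatibility, and Question Generalization
Authors: Janvijay Singh, Austin Xu, Yilun Zhou, Yefan Zhou, Dilek Hakkani-Tur, Shafiq Joty
Abstract summary: We formalize three aspects that affect the shelf life of fine-tuned judges. Experiments suggest that future proofing is challenging for most models. Continual learning provides a more balanced adaptation to shifts between older and newer response distributions.
Reference score (independently computed attention metric): 46.240395528043365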
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The LLM-as-a-judge paradigm is widely used in both evaluating free-text model responses and reward modeling for model alignment and finetuning. Recently, finetuning judges with judge-specific data has emerged as an often preferred choice over directly prompting frontier models as judges, as the former achieves better performance with smaller model sizes while being more robust to common biases. However, the standard evaluation ignores several practical concerns of finetuned judges regarding their real world deployment. In this paper, we identify and formalize three aspects that affect the shelf life of these judges: future proofing and backward compatibility -- how well judges finetuned on responses by today's generator models perform on responses by future models or past models, as well as question generalization -- how well judges generalize to unseen questions at test time. We study these three aspects in the math domain under a unified framework with varying train and test distributions, three SFT- and DPO-based finetuning algorithms and three different base models. Experiments suggest that future-proofing is challenging for most models, while backward compatibility is relatively easy, with DPO-trained models consistently improving performance. We further find that continual learning provides a more balanced adaptation to shifts between older and newer response distributions than training solely on stronger or weaker responses. Moreover, all models observe certain degrees of performance degradation when moving from questions seen during training to unseen ones, showing that current judges do not fully generalize to unseen questions. These findings provide insights into practical considerations for developing and deploying judge models in the face of ever-changing generators.
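Abstract (reference translation): The LLM-as-a-judge paradigm is widely used both for evaluating free-text model responses and for reward modeling in model alignment and fine-tuning. Recently, fine-tuning judges on judge-specific data has emerged as an often preferred choice over directly prompting frontier models as judges. However, standard evaluation ignores several practical concerns about deploying fine-tuned judges in the real world. In this paper, we identify and formalize three aspects that affect the shelf life of these judges: future proofing and backward compatibility (how well judges fine-tuned on responses from today's generator models perform on responses from future or past models) and question generalization (how well judges generalize to unseen questions at test time). We study these three aspects in a unified framework with varying train and test distributions, three SFT- and DPO-based fine-tuning algorithms, and three different base models. Experiments suggest that future proofing is challenging for most models, while backward compatibility is relatively easy, with DPO-trained models consistently improving performance. Furthermore, continual learning provides a more balanced adaptation to shifts between older and newer response distributions. Moreover, all models show some degree of performance degradation when moving from questions seen during training to unseen ones, indicating that current judges do not fully generalize to unseen questions. These findings provide insight into practical considerations for developing and deploying judge models in the face of ever-changing generators.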
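
The three shelf-life aspects named in the abstract can be read as accuracy comparisons across evaluation splits. The Python sketch below is only an illustration of that reading, not the paper's actual protocol or metric: the `JudgedItem` record, the era labels ("past", "current", "future"), and the gap definitions (split accuracy minus in-distribution accuracy) are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class JudgedItem:
    """One judge verdict on one generator response (hypothetical record layout)."""
    question_id: str
    generator_era: str       # assumed labels: "past", "current", or "future"
    seen_in_training: bool   # was this question in the judge's training set?
    judge_correct: bool      # did the judge's verdict match the ground truth?


def accuracy(items: List[JudgedItem], keep: Callable[[JudgedItem], bool]) -> float:
    """Fraction of correct verdicts among the items selected by `keep`."""
    selected = [it for it in items if keep(it)]
    if not selected:
        raise ValueError("no items match the requested split")
    return sum(it.judge_correct for it in selected) / len(selected)


def shelf_life_report(items: Iterable[JudgedItem]) -> Dict[str, float]:
    """Compare each shifted split against the in-distribution split.

    Assumption: a gap is (split accuracy) - (in-distribution accuracy),
    so a negative value means the judge degrades under that shift.
    """
    items = list(items)
    in_dist = accuracy(items, lambda it: it.generator_era == "current" and it.seen_in_training)
    future = accuracy(items, lambda it: it.generator_era == "future" and it.seen_in_training)
    past = accuracy(items, lambda it: it.generator_era == "past" and it.seen_in_training)
    unseen = accuracy(items, lambda it: it.generator_era == "current" and not it.seen_in_training)
    return {
        "in_distribution": in_dist,
        "future_proofing_gap": future - in_dist,
        "backward_compatibility_gap": past - in_dist,
        "question_generalization_gap": unseen - in_dist,
    }
```

Holding the question set fixed while changing the generator era, and holding the era fixed while changing the question set, keeps the response-distribution shift separate from the question shift. This is one plausible way to organize such a comparison under the assumptions stated above.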

Related papers