Fugu-MT Paper Translation (Summary): Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

Paper summary: Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

arxiv url: http://arxiv.org/abs/2605.00754v1
Date: Fri, 01 May 2026 16:07:34 GMT
Status: Translation complete
System update date: 2026-05-04 17:43:29.009929
Title: Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring
Authors: Indraneil Paul, Goran Glavaš, Iryna Gurevych
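Abstract summary: Themis-CodePreference is the largest open-source collection of code preferences to date, and it is used to train Themis-RM, a suite of multilingual code reward models. Experiments and ablations show positive scaling trends and strong cross-lingual transfer when training on diverse preferences.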
Reference score (independently computed attention score): 49.937275213222186
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparatively sparse, with existing work largely focusing on execution feedback. This choice constrains post-training to optimizing functional correctness over self-contained executable code. In this work, we examine the training and evaluation of multilingual, multi-criteria code RMs. To this end, we first compile Themis-CodeRewardBench, a benchmark to evaluate code RMs across five preference dimensions (i.e., criteria) and eight programming languages, on which we profile 50+ code, math, and general-purpose RMs. Observing the limited proficiency of current RMs beyond scoring for functional correctness, we develop Themis-CodePreference, the largest open-source collection of code preferences to date (more than 350k preference pairs), and use it to train Themis-RM, a suite of multilingual code reward models for flexible multi-criteria scoring, ranging in size from 600M to 32B parameters. Our experiments and ablations demonstrate positive scaling trends, strong cross-lingual transfer when training on diverse preferences, and the importance of multi-criteria training for reliable code reward modeling.
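The abstract describes profiling reward models on preference pairs across criteria and programming languages. As a rough illustration only, the sketch below computes pairwise preference accuracy per (criterion, language) bucket: a pair counts as correct when the model scores the preferred response higher than the rejected one. The `PreferencePair` fields, the `score` callable signature, and the `toy_score` function are assumptions made for this example; they are not taken from the paper or from Themis-CodeRewardBench.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple, Tuple


class PreferencePair(NamedTuple):
    """One preference pair (hypothetical schema, not the paper's format)."""
    prompt: str
    chosen: str      # response preferred under the given criterion
    rejected: str    # response dispreferred under the given criterion
    criterion: str   # e.g. "functional_correctness"
    language: str    # e.g. "python"


# Assumed scorer signature: (prompt, response, criterion) -> scalar reward.
ScoreFn = Callable[[str, str, str], float]


def pairwise_accuracy(
    pairs: Iterable[PreferencePair], score: ScoreFn
) -> Dict[Tuple[str, str], float]:
    """Fraction of pairs per (criterion, language) where chosen outscores rejected."""
    hits: Dict[Tuple[str, str], int] = defaultdict(int)
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for pair in pairs:
        key = (pair.criterion, pair.language)
        totals[key] += 1
        chosen_score = score(pair.prompt, pair.chosen, pair.criterion)
        rejected_score = score(pair.prompt, pair.rejected, pair.criterion)
        if chosen_score > rejected_score:  # ties count as misses
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


def toy_score(prompt: str, response: str, criterion: str) -> float:
    """Placeholder scorer that prefers shorter responses; stands in for a real RM."""
    return -float(len(response))
```

A real evaluation would replace `toy_score` with a call to a trained reward model; reporting accuracy per bucket rather than one global number is what makes weaknesses on a single criterion or language visible.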

Related papers