Fugu-MT Paper Translation (Abstract): Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO

Paper overview: Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO

arxiv url: http://arxiv.org/abs/2509.13081v1
Date: Tue, 16 Sep 2025 13:39:29 GMT
Status: Translation complete
Last updated in system: 2025-09-17 17:50:53.109098
Title: Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO
Authors: Francesco Pappone, Ruggero Marino Lazzaroni, Federico Califano, Niccolò Gentile, Roberto Marras
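Abstract summary: We introduce a novel approach to reward shaping within the Group Relative Policy Optimisation framework. Our central contribution is the use of a small, efficient encoder-only transformer as a semantic reward model. We apply this method to the task of training a model for the Italian medical-school entrance examinations.
Reference score (independently computed attention score): 0.0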
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While Large Language Models (LLMs) excel at generating human-like text, aligning their outputs with complex, qualitative goals like pedagogical soundness remains a significant challenge. Standard reinforcement learning techniques often rely on slow and expensive LLM-as-a-judge evaluations or on brittle, keyword-based metrics like ROUGE, which fail to capture the semantic essence of a high-quality explanation. In this work, we introduce a novel approach to reward shaping within the Group Relative Policy Optimisation (GRPO) framework. Our central contribution is the use of a small, efficient encoder-only transformer as a semantic reward model. This model provides a dense, semantically rich reward signal based on the cosine similarity between a generated explanation and a ground-truth reference, guiding the policy towards explanations that are not just factually correct but also structurally and conceptually aligned with expert reasoning. We apply this method to the task of training a model for the Italian medical-school entrance examinations, following standard domain-adaptive continued pre-training (CPT) and supervised fine-tuning (SFT). Our results demonstrate that GRPO with our proposed semantic reward significantly improves explanation faithfulness and clarity over a strong SFT baseline, showcasing the power of using lightweight encoder models for nuanced reward shaping in complex generation tasks.
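
The reward described in the abstract (cosine similarity between a generated explanation and a reference, used inside GRPO) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `toy_embed` is a hypothetical bag-of-words stand-in for the encoder-only transformer, and the advantage step uses the commonly cited GRPO normalisation (reward minus group mean, divided by group standard deviation). The abstract does not give the exact formulation used in the paper.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an encoder-only transformer.

    Maps text to a sparse bag-of-words vector. A real setup would
    return a dense sentence embedding from a trained encoder instead.
    """
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_rewards(candidates, reference, embed=toy_embed):
    """Score each sampled explanation against the reference explanation."""
    ref_vec = embed(reference)
    return [cosine_similarity(embed(c), ref_vec) for c in candidates]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of samples for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    reference = "the mitochondrion produces atp"
    group = [
        "the mitochondrion produces atp",        # matches the reference
        "atp is produced in the mitochondrion",  # partial overlap
        "completely unrelated words here",       # no overlap
    ]
    rewards = semantic_rewards(group, reference)
    for text, adv in zip(group, group_relative_advantages(rewards)):
        print(f"{adv:+.3f}  {text}")
```

Because the advantages are computed relative to the other samples in the same group, an explanation is reinforced only to the extent that it is closer to the reference than its siblings, which is what makes a dense similarity score usable as a shaping signal.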
