Fugu-MT Paper Translation (Abstract): PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs

Paper overview: PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs

arxiv url: http://arxiv.org/abs/2605.01123v1
Date: Fri, 01 May 2026 21:49:20 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:49.594876
Title: PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs
Title (reference translation): PERSA: Reinforcement Learning for Professor-Style Personalized Feedback Using LLMs
Authors: Ravi Ranjan, Utkarsh Grover, Xiaomin Lin, Agoritsa Polyzou
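Abstract summary: This work studies how Reinforcement Learning from Human Feedback can adapt a transformer-based LLM to generate programming feedback that matches a professor's grading voice. It introduces PERSA, an RLHF pipeline that combines supervised fine-tuning on professor demonstrations, reward modeling from pairwise preferences, and Proximal Policy Optimization. The proposed approach is evaluated on three code-feedback benchmarks (APPS, PyFiXV, CodeReviewQA) using complementary metrics for style alignment and fidelity.
Reference score (independently computed attention score): 1.8986796884429726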
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) can provide automated feedback in educational settings, but aligning an LLM's style with a specific instructor's tone while maintaining diagnostic correctness remains challenging. We ask: how can we update an LLM for automated feedback generation to align with a target instructor's style without sacrificing core knowledge? We study how Reinforcement Learning from Human Feedback (RLHF) can adapt a transformer-based LLM to generate programming feedback that matches a professor's grading voice. We introduce PERSA, an RLHF pipeline that combines supervised fine-tuning on professor demonstrations, reward modeling from pairwise preferences, and Proximal Policy Optimization (PPO), while deliberately constraining learning to style-bearing components. Motivated by analyses of transformer internals, PERSA applies parameter-efficient fine-tuning: it updates only the top transformer blocks and their feed-forward projections, minimizing global parameter drift while increasing stylistic controllability. We evaluate our proposed approach on three code-feedback benchmarks (APPS, PyFiXV, and CodeReviewQA) using complementary metrics for style alignment and fidelity. Across both Llama-3 and Gemma-2 backbones, PERSA delivers the strongest professor-style transfer while retaining correctness. For example, on APPS, it boosts Style Alignment Score (SAC) to 96.2% (from 34.8% for Base) with Correctness Accuracy (CA) up to 100% on Llama-3 and Gemma-2. Overall, PERSA offers a practical route to personalized educational feedback by aligning both what it says (content correctness) and, crucially, how it says it (instructor-like tone and structure).
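The abstract states that only the top transformer blocks and their feed-forward projections are updated. A minimal sketch of that selection step follows, written in plain Python over parameter names. The naming scheme (`layers.<index>.mlp...`), the helper names `select_trainable` and `toy_param_names`, and the choice of `top_k` are all assumptions for illustration; the abstract does not say how many blocks PERSA unfreezes or how its parameters are named.

```python
import re

# Assumed (hypothetical) naming scheme: "layers.<index>.<submodule>.<param>".
LAYER_PATTERN = re.compile(r"layers\.(\d+)\.")


def select_trainable(param_names, num_layers, top_k, ffn_marker="mlp"):
    """Return the names of feed-forward parameters in the top `top_k` blocks.

    Everything else (embeddings, attention, lower blocks, output head)
    is left out of the result, i.e. it would stay frozen.
    """
    first_trainable = num_layers - top_k
    selected = []
    for name in param_names:
        match = LAYER_PATTERN.search(name)
        if match is None:
            continue  # not inside a transformer block
        layer_index = int(match.group(1))
        if layer_index >= first_trainable and f".{ffn_marker}." in name:
            selected.append(name)
    return selected


def toy_param_names(num_layers):
    """Build a small, made-up parameter list for demonstration only."""
    names = ["embed_tokens.weight"]
    for i in range(num_layers):
        names.append(f"layers.{i}.self_attn.q_proj.weight")
        names.append(f"layers.{i}.mlp.up_proj.weight")
        names.append(f"layers.{i}.mlp.down_proj.weight")
    names.append("lm_head.weight")
    return names


if __name__ == "__main__":
    for name in select_trainable(toy_param_names(4), num_layers=4, top_k=2):
        print(name)
```

In a real training setup the returned names would typically drive which parameters receive gradients, with all others frozen; that wiring depends on the framework and is not shown here.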
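Abstract (reference translation): Large language models (LLMs) can provide automated feedback in educational settings, but matching an LLM's style to a specific instructor's tone while preserving diagnostic correctness is difficult. We ask how an LLM for automated feedback generation can be updated to match a target instructor's style without sacrificing core knowledge. This work studies how Reinforcement Learning from Human Feedback (RLHF) can adapt a transformer-based LLM to generate programming feedback that matches a professor's grading voice. It introduces PERSA, an RLHF pipeline that combines supervised fine-tuning on professor demonstrations, reward modeling from pairwise preferences, and Proximal Policy Optimization (PPO), while restricting learning to style-bearing components. Motivated by analyses of transformer internals, PERSA applies parameter-efficient fine-tuning. It updates only the top transformer blocks and their feed-forward projections, minimizing global parameter drift while increasing stylistic controllability. The proposed approach is evaluated on three code-feedback benchmarks (APPS, PyFiXV, CodeReviewQA) using complementary metrics for style alignment and fidelity. Across both Llama-3 and Gemma-2 backbones, PERSA delivers the strongest professor-style transfer while retaining correctness. For example, on APPS, Style Alignment Score (SAC) rises to 96.2% (from 34.8% for Base), with Correctness Accuracy (CA) up to 100% on Llama-3 and Gemma-2. PERSA offers a practical route to personalized educational feedback by aligning both what it says (content correctness) and how it says it (instructor-like tone and structure).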
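The abstract names two training signals: reward modeling from pairwise preferences and PPO. The sketch below shows the textbook scalar forms of both, namely the Bradley-Terry style pairwise loss and the PPO clipped surrogate. These are the commonly used formulations and are an assumption here; the abstract does not give PERSA's exact losses, and the function names and the default `clip_eps=0.2` are illustrative.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected)).

    Computed as softplus(-margin) in a numerically stable way.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO clipped surrogate for a single sample (to be maximized).

    `ratio` is pi_new(a|s) / pi_old(a|s); `advantage` is the advantage estimate.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # The loss shrinks as the preferred response is scored higher.
    print(pairwise_preference_loss(2.0, 0.0))
    print(pairwise_preference_loss(0.0, 2.0))
    # A large policy ratio is clipped, which limits the update size.
    print(ppo_clipped_objective(1.5, 1.0))
```

The clipping term limits how far each update moves the policy from the one that generated the samples, which fits the abstract's stated goal of changing style without drifting from core knowledge.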


Related papers