Fugu-MT Paper Translation (Abstract): CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment

Paper summary: CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment

arxiv url: http://arxiv.org/abs/2510.18471v1
Date: Tue, 21 Oct 2025 09:48:06 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:13.339716
Title: CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment
Title (reference translation): CodeRL+: Improving Code Generation through Reinforcement with Execution Semantics Alignment
Authors: Xue Jiang, Yihong Dong, Mengyang Liu, Hongyi Deng, Tian Wang, Yongding Tao, Rongyu Cao, Binhua Li, Zhi Jin, Wenpin Jiao, Fei Huang, Yongbin Li, Ge Li
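Abstract summary: Large Language Models (LLMs) excel at code generation by learning from vast code corpora. A fundamental semantic gap remains between their training on textual patterns and the goal of functional correctness. The authors propose CodeRL+, a novel approach that integrates execution semantics alignment into the RLVR training pipeline for code generation.
Reference score (independently computed attention score): 98.87395842351627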
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While Large Language Models (LLMs) excel at code generation by learning from vast code corpora, a fundamental semantic gap remains between their training on textual patterns and the goal of functional correctness, which is governed by formal execution semantics. Reinforcement Learning with Verifiable Rewards (RLVR) approaches attempt to bridge this gap using outcome rewards from executing test cases. However, solely relying on binary pass/fail signals is inefficient for establishing a well-aligned connection between the textual representation of code and its execution semantics, especially for subtle logical errors within the code. In this paper, we propose CodeRL+, a novel approach that integrates execution semantics alignment into the RLVR training pipeline for code generation. CodeRL+ enables the model to infer variable-level execution trajectory, providing a direct learning signal of execution semantics. CodeRL+ can construct execution semantics alignment directly using existing on-policy rollouts and integrates seamlessly with various RL algorithms. Extensive experiments demonstrate that CodeRL+ outperforms post-training baselines (including RLVR and Distillation), achieving a 4.6% average relative improvement in pass@1. CodeRL+ generalizes effectively to other coding tasks, yielding 15.5% and 4.4% higher accuracy on code-reasoning and test-output-generation benchmarks, respectively. CodeRL+ shows strong applicability across diverse RL algorithms and LLMs. Furthermore, probe analyses provide compelling evidence that CodeRL+ strengthens the alignment between code's textual representations and its underlying execution semantics.
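Abstract (reference translation): Large Language Models (LLMs) excel at code generation by learning from vast code corpora, but a fundamental semantic gap remains between training on textual patterns and the goal of functional correctness, which is governed by formal execution semantics. Reinforcement Learning with Verifiable Rewards (RLVR) approaches try to close this gap using outcome rewards obtained by executing test cases. However, relying only on binary pass/fail signals is inefficient for establishing a well-aligned connection between the textual representation of code and its execution semantics, especially for subtle logical errors in the code. This paper proposes CodeRL+, a novel approach that integrates execution semantics alignment into the RLVR training pipeline for code generation. CodeRL+ enables the model to infer the variable-level execution trajectory, providing a direct learning signal of execution semantics. CodeRL+ can construct execution semantics alignment directly from existing on-policy rollouts and integrates seamlessly with various RL algorithms. Extensive experiments show that CodeRL+ outperforms post-training baselines (including RLVR and distillation), achieving a 4.6% average relative improvement in pass@1. CodeRL+ generalizes effectively to other coding tasks, yielding 15.5% and 4.4% higher accuracy on code-reasoning and test-output-generation benchmarks, respectively. CodeRL+ shows strong applicability across diverse RL algorithms and LLMs. Furthermore, probe analyses provide compelling evidence that CodeRL+ strengthens the alignment between the textual representations of code and its underlying execution semantics.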
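The abstract contrasts a binary pass/fail reward with a variable-level execution trajectory. The Python sketch below illustrates that contrast only. It is not the authors' implementation, and the names `trace_variables`, `binary_reward`, and `running_sum` are hypothetical. It assumes that a "variable-level execution trajectory" can be approximated by snapshots of local variables taken with the standard `sys.settrace` hook.

```python
import sys


def trace_variables(func, *args):
    """Run func(*args) and record a snapshot of its local variables
    at each executed line (taken just before the line runs)."""
    trajectory = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Only follow the frame of the function under test.
        if frame.f_code is not code:
            return None
        if event == "line":
            relative_line = frame.f_lineno - code.co_firstlineno
            trajectory.append((relative_line, dict(frame.f_locals)))
        elif event == "return":
            trajectory.append(("return", dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        # Always restore whatever trace hook was active before.
        sys.settrace(previous)
    return result, trajectory


def binary_reward(result, expected):
    """Outcome-only signal: 1.0 if the test passes, else 0.0."""
    return 1.0 if result == expected else 0.0


def running_sum(values):
    # Hypothetical program under test.
    total = 0
    for v in values:
        total += v
    return total


result, trajectory = trace_variables(running_sum, [1, 2])
print("reward:", binary_reward(result, 3))
for step, snapshot in trajectory:
    print(step, snapshot)
```

The reward collapses the whole run into one number, while the trajectory shows how `total` evolves step by step. Under the assumption above, that intermediate state is the kind of signal the abstract describes the model learning to infer.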

Related papers