Fugu-MT Paper Translation (Abstract): N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization

Paper summary: N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization

arxiv url: http://arxiv.org/abs/2606.10768v1
Date: Tue, 09 Jun 2026 12:21:27 GMT
Status: Translation complete
Last updated in system: 2026-06-10 15:40:58.483363
Title: N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization
Title (reference translation): N-GRPO: Embedding-Level Neighbor Mixing for Policy Optimization
Authors: Xukun Zhu, Hang Yu, Peng Di, Linchao Zhu
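Abstract summary: We introduce N-GRPO, a new exploration strategy integrated into the Group Relative Policy Optimization framework. Rather than relying on token-level sampling or native embedding-level noise, it leverages Semantic Neighbor Mixing. N-GRPO achieves consistent improvements over strong baselines on math reasoning benchmarks and also shows robust generalization on out-of-distribution tasks.
Reference score (independently computed attention score): 55.14402862283128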
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The success of Large Language Models in mathematical reasoning relies heavily on the generation of diverse and valid solution paths during the rollout phase. However, current rollout techniques face a fundamental trade-off: token-level sampling often yields redundant trajectories that differ only in rephrasing, while embedding-level methods utilizing random noise frequently disrupt semantic consistency. To resolve this, we introduce N-GRPO, a novel exploration strategy integrated into the Group Relative Policy Optimization (GRPO) framework. Rather than relying on token-level sampling or native embedding-level noise, our approach leverages Semantic Neighbor Mixing. This mechanism dynamically constructs input representations by mixing the embeddings of an anchor token and its nearest semantic neighbors, thereby injecting diversity while strictly adhering to the local semantic manifold. Experimental evaluations on the DeepSeek-R1-Distill-Qwen models across different sizes show that N-GRPO not only achieves consistent improvements over strong baselines on math reasoning benchmarks but also exhibits robust generalization capabilities on out-of-distribution tasks.
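Abstract (reference translation): The success of large language models in mathematical reasoning depends heavily on generating diverse and valid solution paths during the rollout phase. However, current rollout techniques face a fundamental trade-off: token-level sampling often produces redundant trajectories that differ only in rephrasing, while embedding-level methods that use random noise often disrupt semantic consistency. We therefore introduce N-GRPO, a new exploration strategy built into the Group Relative Policy Optimization (GRPO) framework. Rather than relying on token-level sampling or native embedding-level noise, it leverages Semantic Neighbor Mixing. This mechanism dynamically constructs input representations by mixing the embeddings of an anchor token and its nearest semantic neighbors, thereby injecting diversity while adhering strictly to the local semantic manifold. Experimental evaluation of DeepSeek-R1-Distill-Qwen models of different sizes shows that N-GRPO not only achieves consistent improvements over strong baselines on math reasoning benchmarks but also demonstrates robust generalization on out-of-distribution tasks.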
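
The abstract describes Semantic Neighbor Mixing only at a high level: the input representation is built by mixing the embedding of an anchor token with those of its nearest semantic neighbors. The sketch below is one possible reading of that idea, not the authors' implementation. Everything specific in it is an assumption made for illustration: cosine similarity as the neighbor metric, the neighbor count `k`, the anchor weight `alpha`, the random neighbor weights, and the names `nearest_neighbors`, `mix_with_neighbors`, and `toy_table`.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def nearest_neighbors(table, anchor_id, k):
    """Ids of the k tokens most similar to the anchor (anchor excluded).

    Assumption: similarity is cosine similarity over the embedding table.
    """
    anchor = table[anchor_id]
    others = [i for i in range(len(table)) if i != anchor_id]
    others.sort(key=lambda i: cosine(anchor, table[i]), reverse=True)
    return others[:k]


def mix_with_neighbors(table, anchor_id, k=2, alpha=0.8, rng=None):
    """Return alpha * anchor + (1 - alpha) * (weighted average of neighbors).

    Assumption: neighbor weights are drawn at random and normalized to
    sum to 1, so the result is a convex combination of real embeddings.
    """
    rng = rng or random.Random()
    neighbors = nearest_neighbors(table, anchor_id, k)
    raw = [rng.random() + 1e-9 for _ in neighbors]
    total = sum(raw)
    weights = [w / total for w in raw]
    anchor = table[anchor_id]
    mixed = []
    for d in range(len(anchor)):
        neighbor_part = sum(w * table[n][d] for w, n in zip(weights, neighbors))
        mixed.append(alpha * anchor[d] + (1.0 - alpha) * neighbor_part)
    return mixed


# Toy 2-D embedding table (hypothetical): tokens 0-2 point in similar
# directions, token 3 points the opposite way.
toy_table = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [-1.0, 0.2]]
mixed = mix_with_neighbors(toy_table, anchor_id=0, k=2, alpha=0.8,
                           rng=random.Random(0))
```

Because the mixed vector is a convex combination of the anchor and its closest neighbors, it stays within the region spanned by those embeddings, whereas unconstrained random noise can push the input in any direction. This is the sense in which such a scheme could add diversity while staying close to the local semantic neighborhood.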


Related papers