Fugu-MT Paper Translation (Abstract): Hide to Guide: Learning via Semantic Masking

Paper abstract: Hide to Guide: Learning via Semantic Masking

arxiv url: http://arxiv.org/abs/2605.25198v1
Date: Sun, 24 May 2026 17:59:06 GMT
Status: Translation complete
Last updated in system: 2026-05-26 19:50:18.962952
Title: Hide to Guide: Learning via Semantic Masking
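Title (reference translation): Hide to Guide: Learning via Semantic Masking
Authors: Ruitao Liu, Qinghao Hu, Alex Hu, Yecheng Wu, Shang Yang, Luke J. Huang, Zhuoyang Zhang, Han Cai, Song Han
Abstract summary: This paper proposes a semantic masking strategy for expert-guided RLVR. Instead of coarsely truncating traces, SMEPO masks reward-relevant semantic spans along the critical path. It improves accuracy by up to 3.2 points over GRPO and reduces training time by up to 4.2x.
Reference score (independently computed attention score): 28.55894056629788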
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a powerful paradigm for improving language models on reasoning-intensive tasks, but its effectiveness is often limited by exploration. For example, models often fail on hard problems, leaving little useful reward signal. External expert traces offer a natural source of guidance, yet they may also expose reward-relevant content along the critical path to the verifier target, such as final answers, intermediate values, executable implementations, or answer-related entities. This content can create an unintended reward hacking channel, allowing the policy to obtain reward by copying the trace rather than learning the underlying reasoning or agentic behavior. Existing guided-RL methods reduce this risk by using partial trajectories, but they mainly control how much expert information is shown heuristically rather than which parts should be hidden. To this end, we propose Semantic Masked Expert Policy Optimization (SMEPO), a fine-grained semantic masking strategy for expert-guided RLVR. Instead of truncating traces coarsely or revealing them unchanged, SMEPO masks reward-relevant semantic spans along the critical path while preserving the expert's decomposition, plan, and procedural structure. This turns hard problems from reasoning from scratch into a fill-in-the-blank process: the policy can follow the expert's problem-solving route, but must still reconstruct the missing values, code, or entities by itself. SMEPO is simple to apply and requires no changes to the reward function or RL objective. Across diverse domains, including math, code, and agentic search, SMEPO improves accuracy by up to 3.2 points over GRPO and reduces training time by up to 4.2x. The code is available at https://github.com/mit-han-lab/SMEPO.
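Abstract (reference translation): Reinforcement learning with verifiable rewards (RLVR) has become a powerful paradigm for improving language models on reasoning-intensive tasks, but its effectiveness is often limited by exploration. For example, models often fail on hard problems, which leaves little useful reward signal. External expert traces are a natural source of guidance, but they can also expose reward-relevant content along the critical path to the verifier target, such as final answers, intermediate values, executable implementations, or answer-related entities. This content can create an unintended reward-hacking channel, letting the policy obtain reward by copying the trace rather than learning the underlying reasoning or agentic behavior. Existing guided-RL methods reduce this risk by using partial trajectories, but they mainly control, heuristically, how much expert information is shown rather than which parts should be hidden. To this end, we propose Semantic Masked Expert Policy Optimization (SMEPO), a fine-grained semantic masking strategy for expert-guided RLVR. Instead of coarsely truncating traces or revealing them unchanged, SMEPO masks reward-relevant semantic spans along the critical path while preserving the expert's decomposition, plan, and procedural structure. This turns hard problems from reasoning from scratch into a fill-in-the-blank process: the policy can follow the expert's problem-solving route, but must still reconstruct the missing values, code, or entities by itself. SMEPO is simple to apply and requires no changes to the reward function or the RL objective. Across diverse domains, including math, code, and agentic search, SMEPO improves accuracy by up to 3.2 points over GRPO and reduces training time by up to 4.2x. The code is available at https://github.com/mit-han-lab/SMEPO.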
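
To make the fill-in-the-blank idea in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation (see the linked repository for that). It assumes the reward-relevant strings are already known and supplied as a list; how SMEPO actually identifies those spans is not described in the abstract. The names `mask_spans`, `build_guided_prompt`, and the `[MASK]` placeholder are hypothetical and chosen only for illustration.

```python
import re
from typing import Sequence

MASK_TOKEN = "[MASK]"  # hypothetical placeholder; the abstract does not name one


def mask_spans(trace: str, reward_relevant: Sequence[str],
               mask_token: str = MASK_TOKEN) -> str:
    """Replace each occurrence of a reward-relevant string with mask_token.

    Assumption: the caller already knows which strings leak the answer
    (final answers, intermediate values, and so on).
    """
    # Drop empty strings, which would otherwise match at every position.
    spans = {s for s in reward_relevant if s}
    if not spans:
        return trace
    # Try longer strings first so "120" is masked whole, not as "12" + "0".
    ordered = sorted(spans, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in ordered))
    return pattern.sub(mask_token, trace)


def build_guided_prompt(question: str, trace: str,
                        reward_relevant: Sequence[str]) -> str:
    """Attach the masked expert trace to the question as guidance."""
    masked = mask_spans(trace, reward_relevant)
    return (
        f"{question}\n\n"
        f"Expert outline (some spans hidden):\n{masked}\n\n"
        "Fill in the hidden spans and solve the problem."
    )


if __name__ == "__main__":
    trace = ("Step 1: compute the area as 6 * 7 = 42. "
             "Step 2: double it to get 84. Final answer: 84.")
    print(mask_spans(trace, ["42", "84"]))
```

In this toy trace the decomposition ("compute the area", "double it") stays visible while the intermediate value and the final answer are hidden, so a policy conditioned on the masked trace cannot earn reward by copying it. Plain string matching is a simplification: it would also mask a listed value that appears inside a longer, unlisted number.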


Related papers