Fugu-MT Paper Translation (Abstract): Reference-guided Policy Optimization for Molecular Optimization via LLM Reasoning

Paper abstract: Reference-guided Policy Optimization for Molecular Optimization via LLM Reasoning

arxiv url: http://arxiv.org/abs/2603.05900v1
Date: Fri, 06 Mar 2026 04:39:08 GMT
Status: Translation complete
Last updated in system: 2026-03-09 13:17:45.087828
Title: Reference-guided Policy Optimization for Molecular Optimization via LLM Reasoning
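Title (reference translation): Reference-guided Policy Optimization for Molecular Optimization via LLM Reasoning
Authors: Xuan Li, Zhanke Zhou, Zongze Li, Jiangchao Yao, Yu Rong, Lu Zhang, Bo Han
Abstract summary: Large language models (LLMs) benefit from supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) in reasoning tasks. Answer-only SFT on reference molecules collapses reasoning, and RLVR provides sparse feedback under similarity constraints. This paper introduces Reference-guided Policy Optimization (RePO), an optimization approach that learns from reference molecules without requiring trajectory data.
Reference score (independently computed attention score): 58.644854860003704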
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) benefit substantially from supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) in reasoning tasks. However, these recipes perform poorly in instruction-based molecular optimization, where each data point typically provides only a single optimized reference molecule and no step-by-step optimization trajectory. We reveal that answer-only SFT on the reference molecules collapses reasoning, and RLVR provides sparse feedback under similarity constraints due to the model's lack of effective exploration, which slows learning and limits optimization. To encourage the exploration of new molecules while balancing the exploitation of the reference molecules, we introduce Reference-guided Policy Optimization (RePO), an optimization approach that learns from reference molecules without requiring trajectory data. At each update, RePO samples candidate molecules with their intermediate reasoning trajectories from the model and trains the model using verifiable rewards that measure property satisfaction under similarity constraints in an RL manner. Meanwhile, it applies reference guidance by keeping the policy's intermediate reasoning trajectory as context and training only the answer in a supervised manner. Together, the RL term promotes exploration, while the guidance term mitigates reward sparsity and stabilizes training by grounding outputs to references when many valid molecular edits exist. Across molecular optimization benchmarks, RePO consistently outperforms SFT and RLVR baselines (e.g., GRPO), achieving improvements on the optimization metric (Success Rate $\times$ Similarity), improving balance across competing objectives, and generalizing better to unseen instruction styles. Our code is publicly available at https://github.com/tmlr-group/RePO.
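Abstract (reference translation): Large language models (LLMs) benefit greatly from supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) in reasoning tasks. However, these recipes fall short in instruction-based molecular optimization, where each data point typically provides only a single optimized reference molecule and no step-by-step optimization trajectory. Answer-only SFT on the reference molecules collapses reasoning, and RLVR provides sparse feedback under similarity constraints because the model lacks effective exploration, which slows learning and limits optimization. To encourage exploration of new molecules while balancing exploitation of the reference molecules, the authors introduce Reference-guided Policy Optimization (RePO), an optimization approach that learns from reference molecules without requiring trajectory data. At each update, RePO samples candidate molecules together with their intermediate reasoning trajectories from the model and trains the model in an RL manner, using verifiable rewards that measure property satisfaction under similarity constraints. Meanwhile, it applies reference guidance by keeping the policy's intermediate reasoning trajectory as context and training only the answer in a supervised manner. Together, the RL term promotes exploration, while the guidance term mitigates reward sparsity and stabilizes training by grounding outputs to the references when many valid molecular edits exist. Across molecular optimization benchmarks, RePO consistently outperforms SFT and RLVR baselines (e.g., GRPO), improving the optimization metric (Success Rate $\times$ Similarity), improving the balance across competing objectives, and generalizing better to unseen instruction styles. The code is publicly available at https://github.com/tmlr-group/RePO.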
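The abstract describes RePO as a combination of an RL term and a reference-guidance term but gives no formulas. The following is a minimal sketch of how such a combined objective could look, not the authors' implementation. Everything named here is an assumption for illustration: the binary reward form, the similarity threshold `sim_threshold=0.4`, the GRPO-style group normalization, the `guidance_weight` parameter, and the function names themselves.

```python
from statistics import mean, pstdev


def verifiable_reward(property_ok: bool, similarity: float,
                      sim_threshold: float = 0.4) -> float:
    """Assumed binary reward: 1.0 only if the property target is met AND the
    candidate stays at least `sim_threshold` similar to the input molecule."""
    return 1.0 if property_ok and similarity >= sim_threshold else 0.0


def group_advantages(rewards, eps: float = 1e-6):
    """Assumed GRPO-style advantages: rewards normalized within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def repo_style_loss(seq_logps, rewards, ref_answer_logps,
                    guidance_weight: float = 1.0) -> float:
    """Hypothetical combined loss for one prompt.

    seq_logps:        log-probability of each sampled (reasoning + answer) sequence
    rewards:          verifiable reward of each sampled answer
    ref_answer_logps: log-probability of the reference answer, conditioned on the
                      prompt plus that sample's own reasoning trajectory
    """
    adv = group_advantages(rewards)
    # RL term: raise the likelihood of samples with above-average reward.
    rl_term = -mean([a * lp for a, lp in zip(adv, seq_logps)])
    # Guidance term: supervised negative log-likelihood on the answer only.
    guidance_term = -mean(ref_answer_logps)
    return rl_term + guidance_weight * guidance_term
```

In this sketch, when every sampled molecule fails the similarity constraint, all rewards are zero, the advantages vanish, and the RL term contributes nothing; the guidance term still supplies a training signal, which is one way to read the abstract's claim that guidance mitigates reward sparsity.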

Related papers