Fugu-MT Paper Translation (Abstract): Preserve Support, Not Correspondence: Dynamic Routing for Offline Reinforcement Learning

Paper abstract: Preserve Support, Not Correspondence: Dynamic Routing for Offline Reinforcement Learning

arxiv url: http://arxiv.org/abs/2604.22229v1
Date: Fri, 24 Apr 2026 05:07:09 GMT
Status: Translation complete
System update date: 2026-04-27 15:36:26.347514
Title: Preserve Support, Not Correspondence: Dynamic Routing for Offline Reinforcement Learning
Authors: Zhancun Mu, Guangyu Zhao, Yiwu Zhong, Chi Zhang
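Abstract summary: One-step offline RL actors are attractive because they avoid backpropagating through long iterative samplers. We propose DROL, a latent-conditioned one-step actor trained with top-1 dynamic routing.
Reference score (independently computed attention metric): 11.929005952313261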
License: http://creativecommons.org/licenses/by/4.0/
Abstract: One-step offline RL actors are attractive because they avoid backpropagating through long iterative samplers and keep inference cheap, but they still have to improve under a critic without drifting away from actions that the dataset can support. In recent one-step extraction pipelines, a strong iterative teacher provides one target action for each latent draw, and the same student output is asked to do both jobs: move toward higher Q and stay near that paired endpoint. If those two directions disagree, the loss resolves them as a compromise on that same sample, even when a nearby better action remains locally supported by the data. We propose DROL, a latent-conditioned one-step actor trained with top-1 dynamic routing. For each state, the actor samples $K$ candidate actions from a bounded latent prior, assigns each dataset action to its nearest candidate, and updates only that winner with Behavior Cloning and critic guidance. Because the routing is recomputed from the current candidate geometry, ownership of a supported region can shift across candidates over the course of learning. This gives a one-step actor room to make local improvements that pointwise extraction struggles to capture, while retaining single-pass inference at test time. On OGBench and D4RL, DROL is competitive with the one-step FQL baseline, improving many OGBench task groups while remaining strong on both AntMaze and Adroit. Project page: https://muzhancun.github.io/preprints/DROL.
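The routing step described in the abstract (sample $K$ candidates, assign the dataset action to its nearest candidate, update only that winner) can be illustrated with a small pure-Python sketch. This is not the authors' implementation: `toy_actor`, `toy_critic`, the uniform prior on [-1, 1], the squared-distance metric, and the weight `alpha` are all assumptions introduced here for illustration.

```python
import math
import random


def sample_latents(k, dim, rng):
    """Draw k latents from a bounded prior (uniform on [-1, 1]^dim is an assumption)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(k)]


def toy_actor(state, z, weight=0.5):
    """Hypothetical one-step actor: a fixed bounded map from (state, latent) to an action."""
    return [math.tanh(s + weight * zi) for s, zi in zip(state, z)]


def toy_critic(state, action):
    """Hypothetical critic: scores actions higher the closer they are to 0.3 per dimension."""
    return -sum((a - 0.3) ** 2 for a in action)


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def route_top1(candidates, dataset_action):
    """Return the index of the candidate nearest to the dataset action."""
    distances = [squared_distance(c, dataset_action) for c in candidates]
    return min(range(len(candidates)), key=distances.__getitem__)


def routed_loss(state, dataset_action, k=4, alpha=1.0, seed=0):
    """Loss for one (state, action) pair in which only the routed winner contributes."""
    rng = random.Random(seed)
    latents = sample_latents(k, len(state), rng)
    candidates = [toy_actor(state, z) for z in latents]
    winner = route_top1(candidates, dataset_action)
    # Behavior-cloning term: pull the winning candidate toward the dataset action.
    bc_term = squared_distance(candidates[winner], dataset_action)
    # Critic-guidance term: push the same candidate toward higher Q (assumed weight alpha).
    q_term = -toy_critic(state, candidates[winner])
    return winner, bc_term + alpha * q_term
```

In a real training loop the loss would be differentiated with respect to the actor's parameters, and the routing would be recomputed at every step from the current candidates, so the winner for a given dataset action may change as learning proceeds. For example, `route_top1([[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]], [0.9, 0.8])` selects index 1, the candidate closest to the dataset action.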
