Fugu-MT Paper Translation (Abstract): Does "Do Differentiable Simulators Give Better Policy Gradients?" Give Better Policy Gradients?

Paper summary: Does "Do Differentiable Simulators Give Better Policy Gradients?" Give Better Policy Gradients?

arxiv url: http://arxiv.org/abs/2604.18161v1
Date: Mon, 20 Apr 2026 12:23:48 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.85656
Title: Does "Do Differentiable Simulators Give Better Policy Gradients?" Give Better Policy Gradients?
Title (reference translation): Does "Do Differentiable Simulators Give Better Policy Gradients?" Give Better Policy Gradients?
Authors: Ku Onoda, Paavo Parmas, Manato Yaguchi, Yutaka Matsuo
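Abstract summary: Discontinuous dynamics cause bias and undermine the effectiveness of 1st-order estimators. The paper introduces DDCG, a lightweight test that switches estimators in nonsmooth regions. It also presents IVW-H, a per-step inverse-variance implementation that stabilizes variance without explicit discontinuity detection.
Reference score (attention score computed independently by this site): 25.53040167917892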
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In policy gradient reinforcement learning, access to a differentiable model enables 1st-order gradient estimation that accelerates learning compared to relying solely on derivative-free 0th-order estimators. However, discontinuous dynamics cause bias and undermine the effectiveness of 1st-order estimators. Prior work addressed this bias by constructing a confidence interval around the REINFORCE 0th-order gradient estimator and using these bounds to detect discontinuities. However, the REINFORCE estimator is notoriously noisy, and we find that this method requires task-specific hyperparameter tuning and has low sample efficiency. This paper asks whether such bias is the primary obstacle and what minimal fixes suffice. First, we re-examine standard discontinuous settings from prior work and introduce DDCG, a lightweight test that switches estimators in nonsmooth regions; with a single hyperparameter, DDCG achieves robust performance and remains reliable with small samples. Second, on differentiable robotics control tasks, we present IVW-H, a per-step inverse-variance implementation that stabilizes variance without explicit discontinuity detection and yields strong results. Together, these findings indicate that while estimator switching improves robustness in controlled studies, careful variance control often dominates in practical deployments.
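The bias that the abstract attributes to discontinuous dynamics can be illustrated on a toy problem. The following is a minimal sketch, not the paper's experimental setup: the scalar parameter `theta`, the noise scale `sigma`, and the Heaviside step reward are all hypothetical choices made for illustration. The 1st-order (pathwise) estimator differentiates through the step function and therefore returns zero almost everywhere, while the 0th-order (likelihood-ratio, REINFORCE-style) estimator remains unbiased but noisy.

```python
import math
import numpy as np


def step_reward(x):
    # Hypothetical discontinuous reward: 1 if x > 0, else 0.
    return (x > 0.0).astype(float)


def first_order_samples(theta, sigma, eps):
    # Pathwise derivative of step_reward(theta + sigma * eps) w.r.t. theta.
    # The step function has zero derivative wherever it is differentiable,
    # so every sample is 0 and the jump at x = 0 is never seen.
    return np.zeros_like(eps)


def zeroth_order_samples(theta, sigma, eps):
    # Likelihood-ratio estimator: reward * d/dtheta log N(x; theta, sigma^2),
    # where x = theta + sigma * eps and the score equals eps / sigma.
    x = theta + sigma * eps
    return step_reward(x) * eps / sigma


def true_gradient(theta, sigma):
    # E[step_reward(x)] = Phi(theta / sigma), so the gradient is the
    # Gaussian density phi(theta / sigma) / sigma.
    z = theta / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


rng = np.random.default_rng(0)
theta, sigma = 0.3, 1.0
eps = rng.standard_normal(200_000)

g_first = first_order_samples(theta, sigma, eps).mean()
g_zeroth = zeroth_order_samples(theta, sigma, eps).mean()
g_true = true_gradient(theta, sigma)

print("1st-order estimate:", g_first)
print("0th-order estimate:", g_zeroth)
print("analytic gradient: ", g_true)
```

In this toy case the 1st-order estimate stays at zero no matter how many samples are drawn, which is the kind of bias that motivates switching estimators in nonsmooth regions.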
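Abstract (reference translation): In policy gradient reinforcement learning, access to a differentiable model makes 1st-order gradient estimation possible, which speeds up learning relative to relying only on derivative-free 0th-order estimators. Discontinuous dynamics, however, introduce bias and undermine the effectiveness of 1st-order estimators. Prior work addressed this bias by building a confidence interval around the REINFORCE 0th-order gradient estimator and using its bounds to detect discontinuities. The REINFORCE estimator is noisy, though, and the authors find that this method needs task-specific hyperparameter tuning and has low sample efficiency. The paper asks whether such bias is the primary obstacle and which minimal fixes are sufficient. First, it re-examines the standard discontinuous settings from prior work and introduces DDCG, a lightweight test that switches estimators in nonsmooth regions; with a single hyperparameter, DDCG achieves robust performance and remains reliable with small samples. Second, on differentiable robotics control tasks, it presents IVW-H, a per-step inverse-variance implementation that stabilizes variance without explicit discontinuity detection and yields strong results. Together, these findings indicate that estimator switching improves robustness in controlled studies, while careful variance control often dominates in practical deployments.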
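The idea of combining gradient estimators by inverse variance can be sketched as follows. This is generic inverse-variance weighting on a scalar toy objective, not the paper's per-step IVW-H implementation, whose details are not given on this page. The function name `inverse_variance_combine`, the quadratic objective, and the values of `theta`, `sigma`, and `n` are assumptions made for illustration.

```python
import numpy as np


def inverse_variance_combine(samples_a, samples_b, tiny=1e-12):
    # Combine two sample-mean estimates of the same quantity, weighting each
    # by the inverse of its estimated variance. Assumes the two sample sets
    # are independent; `tiny` guards against division by zero.
    mean_a = samples_a.mean()
    mean_b = samples_b.mean()
    var_a = samples_a.var(ddof=1) / samples_a.size  # variance of the mean
    var_b = samples_b.var(ddof=1) / samples_b.size
    prec_a = 1.0 / (var_a + tiny)
    prec_b = 1.0 / (var_b + tiny)
    w_a = prec_a / (prec_a + prec_b)
    return w_a * mean_a + (1.0 - w_a) * mean_b, w_a


# Toy smooth objective (hypothetical): J(theta) = E[(theta + sigma * eps)^2],
# whose exact gradient with respect to theta is 2 * theta.
rng = np.random.default_rng(0)
theta, sigma, n = 1.0, 0.5, 2000

# Independent noise draws for each estimator, matching the independence
# assumption of the weighting formula.
eps_first = rng.standard_normal(n)
eps_zeroth = rng.standard_normal(n)

# 1st-order (pathwise) samples: derivative of x^2 at x = theta + sigma * eps.
first_order = 2.0 * (theta + sigma * eps_first)

# 0th-order (likelihood-ratio) samples: f(x) * eps / sigma.
zeroth_order = (theta + sigma * eps_zeroth) ** 2 * eps_zeroth / sigma

combined, w_first = inverse_variance_combine(first_order, zeroth_order)

print("weight on 1st-order estimator:", w_first)
print("combined gradient estimate:   ", combined)
print("exact gradient:               ", 2.0 * theta)
```

On this smooth objective the 1st-order samples have much lower variance, so the weighting leans heavily on them. Note that estimating the weights from the same samples that are being averaged introduces a small bias of its own, which this sketch does not correct.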

Related papers

A Unified Noise-Curvature View of Loss of Trainability [8.602734307457387]
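Loss of trainability (LoT) in continual learning occurs when steps stop yielding improvement as tasks evolve. The paper introduces two complementary criteria: a batch-size-aware gradient-noise bound and a curvature volatility-controlled bound. Using these thresholds, it builds a simple per-layer scheduler that keeps each layer below a safe limit.
Reference translation of paper (metadata) (2025-09-24T02:11:13Z)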
C-Learner: Constrained Learning for Causal Inference [4.370964009390564]
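The paper proposes a novel debiasing approach that achieves the best of both worlds, producing stable plug-in estimates. Its constrained learning framework solves for the best plug-in estimator under the constraint that the first-order error with respect to the plugged-in quantity is zero.
Reference translation of paper (metadata) (2024-05-15T16:38:28Z)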
Estimating the Hessian Matrix of Ranking Objectives for Stochastic Learning to Rank with Gradient Boosted Trees [63.18324983384337]
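The paper introduces a learning-to-rank method for gradient boosted decision trees (GBDT). Its main contribution is a novel estimator for the second-order derivatives, i.e., the Hessian matrix. The estimator is incorporated into the existing PL-Rank framework.
Reference translation of paper (metadata) (2024-04-18T13:53:32Z)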
Selective Learning: Towards Robust Calibration with Dynamic Regularization [79.92633587914659]
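Miscalibration in deep learning refers to a discrepancy between predicted confidence and performance. The paper introduces Dynamic Regularization (DReg), which aims to learn what should be learned during training and thereby avoids the confidence-adjustment trade-off.
Reference translation of paper (metadata) (2024-02-13T11:25:20Z)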
Learning to Estimate Without Bias [57.82628598276623]
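The Gauss-Markov theorem states that the weighted least squares estimator is the linear minimum variance unbiased estimator (MVUE) in linear models. The paper takes a first step toward extending this result to nonlinear settings using deep learning with bias constraints. A second motivation for BCE is in applications where multiple estimates of the same unknown are averaged to improve performance.
Reference translation of paper (metadata) (2021-10-24T10:23:51Z)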
Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
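Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation. Differentiability is a desirable property because it opens the possibility of optimizing the marginal likelihood as an objective. The paper proposes a differentiable algorithm by abandoning the Metropolis-Hastings steps, which further unlocks mini-batch computation.
Reference translation of paper (metadata) (2021-07-21T17:10:14Z)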
Rao-Blackwellizing the Straight-Through Gumbel-Softmax Gradient Estimator [93.05919133288161]
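The paper shows that the variance of the straight-through variant of the popular Gumbel-Softmax estimator can be reduced through Rao-Blackwellization. This provably reduces the mean squared error. The authors empirically demonstrate that this leads to variance reduction, faster convergence, and improved performance in two unsupervised latent variable models.
Reference translation of paper (metadata) (2020-10-09T22:54:38Z)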

The related papers list is generated automatically from the titles and abstracts of papers on this site.

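This page shows information on the specified paper.
The operator of this site does not guarantee the quality of the site (including all information and translations) and accepts no responsibility for any results arising from its use.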