Fugu-MT 論文翻訳(概要): Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition

論文の概要: Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition

arxiv url: http://arxiv.org/abs/2012.04053v2
Date: Wed, 10 Feb 2021 06:04:44 GMT
ステータス: 翻訳完了
システム内更新日: 2021-05-16 21:06:11.790157
Title: Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition
Title（参考訳）: 逆コストと既知の遷移を考慮した確率的最短経路のミニマックスレグレット
Authors: Liyu Chen, Haipeng Luo, Chen-Yu Wei
Abstract要約: 我々は、敵対コストと既知の移行で最短経路問題を研究します。ミニマックスの後悔は,全情報設定と盗聴フィードバック設定に対して$widetildeO(sqrtDTstar K)$および$widetildeO(sqrtDTstar SA K)$であることを示す。
参考スコア（独自算出の注目度）: 37.6975819766632
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study the stochastic shortest path problem with adversarial costs and known transition, and show that the minimax regret is $\widetilde{O}(\sqrt{DT^\star K})$ and $\widetilde{O}(\sqrt{DT^\star SA K})$ for the full-information setting and the bandit feedback setting respectively, where $D$ is the diameter, $T^\star$ is the expected hitting time of the optimal policy, $S$ is the number of states, $A$ is the number of actions, and $K$ is the number of episodes. Our results significantly improve upon the existing work of (Rosenberg and Mansour, 2020) which only considers the full-information setting and achieves suboptimal regret. Our work is also the first to consider bandit feedback with adversarial costs. Our algorithms are built on top of the Online Mirror Descent framework with a variety of new techniques that might be of independent interest, including an improved multi-scale expert algorithm, a reduction from general stochastic shortest path to a special loop-free case, a skewed occupancy measure space, %the usage of log-barrier with an increasing learning rate schedule, and a novel correction term added to the cost estimators. Interestingly, the last two elements reduce the variance of the learner via positive bias and the variance of the optimal policy via negative bias respectively, and having them simultaneously is critical for obtaining the optimal high-probability bound in the bandit feedback setting.
Abstract（参考訳）: 逆コストと既知の遷移を伴う確率的最短経路問題を調べ、ミニマックスの後悔が$\widetilde{O}(\sqrt{DT^\star K})$および$\widetilde{O}(\sqrt{DT^\star SA K})$であることを示す。本研究は, 完全情報設定のみを考慮し, 準最適後悔を実現する, 既存の作業 (Rosenberg and Mansour, 2020) を大幅に改善した。我々の研究は、敵のコストによる盗聴フィードバックを初めて検討した。提案手法は,マルチスケール・エキスパートアルゴリズムの改良,一般確率的最短経路から特別なループフリーケースへの縮小,スキュード占有度測定空間,学習率スケジュールの増加に伴うログバーリアーの利用率,コスト推定器に追加した新しい補正項など,独立した関心を持つ新たな手法を駆使して,オンラインミラー降下フレームワーク上に構築されている。興味深いことに、最後の2つの要素は、正のバイアスによる学習者の分散と負のバイアスによる最適方針の分散をそれぞれ減少させ、同時にそれらを持つことは、バンディットフィードバック設定に束縛された最適な高い確率を得るために重要である。

関連論文リスト

Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback [58.66941279460248]
人からのフィードバックから学ぶことは、大言語モデル(LLM)のような生成モデルを調整する上で重要な役割を果たす本稿では,このドメイン内のモデルについて考察する。-文脈的デュエルバンディット(contextual dueling bandits)と,正の選好ラベルを相手によって反転させることができる対向フィードバック(reversarial feedback)について考察する。本稿では,不確実性重み付き最大推定に基づく頑健なコンテキストデュエルバンドイット(RCDB)を提案する。
論文参考訳（メタデータ） (2024-04-16T17:59:55Z)
Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path [80.60592344361073]
線形混合遷移カーネルを用いた最短経路(SSP)問題について検討する。エージェントは繰り返し環境と対話し、累積コストを最小化しながら特定の目標状態に到達する。既存の作業は、イテレーションコスト関数の厳密な下限や、最適ポリシーに対する期待長の上限を仮定することが多い。
論文参考訳（メタデータ） (2024-02-14T07:52:00Z)
Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
本研究は,リスク許容度が$tau$のCVaR(Conditional Value at Risk)の目的に着目し,リスクに敏感な強化学習(RL)について検討する。ミニマックスCVaRの後悔率は$Omega(sqrttau-1AK)$で、$A$はアクションの数、$K$はエピソード数である。我々は,このアルゴリズムが連続性仮定の下で$widetilde O(tau-1sqrtSAK)$の最適後悔を達成し,一般に近似することを示す。
論文参考訳（メタデータ） (2023-02-07T02:22:31Z)
Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret [144.6358229217845]
エージェントが目標状態に到達する前に蓄積される期待コストを最小限に抑えるために,最短経路(ssp)設定で学習する問題について検討する。我々は,経験的遷移を慎重に歪曲し,探索ボーナスで経験的コストを摂動する新しいモデルベースアルゴリズムEB-SSPを設計する。私達はEB-SSPが$widetildeO(B_star sqrtS A K)$のミニマックスの後悔率を達成することを証明します。
論文参考訳（メタデータ） (2021-04-22T17:20:48Z)
Minimax Regret for Stochastic Shortest Path [63.45407095296692]
我々は、エージェントが最小の総予想コストで目標状態に達する必要がある最短パス(SSP)問題を研究します。この設定に対するminimaxの後悔は、$widetilde O(B_star sqrt|S| |A|K)$であり、$B_star$は任意の状態から最適なポリシーの予想コストに拘束されることを示しています。本アルゴリズムは, 有限水平MDPにおける強化学習の新たな削減を基礎として, エピソードごとのインタイム動作を行う。
論文参考訳（メタデータ） (2021-03-24T10:11:49Z)
Finding the Stochastic Shortest Path with Low Regret: The Adversarial Cost and Unknown Transition Case [29.99619764839178]
敵の費用と未知の遷移を伴う最短経路問題に対して,大きな進展をみせている。具体的には、全情報設定を後悔する$widetildeO(sqrtS3A2DT_star K)を実現するアルゴリズムを開発する。私たちはまた、最も難しい組み合わせとして、敵のコストによる盗聴フィードバックと未知の移行を最初に検討しています。
論文参考訳（メタデータ） (2021-02-10T06:33:04Z)
Near-optimal Regret Bounds for Stochastic Shortest Path [63.029132134792555]
最短経路 (SSP) は計画と制御においてよく知られた問題であり、エージェントは最小の総コストで目標状態に到達する必要がある。任意の学習アルゴリズムは、最悪の場合、少なくとも$Omega(B_star sqrt|S| |A|K)$後悔しなければならない。
論文参考訳（メタデータ） (2020-02-23T09:10:14Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。