Fugu-MT 論文翻訳(概要): Accommodating Picky Customers: Regret Bound and Exploration Complexity for Multi-Objective Reinforcement Learning

論文の概要: Accommodating Picky Customers: Regret Bound and Exploration Complexity for Multi-Objective Reinforcement Learning

arxiv url: http://arxiv.org/abs/2011.13034v3
Date: Wed, 27 Oct 2021 20:28:04 GMT
ステータス: 翻訳完了
システム内更新日: 2022-09-21 02:12:53.947706
Title: Accommodating Picky Customers: Regret Bound and Exploration Complexity for Multi-Objective Reinforcement Learning
Title（参考訳）: 選抜顧客への適応:多目的強化学習における後悔と探索の複雑さ
Authors: Jingfeng Wu, Vladimir Braverman, Lin F. Yang
Abstract要約: 目的と目的のバランスをとる多目的強化学習について、好みを用いて検討する。我々はこの問題をマルコフ決定過程における叙述的学習問題として定式化する。モデルに基づくアルゴリズムは、最小限の最小限のリセットを$widetildemathcalObigl(sqrtmind,Scdot H3 SA/epsilon2bigr)$とする。
参考スコア（独自算出の注目度）: 43.75491612671571
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper we consider multi-objective reinforcement learning where the objectives are balanced using preferences. In practice, the preferences are often given in an adversarial manner, e.g., customers can be picky in many applications. We formalize this problem as an episodic learning problem on a Markov decision process, where transitions are unknown and a reward function is the inner product of a preference vector with pre-specified multi-objective reward functions. We consider two settings. In the online setting, the agent receives a (adversarial) preference every episode and proposes policies to interact with the environment. We provide a model-based algorithm that achieves a nearly minimax optimal regret bound $\widetilde{\mathcal{O}}\bigl(\sqrt{\min\{d,S\}\cdot H^2 SAK}\bigr)$, where $d$ is the number of objectives, $S$ is the number of states, $A$ is the number of actions, $H$ is the length of the horizon, and $K$ is the number of episodes. Furthermore, we consider preference-free exploration, i.e., the agent first interacts with the environment without specifying any preference and then is able to accommodate arbitrary preference vector up to $\epsilon$ error. Our proposed algorithm is provably efficient with a nearly optimal trajectory complexity $\widetilde{\mathcal{O}}\bigl({\min\{d,S\}\cdot H^3 SA}/{\epsilon^2}\bigr)$. This result partly resolves an open problem raised by \citet{jin2020reward}.
Abstract（参考訳）: 本稿では,目的のバランスをとる多目的強化学習について検討する。実際には、その好みはしばしば敵対的な方法で与えられ、例えば、顧客は多くのアプリケーションで選択される。我々は,この問題をマルコフ決定過程における認識論的学習問題として定式化する。そこでは遷移が未知であり,報奨関数が事前特定された多目的報奨関数を持つ選好ベクトルの内積である。 2つの設定を考えます。オンライン環境では、エージェントは各エピソードごとに(敵対的な)好みを受け取り、環境と対話するためのポリシーを提案する。モデルに基づくアルゴリズムでは、最小限の最小限の後悔値を持つ$\widetilde{\mathcal{O}}\bigl(\sqrt{\min\{d,S\}\cdot H^2 SAK}\bigr)$で、$d$は目的数、$S$は行動数、$A$は行動数、$H$は地平線の長さ、$K$はエピソード数である。さらに、選好のない探索、すなわち、エージェントはまず選好を指定せずに環境と相互作用し、次に$\epsilon$エラーまで任意の選好ベクトルを適応することができる。提案アルゴリズムは, ほぼ最適軌跡複雑性$\widetilde{\mathcal{O}}\bigl({\min\{d,S\}\cdot H^3 SA}/{\epsilon^2}\bigr)$で有効である。この結果は部分的に \citet{jin2020reward} によって提起された開問題を解く。

関連論文リスト

Online Matrix Completion: A Collaborative Approach with Hott Items [19.781869063637387]
M$ユーザ,$N$アイテム,$T$ラウンド,未知のランク-$r$報酬行列$Rin mathbbRMtimes N$のオンライン設定における下位行列補完問題について検討する。
論文参考訳（メタデータ） (2024-08-11T18:49:52Z)
Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes [21.77276136591518]
我々はマルコフ決定過程(MDPs)のための証明可能なモデルフリー強化学習(RL)アルゴリズムを開発した。シミュレータ設定では,$widetildeO left(fracSAmathrmsp(h*)epsilon2+fracS2Amathrmsp(h*)epsilon2right)$サンプルを用いて,$epsilon$-optimal Policyを求める。
論文参考訳（メタデータ） (2023-06-28T17:43:19Z)
Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
本研究では,有限水平マルコフ決定過程(MDP)によってモデル化されたエピソディック強化学習(RL)問題をバッチ数に制約を加えて検討する。我々は,$tildeO(sqrtSAH3Kln (1/delta))$tildeO(cdot)をほぼ最適に後悔するアルゴリズムを設計し,$(S,A,H,K)$の対数項を$K$で隠蔽する。技術的貢献は2つある: 1) 探索のためのほぼ最適設計スキーム
論文参考訳（メタデータ） (2022-10-15T09:22:22Z)
Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
報酬混合マルコフ決定過程(RMMDP)におけるエピソード強化学習の検討我々のゴールは、そのようなモデルにおける時間段階の累積報酬をほぼ最大化する、ほぼ最適に近いポリシーを学ぶことである。
論文参考訳（メタデータ） (2022-10-05T22:52:00Z)
Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary Environments [40.027926921772355]
目標指向強化学習における動的後悔の研究を行う。この下位境界における$Delta_c$と$Delta_P$の異なる役割は、コストと遷移を別々に見積もるアルゴリズムを設計するきっかけとなった。
論文参考訳（メタデータ） (2022-05-25T20:29:01Z)
Reward-Free Model-Based Reinforcement Learning with Linear Function Approximation [92.99933928528797]
エピソードマルコフ決定過程(MDP)に対する線形関数近似を用いたモデルに基づく無報酬強化学習について検討する。計画段階では、特定の報酬関数が与えられ、探索フェーズから収集したサンプルを使用して良い政策を学ぶ。任意の報酬関数に対して$epsilon$-optimal Policyを得るには,最大$tilde O(H4d(H + d)epsilon-2)$ episodesをサンプリングする必要がある。
論文参考訳（メタデータ） (2021-10-12T23:03:58Z)
Gap-Dependent Unsupervised Exploration for Reinforcement Learning [40.990467706237396]
タスクに依存しない強化学習のための効率的なアルゴリズムを提案する。このアルゴリズムは1/epsilon cdot (H3SA / rho + H4 S2 A) の$widetildemathcalOのみを探索する。情報理論上、この境界は$rho Theta (1/(HS))$と$H>1$に対してほぼ厳密であることを示す。
論文参考訳（メタデータ） (2021-08-11T20:42:46Z)
Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions [53.77572276969548]
我々は、C$2$UCBアルゴリズムが分割マトロイド制約に対して最適な後悔結合$tildeO(dsqrtkT + dk)$を有することを示した。一般的な制約に対して,C$2$UCBアルゴリズムで腕の報酬推定値を変更するアルゴリズムを提案する。
論文参考訳（メタデータ） (2021-01-20T04:29:18Z)
Nearly Minimax Optimal Reward-free Reinforcement Learning [88.75843804630772]
本稿では、特にバッチ強化学習に適した報酬不要強化学習フレームワークと、複数の報酬関数に対するポリシーを必要とするシナリオについて検討する。 textbfStaged textbfSampling + textbfTruncated textbfPlanning (algoname) という新しい効率的なアルゴリズムを提供しています。
論文参考訳（メタデータ） (2020-10-12T17:51:19Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。