Fugu-MT 論文翻訳(概要): Safe Linear Bandits over Unknown Polytopes

論文の概要: Safe Linear Bandits over Unknown Polytopes

arxiv url: http://arxiv.org/abs/2209.13694v3
Date: Mon, 1 Jul 2024 15:26:27 GMT
ステータス: 翻訳完了
システム内更新日: 2024-07-02 18:41:09.832418
Title: Safe Linear Bandits over Unknown Polytopes
Title（参考訳）: 未知のポリトープ上の安全リニアバンド
Authors: Aditya Gangrade, Tianrui Chen, Venkatesh Saligrama,
Abstract要約: 安全線形バンディット問題(英: safe linear bandit problem、SLB)は、線形プログラミングのオンライン手法である。ポリトープ上でのSLBの有効性とスムーズな安全性のトレードオフについて検討した。
参考スコア（独自算出の注目度）: 39.177982674455784
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The safe linear bandit problem (SLB) is an online approach to linear programming with unknown objective and unknown roundwise constraints, under stochastic bandit feedback of rewards and safety risks of actions. We study the tradeoffs between efficacy and smooth safety costs of SLBs over polytopes, and the role of aggressive doubly-optimistic play in avoiding the strong assumptions made by extant pessimistic-optimistic approaches. We first elucidate an inherent hardness in SLBs due the lack of knowledge of constraints: there exist `easy' instances, for which suboptimal extreme points have large `gaps', but on which SLB methods must still incur $\Omega(\sqrt{T})$ regret or safety violations, due to an inability to resolve unknown optima to arbitrary precision. We then analyse a natural doubly-optimistic strategy for the safe linear bandit problem, DOSS, which uses optimistic estimates of both reward and safety risks to select actions, and show that despite the lack of knowledge of constraints or feasible points, DOSS simultaneously obtains tight instance-dependent $O(\log^2 T)$ bounds on efficacy regret, and $\tilde O(\sqrt{T})$ bounds on safety violations. Further, when safety is demanded to a finite precision, violations improve to $O(\log^2 T).$ These results rely on a novel dual analysis of linear bandits: we argue that \algoname proceeds by activating noisy versions of at least $d$ constraints in each round, which allows us to separately analyse rounds where a `poor' set of constraints is activated, and rounds where `good' sets of constraints are activated. The costs in the former are controlled to $O(\log^2 T)$ by developing new dual notions of gaps, based on global sensitivity analyses of linear programs, that quantify the suboptimality of each such set of constraints. The latter costs are controlled to $O(1)$ by explicitly analysing the solutions of optimistic play.
Abstract（参考訳）: 安全線形バンディット問題(英: safe linear bandit problem, SLB)とは、線形プログラミングにおいて、報酬の確率的ランディットフィードバックと行動の安全性リスクを考慮し、未知の目的と未知の円周的制約を持つオンライン手法である。本研究では,ポリトープ上でのSLBの有効性とスムーズな安全コストのトレードオフと,既存の悲観的最適化アプローチによる強い仮定を回避するための積極的二重最適化プレイの役割について検討する。制約の知識の欠如により、SLBの固有の硬さを最初に解明する: 最適でない極小点が大きな 'ギャップ' を持つ 'easy' インスタンスが存在するが、SLB メソッドが依然として$\Omega(\sqrt{T})$ 後悔や安全違反を起こさなければならないのは、未知のオプティマを任意の精度で解くことができないためである。次に、安全線形バンディット問題に対する自然な2倍最適化戦略であるDOSSについて、報酬リスクと安全リスクの両方を楽観的に推定して行動を選択するとともに、制約や実現可能な点の知識が欠如しているにもかかわらず、DOSSは後悔に対する厳密なインスタンス依存の$O(\log^2T)と、安全違反に関する$\tilde O(\sqrt{T})$バウンドを同時に得ることを示す。さらに、安全性を有限精度で要求すると、違反は$O(\log^2T)に改善される。これらの結果は、線形バンディットの新たな双対解析に依存している:我々は、 \algonameは各ラウンドで少なくとも$d$制約のノイズバージョンを活性化することで進行し、これにより、'poor' 制約セットが活性化されたラウンドと 'good' 制約セットが活性化されたラウンドを別々に分析できると主張している。前者のコストを$O(\log^2 T)$に制御し、線形プログラムの大域的感度解析に基づいてギャップの新たな双対概念を開発し、これらの制約の集合の準最適性を定量化する。後者のコストは楽観的なプレイの解を明示的に分析することで$O(1)$に制御される。

論文の概要: Safe Linear Bandits over Unknown Polytopes

関連論文リスト