Fugu-MT Paper Translation (Abstract): Exact Unlearning in Reinforcement Learning

Paper overview: Exact Unlearning in Reinforcement Learning

arxiv url: http://arxiv.org/abs/2606.04182v1
Date: Tue, 02 Jun 2026 19:54:44 GMT
Status: Translation complete
Last updated in system: 2026-06-04 20:44:18.36025
Title: Exact Unlearning in Reinforcement Learning
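Title (reference translation): Exact Unlearning in Reinforcement Learning
Authors: Thanh Nguyen-Tang, Raman Arora
Abstract summary: We show that there exists a $ρ$-TV-stable reinforcement learning algorithm that supports an exact unlearning procedure. We also establish a lower bound of $Ω(H\sqrt{SAT} + SAH/ρ)$ for $ρ$-TV-stable RL algorithms, showing that our algorithm is nearly minimax optimal.
Reference score (independently computed attention score): 43.97082684655346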
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We formulate the problem of exact unlearning in reinforcement learning, where the goal is to design an efficient framework that enables the removal of any user's data upon deletion request, i.e., the online learner's output after unlearning is indistinguishable from what would have been produced had the deleted user never interacted with the learner. For any $ρ>0$, we show that there exists a reinforcement learning (RL) algorithm that is $ρ$-TV-stable and supports an exact unlearning procedure whose expected computational cost is only a $ρ\sqrt{\ln T}$ fraction of the computational cost of retraining from scratch. We construct such a $ρ$-TV-stable RL algorithm for tabular Markov decision processes (MDPs), which achieves a regret bound of $\mathcal{O}(H^2 \sqrt{SAT} + H^3 S^2 A + H^{2.5} S^2 A/ρ)$, where $S$, $A$, $H$, and $T$ denote the number of states, the number of actions, the episode horizon, and the number of episodes, respectively. We also establish a lower bound of $Ω(H\sqrt{SAT} + SAH/ρ)$ for $ρ$-TV-stable RL algorithms, showing that our algorithm is nearly minimax optimal.
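Abstract (reference translation): We formulate the problem of exact unlearning in reinforcement learning: the goal is to design an efficient framework that can remove any user's data upon a deletion request, so that the online learner's output after unlearning is indistinguishable from the output it would have produced had the deleted user never interacted with it. For any $ρ>0$, there exists an RL algorithm that is $ρ$-TV-stable and supports an exact unlearning procedure whose expected computational cost is only a $ρ\sqrt{\ln T}$ fraction of the cost of retraining from scratch. We construct such a $ρ$-TV-stable RL algorithm for tabular Markov decision processes (MDPs) that achieves a regret bound of $\mathcal{O}(H^2 \sqrt{SAT} + H^3 S^2 A + H^{2.5} S^2 A/ρ)$, where $S$, $A$, $H$, and $T$ denote the number of states, the number of actions, the episode horizon, and the number of episodes, respectively. We also establish a lower bound of $Ω(H\sqrt{SAT} + SAH/ρ)$ for $ρ$-TV-stable RL algorithms, showing that our algorithm is nearly minimax optimal.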
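To get a feel for the quantities quoted in the abstract, the following is a minimal Python sketch that evaluates the upper bound, the lower bound, and the unlearning cost fraction for concrete parameter values. It is an illustration only: the function names are hypothetical, all constants and logarithmic factors hidden by the $\mathcal{O}$ and $Ω$ notation are dropped, and the sample values of $S$, $A$, $H$, $T$, and $ρ$ are arbitrary assumptions, not taken from the paper.

```python
import math


def regret_upper_bound(S, A, H, T, rho):
    """Terms of O(H^2 sqrt(SAT) + H^3 S^2 A + H^2.5 S^2 A / rho), constants dropped."""
    return (H ** 2) * math.sqrt(S * A * T) + (H ** 3) * (S ** 2) * A + (H ** 2.5) * (S ** 2) * A / rho


def regret_lower_bound(S, A, H, T, rho):
    """Terms of Omega(H sqrt(SAT) + SAH / rho), constants dropped."""
    return H * math.sqrt(S * A * T) + S * A * H / rho


def unlearning_cost_fraction(rho, T):
    """Expected unlearning cost as a fraction of retraining: rho * sqrt(ln T), constants dropped."""
    return rho * math.sqrt(math.log(T))


if __name__ == "__main__":
    # Arbitrary illustrative values (assumptions, not from the paper).
    S, A, H, T, rho = 10, 5, 20, 10 ** 6, 0.01
    print("upper bound terms:", regret_upper_bound(S, A, H, T, rho))
    print("lower bound terms:", regret_lower_bound(S, A, H, T, rho))
    print("unlearning cost fraction:", unlearning_cost_fraction(rho, T))
```

The sketch makes the trade-off visible: a smaller $ρ$ lowers the unlearning cost fraction but inflates the $1/ρ$ terms in both regret bounds. Comparing only the leading $\sqrt{T}$ terms, $H^2\sqrt{SAT}$ and $H\sqrt{SAT}$, the upper and lower bounds differ by a factor of $H$.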

Related papers

Regret-Optimal Federated Transfer Learning for Kernel Regression with Applications in American Option Pricing [8.723136784230906]
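This paper proposes an optimal iterative scheme for federated transfer learning in which a central planner has access to the datasets. Our goal is to minimize the cumulative deviation of the generated parameters $\{\theta_i(t)\}_{t=0}^T$. By exploiting symmetries within the regret-optimal algorithm, we develop a nearly regret-optimal heuristic that runs with $\mathcal{O}(Np^2)$ fewer elementary operations.
Paper reference translation (metadata) (2023-09-08T19:17:03Z)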
Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes [21.77276136591518]
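We develop provably efficient model-free reinforcement learning (RL) algorithms for Markov decision processes (MDPs). In the simulator setting, we find an $\epsilon$-optimal policy using $\widetilde{O}\left(\frac{SA\,\mathrm{sp}(h^*)}{\epsilon^2}+\frac{S^2A\,\mathrm{sp}(h^*)}{\epsilon^2}\right)$ samples.
Paper reference translation (metadata) (2023-06-28T17:43:19Z)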
Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves $\tilde{O}\big((d+\log(|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping that linearly parametrizes the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
Paper reference translation (metadata) (2023-05-15T05:37:32Z)
Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
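This work studies the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov decision processes (MDPs) under a constraint on the number of batches. We design an algorithm that achieves near-optimal regret of $\tilde{O}(\sqrt{SAH^3 K \ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms in $(S,A,H,K)$. The technical contributions are twofold: 1) a near-optimal design scheme for exploration
Paper reference translation (metadata) (2022-10-15T09:22:22Z)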
Sample-Efficient Reinforcement Learning with loglog(T) Switching Cost [31.04961854943877]
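We propose a new algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^4 S^2 A T})$ while requiring a switching cost of $O(HSA \log\log T)$. As a byproduct, our new algorithm yields a reward-free exploration algorithm with an optimal switching cost of $O(HSA)$.
Paper reference translation (metadata) (2022-02-13T19:01:06Z)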
Contextual Recommendations and Low-Regret Cutting-Plane Algorithms [49.91214213074933]
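This paper considers the following variant of contextual linear bandits, motivated by routing applications in navigation engines and recommendation systems. We design novel cutting-plane algorithms with low "regret", defined as the total distance between the true point $w^*$ and the hyperplanes returned by the separation oracle.
Paper reference translation (metadata) (2021-06-09T05:39:05Z)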
Provably Efficient Reinforcement Learning with Linear Function Approximation Under Adaptivity Constraints [94.76881135901753]
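Two popular limited-adaptivity models are the batch learning model and the rare policy switch model. The proposed LSVI-UCB-Batch algorithm achieves $\tilde{O}(\sqrt{d^3H^3T} + dHT/B)$ regret. For the rare policy switch model, the proposed LSVI-UCB-RareSwitch algorithm enjoys $\tilde{O}(\sqrt{d^3H^3T[1+T/(dH)]^{dH/B}})$ regret.
Paper reference translation (metadata) (2021-01-06T18:56:07Z)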
Nearly Minimax Optimal Reinforcement Learning for Discounted MDPs [99.59319332864129]
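We show that UCBVI-$\gamma$ achieves $\tilde{O}\big(\sqrt{SAT}/(1-\gamma)^{1.5}\big)$ regret, where $S$ is the number of states, $A$ is the number of actions, $\gamma$ is the discount factor, and $T$ is the number of steps. In addition, we construct a class of hard MDPs and show that, for any algorithm, the expected regret is at least $\tilde{\Omega}\big(\sqrt{SAT}/(1-\gamma)^{1.5}\big)$.
Paper reference translation (metadata) (2020-10-01T17:57:47Z)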

The related papers list is generated automatically from the titles and abstracts of papers on this site.

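This page shows information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from its use.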