Fugu-MT Paper Translation (Abstract): Sparse Offline Reinforcement Learning with Corruption Robustness

Paper summary: Sparse Offline Reinforcement Learning with Corruption Robustness

arxiv url: http://arxiv.org/abs/2512.24768v1
Date: Wed, 31 Dec 2025 10:28:25 GMT
Status: Translation complete
Last updated in system: 2026-01-01 23:27:28.62448
Title: Sparse Offline Reinforcement Learning with Corruption Robustness
Title (reference translation): Sparse offline reinforcement learning with robustness to corruption
Authors: Nam Phuong Tran, Andi Nika, Goran Radanovic, Long Tran-Thanh, Debmalya Mandal
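Abstract summary: We study robustness to strong data corruption in offline sparse reinforcement learning (RL). In our setting, an adversary may arbitrarily perturb a fraction of the trajectories collected from a high-dimensional but sparse Markov decision process. This work provides the first non-vacuous guarantees in high-dimensional sparse MDPs under single-policy concentrability coverage and corruption.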
Reference score (independently computed attention score): 24.193236728009918
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We investigate robustness to strong data corruption in offline sparse reinforcement learning (RL). In our setting, an adversary may arbitrarily perturb a fraction of the collected trajectories from a high-dimensional but sparse Markov decision process, and our goal is to estimate a near-optimal policy. The main challenge is that, in the high-dimensional regime where the number of samples $N$ is smaller than the feature dimension $d$, exploiting sparsity is essential for obtaining non-vacuous guarantees but has not been systematically studied in offline RL. We analyse the problem under uniform coverage and sparse single-concentrability assumptions. While Least-Squares Value Iteration (LSVI), a standard approach for robust offline RL, performs well under uniform coverage, we show that integrating sparsity into LSVI is unnatural, and its analysis may break down due to overly pessimistic bonuses. To overcome this, we propose actor-critic methods with sparse robust estimator oracles, which avoid the use of pointwise pessimistic bonuses and provide the first non-vacuous guarantees for sparse offline RL under single-policy concentrability coverage. Moreover, we extend our results to the contaminated setting and show that our algorithm remains robust under strong contamination. Our results provide the first non-vacuous guarantees in high-dimensional sparse MDPs with single-policy concentrability coverage and corruption, showing that learning a near-optimal policy remains possible in regimes where traditional robust offline RL techniques may fail.
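Abstract (reference translation): We study robustness to strong data corruption in offline sparse reinforcement learning (RL). In our setting, an adversary can arbitrarily perturb a fraction of the trajectories collected from a high-dimensional but sparse Markov decision process, and our goal is to estimate a near-optimal policy. The main difficulty is that, in the high-dimensional regime where the number of samples $N$ is smaller than the feature dimension $d$, exploiting sparsity is essential for obtaining non-vacuous guarantees, yet it has not been studied systematically in offline RL. We analyse the problem under uniform coverage and sparse single-concentrability assumptions. Least-Squares Value Iteration (LSVI), a standard method for robust offline RL, works well under uniform coverage, but integrating sparsity into LSVI is unnatural, and its analysis may break down because of overly pessimistic bonuses. To address this, we propose actor-critic methods with sparse robust estimator oracles, which avoid pointwise pessimistic bonuses and give the first non-vacuous guarantees for sparse offline RL under single-policy concentrability coverage. We further extend the results to the contaminated setting and show that the algorithm remains robust under strong contamination. The results show that learning a near-optimal policy remains possible in high-dimensional sparse MDPs with single-policy concentrability coverage and corruption, in regimes where traditional robust offline RL techniques may fail.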
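Illustrative sketch (not from the paper): the abstract names two ingredients, exploiting sparsity when $N < d$ and tolerating corrupted samples. The toy example below combines them in the simplest setting we could think of, sparse linear regression solved by trimmed iterative hard thresholding. It is not the paper's actor-critic algorithm or its sparse robust estimator oracle. The function names, problem sizes, step size, and corruption pattern are all assumptions made for illustration.

```python
import random


def hard_threshold(theta, s):
    """Keep the s largest-magnitude entries of theta and zero out the rest."""
    order = sorted(range(len(theta)), key=lambda j: abs(theta[j]), reverse=True)
    keep = set(order[:s])
    return [v if j in keep else 0.0 for j, v in enumerate(theta)]


def smallest_residual_indices(residuals, n_keep):
    """Indices of the n_keep samples with the smallest absolute residual."""
    order = sorted(range(len(residuals)), key=lambda i: abs(residuals[i]))
    return sorted(order[:n_keep])


def trimmed_iht(X, y, s, n_trim, step=0.5, iters=50):
    """Trimmed iterative hard thresholding for s-sparse least squares.

    Each iteration drops the n_trim samples with the largest residuals
    (a crude guard against corrupted responses), takes an averaged
    gradient step on the remaining samples, and projects back onto
    s-sparse vectors.
    """
    n, d = len(X), len(X[0])
    theta = [0.0] * d
    for _ in range(iters):
        residuals = [
            y[i] - sum(X[i][j] * theta[j] for j in range(d)) for i in range(n)
        ]
        kept = smallest_residual_indices(residuals, n - n_trim)
        grad = [
            sum(X[i][j] * residuals[i] for i in kept) / len(kept)
            for j in range(d)
        ]
        theta = hard_threshold(
            [theta[j] + step * grad[j] for j in range(d)], s
        )
    return theta


def make_corrupted_data(n=40, d=60, n_bad=2, seed=0):
    """Synthetic data with fewer samples than features (n < d).

    The true parameter is 3-sparse; the first n_bad responses are
    overwritten with a large value to mimic adversarial corruption.
    """
    rng = random.Random(seed)
    theta_star = [3.0, -3.0, 3.0] + [0.0] * (d - 3)
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = [
        sum(a * b for a, b in zip(row, theta_star)) + rng.gauss(0.0, 0.05)
        for row in X
    ]
    for i in range(n_bad):
        y[i] = 100.0  # assumed corruption pattern, for illustration only
    return X, y, theta_star


if __name__ == "__main__":
    X, y, theta_star = make_corrupted_data()
    theta_hat = trimmed_iht(X, y, s=3, n_trim=2)
    print("estimated support:", [j for j, v in enumerate(theta_hat) if v != 0.0])
```

Because the estimate is constrained to have at most `s` nonzero entries, the problem stays well posed even though there are fewer samples than features. The trimming step is one simple way to limit the influence of a small number of overwritten responses. Whether the true support is recovered depends on the random draw and on the assumed parameters.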

Related papers