Fugu-MT Paper Translation (Abstract): Prior-Aligned Meta-RL: Thompson Sampling with Learned Priors and Guarantees in Finite-Horizon MDPs

Paper overview: Prior-Aligned Meta-RL: Thompson Sampling with Learned Priors and Guarantees in Finite-Horizon MDPs

arxiv url: http://arxiv.org/abs/2510.05446v1
Date: Mon, 06 Oct 2025 23:20:49 GMT
Status: Translation complete
Last updated in system: 2025-10-08 17:57:08.023758
Title: Prior-Aligned Meta-RL: Thompson Sampling with Learned Priors and Guarantees in Finite-Horizon MDPs
Title (reference translation): Prior-Aligned Meta-RL: Thompson Sampling with Learned Priors and Guarantees in Finite-Horizon MDPs
Authors: Runlin Zhou, Chixiang Chen, Elynn Chen
Abstract summary: This work studies meta-reinforcement learning in finite-horizon MDPs. The results give meta-regret guarantees for Thompson-style RL with learned Q-priors.
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study meta-reinforcement learning in finite-horizon MDPs where related tasks share similar structures in their optimal action-value functions. Specifically, we posit a linear representation $Q^*_h(s,a)=\Phi_h(s,a)\,\theta^{(k)}_h$ and place a Gaussian meta-prior $ \mathcal{N}(\theta^*_h,\Sigma^*_h)$ over the task-specific parameters $\theta^{(k)}_h$. Building on randomized value functions, we propose two Thompson-style algorithms: (i) MTSRL, which learns only the prior mean and performs posterior sampling with the learned mean and known covariance; and (ii) $\text{MTSRL}^{+}$, which additionally estimates the covariance and employs prior widening to control finite-sample estimation error. Further, we develop a prior-alignment technique that couples the posterior under the learned prior with a meta-oracle that knows the true prior, yielding meta-regret guarantees: we match prior-independent Thompson sampling in the small-task regime and strictly improve with more tasks once the prior is learned. Concretely, for known covariance we obtain $\tilde{O}(H^{4}S^{3/2}\sqrt{ANK})$ meta-regret, and with learned covariance $\tilde{O}(H^{4}S^{3/2}\sqrt{AN^3K})$; both recover a better behavior than prior-independent after $K \gtrsim \tilde{O}(H^2)$ and $K \gtrsim \tilde{O}(N^2H^2)$, respectively. Simulations on a stateful recommendation environment (with feature and prior misspecification) show that after brief exploration, MTSRL/MTSRL$^+$ track the meta-oracle and substantially outperform prior-independent RL and bandit-only meta-baselines. Our results give the first meta-regret guarantees for Thompson-style RL with learned Q-priors, and provide practical recipes (warm-start via RLSVI, OLS aggregation, covariance widening) for experiment-rich settings.
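Abstract (reference translation): This work studies meta-reinforcement learning in finite-horizon MDPs. Specifically, we assume a linear representation $Q^*_h(s,a)=\Phi_h(s,a)\,\theta^{(k)}_h$ and place a Gaussian meta-prior $\mathcal{N}(\theta^*_h,\Sigma^*_h)$ over the task-specific parameters $\theta^{(k)}_h$. Building on randomized value functions, we propose two Thompson-style algorithms: (i) MTSRL, which learns only the prior mean and performs posterior sampling with the learned mean and the known covariance; and (ii) $\text{MTSRL}^{+}$, which additionally estimates the covariance and uses prior widening to control finite-sample estimation error. Further, we develop a prior-alignment technique that couples the posterior under the learned prior with a meta-oracle that knows the true prior, yielding meta-regret guarantees. Concretely, we obtain $\tilde{O}(H^{4}S^{3/2}\sqrt{ANK})$ meta-regret for known covariance and $\tilde{O}(H^{4}S^{3/2}\sqrt{AN^3K})$ with learned covariance. After brief exploration, MTSRL/MTSRL$^+$ are shown to track the meta-oracle and substantially outperform prior-independent RL and bandit-only meta-baselines. These results give meta-regret guarantees for Thompson-style RL with learned Q-priors and provide practical recipes (warm-start via RLSVI, OLS aggregation, covariance widening) for experiment-rich settings.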
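The abstract's main ingredients (a Gaussian prior over linear parameters, OLS aggregation across tasks, covariance widening, and posterior sampling) can be illustrated with a minimal NumPy sketch. This is not the paper's MTSRL or MTSRL$^+$ algorithm: it treats a single regression stage with synthetic targets rather than per-step Bellman targets, and every name here (`ols_estimate`, `estimate_prior`, `gaussian_posterior`, the additive `widen` term, the noise levels) is a hypothetical choice made for illustration.

```python
import numpy as np


def ols_estimate(Phi, y):
    """Ordinary least-squares estimate of theta for y ~ Phi @ theta."""
    theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta_hat


def estimate_prior(theta_hats, widen=0.0):
    """Aggregate per-task estimates into a prior mean and covariance.

    `widen` adds widen * I to the sample covariance; this additive form
    is an assumption, not the widening rule used in the paper.
    """
    theta_hats = np.asarray(theta_hats)
    d = theta_hats.shape[1]
    mean = theta_hats.mean(axis=0)
    cov = np.atleast_2d(np.cov(theta_hats, rowvar=False)) + widen * np.eye(d)
    return mean, cov


def gaussian_posterior(Phi, y, prior_mean, prior_cov, noise_var=1.0):
    """Posterior of theta for y = Phi @ theta + N(0, noise_var) noise,
    with prior theta ~ N(prior_mean, prior_cov)."""
    prior_prec = np.linalg.inv(prior_cov)
    post_prec = prior_prec + Phi.T @ Phi / noise_var
    post_cov = np.linalg.inv(post_prec)
    post_cov = (post_cov + post_cov.T) / 2  # remove numerical asymmetry
    post_mean = post_cov @ (prior_prec @ prior_mean + Phi.T @ y / noise_var)
    return post_mean, post_cov


# Synthetic setup (all values are illustrative assumptions).
rng = np.random.default_rng(0)
d, n_tasks, n_obs = 3, 50, 200
theta_star = np.array([1.0, -0.5, 0.25])   # true prior mean
sigma_star = 0.1 * np.eye(d)               # true prior covariance

# Past tasks: draw task parameters from the prior, then fit each by OLS.
theta_hats = []
for _ in range(n_tasks):
    theta_k = rng.multivariate_normal(theta_star, sigma_star)
    Phi = rng.normal(size=(n_obs, d))
    y = Phi @ theta_k + rng.normal(scale=0.5, size=n_obs)
    theta_hats.append(ols_estimate(Phi, y))

prior_mean, prior_cov = estimate_prior(theta_hats, widen=0.05)

# New task with only a few observations: update the learned prior and
# draw one Thompson sample of the parameter vector.
theta_new = rng.multivariate_normal(theta_star, sigma_star)
Phi_new = rng.normal(size=(5, d))
y_new = Phi_new @ theta_new + rng.normal(scale=0.5, size=5)
post_mean, post_cov = gaussian_posterior(
    Phi_new, y_new, prior_mean, prior_cov, noise_var=0.25
)
theta_sample = rng.multivariate_normal(post_mean, post_cov)
```

In a full agent, the sampled `theta_sample` would define the value estimate used to pick actions greedily for the next episode. The sketch only shows how a prior learned from earlier tasks enters the posterior for a new task.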

Related papers