Fugu-MT Paper Translation (Abstract): Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs

Paper overview: Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs

arxiv url: http://arxiv.org/abs/2605.08053v1
Date: Fri, 08 May 2026 17:41:48 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:39.248451
Title: Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs
Authors: Gugan Thoppe, L. A. Prashanth, Ankur Naskar, Sanjay Bhat
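Abstract summary: Reinforcement learning (RL) for exponential-utility optimization in Markov decision processes lacks principled value-based algorithms. The authors derive two Q-value-style extensions and show that the associated operators are contractions in the $L_\infty$ and sup-log/Thompson metrics. They establish almost-sure convergence and finite-time convergence rates via timescale separation, and also provide a one-timescale algorithm governed by a sublinear power-law operator.
Reference score (independently computed attention score): 2.574071344130061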
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Reinforcement learning (RL) for exponential-utility optimization in discounted Markov decision processes (MDPs) lacks principled value-based algorithms. We address this gap in the fixed risk-aversion setting. Building on the Bellman-type equation for exponential utility studied in \cite{porteus1975optimality}, we derive two Q-value-style extensions and show that the associated operators are contractions in the $L_\infty$ and sup-log/Thompson metrics, respectively. We characterize their fixed points and prove that the induced greedy stationary policy is optimal for the exponential-utility objective among stationary policies. These structural results lead to two model-free algorithms: a two-timescale Q-learning-style algorithm, for which we establish almost-sure convergence and provide finite-time convergence rates via timescale separation, and a one-timescale algorithm governed by a sublinear power-law operator. Since the latter does not admit a global contraction in standard metrics, we prove its convergence using delicate arguments based on local Lipschitzness, monotonicity, homogeneity, and Dini derivatives, and provide a scalar finite-time analysis that highlights the challenges in obtaining convergence rates in the vector case. Our work provides a foundation for value-based RL under exponential-utility objectives.
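The abstract names two metrics and an exponential-utility objective without defining them. As a rough illustration only, the Python sketch below computes the $L_\infty$ distance, the sup-log (Thompson) distance on strictly positive vectors, and a sample-based exponential utility. All function names are hypothetical and do not come from the paper. The exponential utility is taken in the common certainty-equivalent form $(1/\beta)\log E[\exp(\beta G)]$ for a return $G$, which is an assumption: the paper's exact objective and normalization may differ, and none of this reproduces the paper's operators or algorithms.

```python
import math


def sup_norm_distance(x, y):
    """L-infinity distance: the largest absolute componentwise difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def thompson_distance(x, y):
    """Sup-log (Thompson) distance between strictly positive vectors:
    the largest absolute difference of componentwise logarithms."""
    if any(v <= 0 for v in list(x) + list(y)):
        raise ValueError("Thompson distance needs strictly positive entries")
    return max(abs(math.log(a) - math.log(b)) for a, b in zip(x, y))


def exponential_utility(returns, beta):
    """Certainty equivalent (1/beta) * log(mean(exp(beta * G))) over sampled
    returns G. ASSUMPTION: this normalization is a common textbook form and
    may not match the objective used in the paper."""
    if beta == 0:
        raise ValueError("beta must be nonzero")
    scaled = [beta * g for g in returns]
    shift = max(scaled)  # subtract the max before exponentiating, for stability
    log_mean = shift + math.log(sum(math.exp(s - shift) for s in scaled) / len(scaled))
    return log_mean / beta


if __name__ == "__main__":
    x, y = [1.0, 2.0], [2.0, 2.0]
    print(sup_norm_distance(x, y))
    print(thompson_distance(x, y))
    # Negative beta weights bad outcomes more heavily (risk-averse),
    # positive beta weights good outcomes more heavily (risk-seeking).
    print(exponential_utility([0.0, 2.0], -1.0))
    print(exponential_utility([0.0, 2.0], 1.0))
```

Unlike the $L_\infty$ distance, the sup-log distance is unchanged when both vectors are multiplied by the same positive constant, which is one reason it is a natural choice for positive, multiplicative quantities such as expected exponentials of returns.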


Related papers