Fugu-MT Paper Translation (Summary): Investigating Scale Independent UCT Exploration Factor Strategies

Paper summary: Investigating Scale Independent UCT Exploration Factor Strategies

arxiv url: http://arxiv.org/abs/2510.21275v1
Date: Fri, 24 Oct 2025 09:19:14 GMT
Status: Translation complete
Last updated in system: 2025-10-28 09:00:15.423407
Title: Investigating Scale Independent UCT Exploration Factor Strategies
Title (reference translation): Investigating Scale-Independent UCT Exploration Factor Strategies
Authors: Robin Schmöcker, Christoph Schnell, Alexander Dockhorn
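Abstract summary: The Upper Confidence Bounds for Trees (UCT) algorithm is not agnostic to the reward scale of the game it is applied to. This paper evaluates various strategies, called $\lambda$-strategies, for adaptively choosing the UCT exploration constant $\lambda$ in a way that is agnostic to the game's reward scale.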
Reference score (independently computed attention score): 42.13843953705695
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The Upper Confidence Bounds For Trees (UCT) algorithm is not agnostic to the reward scale of the game it is applied to. For zero-sum games with the sparse rewards of $\{-1,0,1\}$ at the end of the game, this is not a problem, but many games often feature dense rewards with hand-picked reward scales, causing a node's Q-value to span different magnitudes across different games. In this paper, we evaluate various strategies for adaptively choosing the UCT exploration constant $\lambda$, called $\lambda$-strategies, that are agnostic to the game's reward scale. These $\lambda$-strategies include those proposed in the literature as well as five new strategies. Given our experimental results, we recommend using one of our newly suggested $\lambda$-strategies, which is to choose $\lambda$ as $2 \cdot \sigma$ where $\sigma$ is the empirical standard deviation of all state-action pairs' Q-values of the search tree. This method outperforms existing $\lambda$-strategies across a wide range of tasks both in terms of a single parameter value and the peak performances obtained by optimizing all available parameters.
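Abstract (reference translation): The Upper Confidence Bounds for Trees (UCT) algorithm is not agnostic to the reward scale of the game it is applied to. For zero-sum games with sparse rewards of $\{-1,0,1\}$ at the end of the game this is not a problem, but many games feature dense rewards with hand-picked reward scales, so a node's Q-value spans different magnitudes across different games. This paper evaluates various strategies, called $\lambda$-strategies, for adaptively choosing the UCT exploration constant $\lambda$ in a way that is agnostic to the game's reward scale. These $\lambda$-strategies include those proposed in the literature as well as five new strategies. Based on the experimental results, the authors recommend one of the newly suggested $\lambda$-strategies: choosing $\lambda$ as $2 \cdot \sigma$, where $\sigma$ is the empirical standard deviation of the Q-values of all state-action pairs in the search tree. This method outperforms existing $\lambda$-strategies across a wide range of tasks, both for a single parameter value and in the peak performance obtained by optimizing all available parameters.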
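The recommended rule, $\lambda = 2 \cdot \sigma$, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node` class, the helper names, the `fallback` value used when the tree holds too few Q-values, the use of the population standard deviation, and the common selection score $Q + \lambda \sqrt{\ln N / n}$ are all assumptions made for illustration.

```python
import math
import statistics
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical search-tree node holding per-action statistics."""
    visits: int = 0
    n: dict = field(default_factory=dict)         # action -> visit count
    q: dict = field(default_factory=dict)         # action -> mean return (Q-value)
    children: dict = field(default_factory=dict)  # action -> child Node


def all_q_values(root):
    """Collect the Q-values of every visited state-action pair in the tree."""
    values, stack = [], [root]
    while stack:
        node = stack.pop()
        values.extend(q for a, q in node.q.items() if node.n.get(a, 0) > 0)
        stack.extend(node.children.values())
    return values


def lambda_two_sigma(root, factor=2.0, fallback=1.0):
    """Return factor * sigma, with sigma the std. dev. of all Q-values in the tree.

    Assumption: population standard deviation; `fallback` is returned while
    sigma is undefined or zero (fewer than two Q-values, or all equal).
    """
    qs = all_q_values(root)
    sigma = statistics.pstdev(qs) if len(qs) >= 2 else 0.0
    return factor * sigma if sigma > 0.0 else fallback


def uct_select(node, lam):
    """Pick an action by the assumed UCT score Q + lam * sqrt(ln N / n)."""
    for a in node.q:
        if node.n.get(a, 0) == 0:
            return a  # try unvisited actions first
    log_n = math.log(node.visits)
    return max(node.q, key=lambda a: node.q[a] + lam * math.sqrt(log_n / node.n[a]))


def build_demo_tree(reward_scale=1.0):
    """Tiny hand-built tree whose Q-values are multiplied by reward_scale."""
    child = Node(visits=7, n={"x": 4, "y": 3},
                 q={"x": 0.6 * reward_scale, "y": 0.1 * reward_scale})
    return Node(visits=10, n={"a": 7, "b": 3},
                q={"a": 0.5 * reward_scale, "b": 0.3 * reward_scale},
                children={"a": child})


if __name__ == "__main__":
    for scale in (1.0, 1000.0):
        root = build_demo_tree(scale)
        print(scale, uct_select(root, 1.0), uct_select(root, lambda_two_sigma(root)))
```

Because $\sigma$ scales linearly with the rewards, multiplying every reward by a constant multiplies both the Q-value term and the exploration term by that constant, so the selected action does not change. In the toy tree above, a fixed $\lambda = 1$ picks different root actions at the two reward scales, whereas the $2 \cdot \sigma$ rule picks the same one.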

Related papers

Grouped Satisficing Paths in Pure Strategy Games: a Topological Perspective [15.76917401735207]
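A principle widely adopted in MARL algorithms is "win-stay, lose-shift", which dictates that an agent keeps its current strategy if it achieves the best response. This paper establishes a sufficient condition for such a property and shows that any finite-state Markov game, as well as any $N$-player game, guarantees the existence of a finite-length satisficing path.
Paper reference translation (metadata) (2025-09-27T07:07:27Z)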
Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback [49.84060509296641]
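We study online finite-horizon Markov decision processes with adversarially changing losses and aggregate bandit feedback (also known as full-bandit feedback). Under this type of feedback, the agent observes only the total loss incurred over the entire trajectory, rather than the individual losses at each intermediate step within the trajectory. We introduce the first policy optimization algorithms for this setting.
Paper reference translation (metadata) (2025-02-06T12:03:24Z)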
Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves $\tilde{O}\big((d+\log(|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping that linearly parametrizes the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
Paper reference translation (metadata) (2023-05-15T05:37:32Z)
Provably Efficient Offline Multi-agent Reinforcement Learning via Strategy-wise Bonus [48.34563955829649]
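This paper proposes a strategy-wise concentration principle that builds confidence intervals for joint strategies. For two-player zero-sum Markov games, we propose an efficient algorithm that exploits the convexity of the strategy-wise bonus. All of the algorithms can take a pre-specified strategy class $\Pi$ as input and output a strategy close to the best strategy in $\Pi$.
Paper reference translation (metadata) (2022-06-01T00:18:15Z)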
Gap-Dependent Unsupervised Exploration for Reinforcement Learning [40.990467706237396]
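We present an efficient algorithm for task-agnostic reinforcement learning. The algorithm requires only $\widetilde{\mathcal{O}}\big(1/\epsilon \cdot (H^3 S A / \rho + H^4 S^2 A)\big)$ episodes of exploration. We show that, information-theoretically, this bound is nearly tight for $\rho < \Theta(1/(HS))$ and $H>1$.
Paper reference translation (metadata) (2021-08-11T20:42:46Z)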
Randomized Exploration for Reinforcement Learning with General Value Function Approximation [122.70803181751135]
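We propose a model-free reinforcement learning algorithm inspired by the randomized least-squares value iteration (RLSVI) algorithm. The algorithm drives exploration by simply perturbing the training data with scalar noise. We complement the theory with an empirical evaluation across known difficult exploration tasks.
Paper reference translation (metadata) (2021-06-15T02:23:07Z)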
Near-Optimal Reinforcement Learning with Self-Play [50.29853537456737]
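We focus on self-play algorithms, which learn the optimal policy by playing against themselves without any direct supervision. We present an optimistic variant of the Nash Q-learning algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new \emph{Nash V-learning} algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$.
Paper reference translation (metadata) (2020-06-22T05:00:13Z)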

The related papers list is generated automatically from the titles and abstracts of papers on this site.

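This page shows information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from the use of this site (including all information and translations).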