Fugu-MT Paper Translation (Abstract): Learning When to Switch: Adaptive Policy Selection via Reinforcement Learning

Paper summary: Learning When to Switch: Adaptive Policy Selection via Reinforcement Learning

arxiv url: http://arxiv.org/abs/2512.06250v1
Date: Sat, 06 Dec 2025 02:50:32 GMT
Status: Translation complete
Last updated in system: 2025-12-09 22:03:54.262364
Title: Learning When to Switch: Adaptive Policy Selection via Reinforcement Learning
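Title (reference translation): Learning When to Switch: Adaptive Policy Selection via Reinforcement Learning
Authors: Chris Tava
Abstract summary: This research shows how an agent can dynamically transition between systematic exploration (coverage) and goal-directed pathfinding (convergence) to improve task performance. Unlike fixed-threshold approaches, the agent uses Q-learning to adapt its switching behavior based on coverage percentage and distance to goal. Results show 23-55% improvements in completion time, an 83% reduction in runtime variance, and a 71% improvement in worst-case scenarios.
Reference score (independently computed attention score): 0.0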
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autonomous agents often require multiple strategies to solve complex tasks, but determining when to switch between strategies remains challenging. This research introduces a reinforcement learning technique to learn switching thresholds between two orthogonal navigation policies. Using maze navigation as a case study, this work demonstrates how an agent can dynamically transition between systematic exploration (coverage) and goal-directed pathfinding (convergence) to improve task performance. Unlike fixed-threshold approaches, the agent uses Q-learning to adapt switching behavior based on coverage percentage and distance to goal, requiring only minimal domain knowledge: maze dimensions and target location. The agent does not require prior knowledge of wall positions, optimal threshold values, or hand-crafted heuristics; instead, it discovers effective switching strategies dynamically during each run. The agent discretizes its state space into coverage and distance buckets, then adapts which coverage threshold (20-60%) to apply based on observed progress signals. Experiments across 240 test configurations (4 maze sizes from 16×16 to 128×128 × 10 unique mazes × 6 agent variants) demonstrate that adaptive threshold learning outperforms both single-strategy agents and fixed 40% threshold baselines. Results show 23-55% improvements in completion time, 83% reduction in runtime variance, and 71% improvement in worst-case scenarios. The learned switching behavior generalizes within each size class to unseen wall configurations. Performance gains scale with problem complexity: 23% improvement for 16×16 mazes, 34% for 32×32, and 55% for 64×64, demonstrating that as the space of possible maze structures grows, the value of adaptive policy selection over fixed heuristics increases proportionally.
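Abstract (reference translation): Autonomous agents often need multiple strategies to solve complex tasks, but deciding when to switch between strategies remains difficult. This work proposes a reinforcement learning technique for learning switching thresholds between two orthogonal navigation policies. Using maze navigation as a case study, it shows how an agent can dynamically transition between systematic exploration (coverage) and goal-directed pathfinding (convergence) to improve task performance. Unlike fixed-threshold approaches, the agent uses Q-learning to adapt its switching behavior based on coverage percentage and distance to goal. The agent requires no prior knowledge of wall positions, optimal threshold values, or hand-crafted heuristics. It discretizes its state space into coverage and distance buckets and adapts which coverage threshold (20-60%) to apply based on observed progress signals. Experiments across 240 test configurations (4 maze sizes from 16×16 to 128×128 × 10 unique mazes × 6 agent variants) show that adaptive threshold learning outperforms both single-strategy agents and fixed 40% threshold baselines. Results show 23-55% improvements in completion time, an 83% reduction in runtime variance, and a 71% improvement in worst-case scenarios. The learned switching behavior generalizes within each size class to unseen wall configurations. Performance gains scale with problem complexity: 23% improvement for 16×16 mazes, 34% for 32×32, and 55% for 64×64.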
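The switching mechanism described in the abstract can be illustrated with a minimal tabular Q-learning sketch. This is not the author's implementation: the number of buckets, the 10-point spacing of the candidate thresholds, the hyperparameters, and every name below (`ThresholdLearner`, `make_state`, `active_policy`) are assumptions made for illustration. Only the state features (coverage and distance to goal), the 20-60% threshold range, and the use of Q-learning come from the abstract.

```python
import random
from collections import defaultdict

# Candidate coverage thresholds. The 20-60% range is from the abstract;
# the 10-point step between candidates is an assumption.
THRESHOLDS = [0.2, 0.3, 0.4, 0.5, 0.6]
N_BUCKETS = 5  # assumed number of buckets per state dimension


def bucket(value, max_value, n_buckets=N_BUCKETS):
    """Map a value in [0, max_value] to an integer bucket in [0, n_buckets - 1]."""
    frac = min(max(value / max_value, 0.0), 1.0)
    return min(int(frac * n_buckets), n_buckets - 1)


def make_state(coverage, distance, max_distance):
    """Discretize (coverage fraction, distance to goal) into a Q-table key."""
    return (bucket(coverage, 1.0), bucket(distance, max_distance))


class ThresholdLearner:
    """Tabular Q-learning over (state, threshold index) pairs (hypothetical helper)."""

    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        # Unseen states start with a zero value for every candidate threshold.
        self.q = defaultdict(lambda: [0.0] * len(THRESHOLDS))
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def choose(self, state):
        """Epsilon-greedy choice of a threshold index for this state."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(THRESHOLDS))
        values = self.q[state]
        return values.index(max(values))

    def update(self, state, action, reward, next_state):
        """Standard one-step Q-learning update."""
        target = reward + self.gamma * max(self.q[next_state])
        self.q[state][action] += self.alpha * (target - self.q[state][action])


def active_policy(coverage, threshold):
    """Explore until coverage reaches the threshold, then head for the goal."""
    return "coverage" if coverage < threshold else "convergence"
```

In a run, the agent would call `make_state` on its current progress, pick a threshold with `choose`, act under `active_policy`, and feed an observed reward back through `update`. The reward signal itself is not specified in the abstract, so it is left to the caller here.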

Related papers