Fugu-MT Paper Translation (Abstract): Learning When to Act: Interval-Aware Reinforcement Learning with Predictive Temporal Structure

Paper abstract: Learning When to Act: Interval-Aware Reinforcement Learning with Predictive Temporal Structure

arxiv url: http://arxiv.org/abs/2603.22384v2
Date: Thu, 26 Mar 2026 16:30:24 GMT
Status: Translation complete
System update date: 2026-03-27 18:28:14.971934
Title: Learning When to Act: Interval-Aware Reinforcement Learning with Predictive Temporal Structure
Authors: Davide Di Gioia
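Abstract summary: This paper proposes a lightweight temporal control system that learns the optimal interval between cognitive ticks from experience. It also proposes an interval-aware reward that explicitly penalises inefficiency relative to the chosen wait time.
Reference score (independently computed attention measure): 0.0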
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autonomous agents operating in continuous environments must decide not only what to do, but when to act. We introduce a lightweight adaptive temporal control system that learns the optimal interval between cognitive ticks from experience, replacing ad hoc biologically inspired timers with a principled learned policy. The policy state is augmented with a predictive hyperbolic spread signal (a "curvature signal" shorthand) derived from hyperbolic geometry: the mean pairwise Poincare distance among n sampled futures embedded in the Poincare ball. High spread indicates a branching, uncertain future and drives the agent to act sooner; low spread signals predictability and permits longer rest intervals. We further propose an interval-aware reward that explicitly penalises inefficiency relative to the chosen wait time, correcting a systematic credit-assignment failure of naive outcome-based rewards in timing problems. We additionally introduce a joint spatio-temporal embedding (ATCPG-ST) that concatenates independently normalised state and position projections in the Poincare ball; spatial trajectory divergence provides an independent timing signal unavailable to the state-only variant (ATCPG-SO). This extension raises mean hyperbolic spread (kappa) from 1.88 to 3.37 and yields a further 5.8 percent efficiency gain over the state-only baseline. Ablation experiments across five random seeds demonstrate that (i) learning is the dominant efficiency factor (54.8 percent over no-learning), (ii) hyperbolic spread provides significant complementary gain (26.2 percent over geometry-free control), (iii) the combined system achieves 22.8 percent efficiency over the fixed-interval baseline, and (iv) adding spatial position information to the spread embedding yields an additional 5.8 percent.
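
The spread signal described in the abstract (the mean pairwise Poincare distance among n sampled futures) can be illustrated with a minimal sketch. It uses the standard Poincare-ball distance, d(u, v) = arcosh(1 + 2|u - v|^2 / ((1 - |u|^2)(1 - |v|^2))). The abstract does not say how futures are sampled or embedded, so the sample points below are made up, and the function names `poincare_distance` and `hyperbolic_spread` are hypothetical rather than taken from the paper.

```python
import math
from itertools import combinations


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def hyperbolic_spread(points):
    """Mean pairwise Poincare distance among the embedded futures."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0  # fewer than two futures: no spread to measure
    return sum(poincare_distance(u, v) for u, v in pairs) / len(pairs)


# Illustrative embeddings (assumed, not from the paper).
tight = [(0.10, 0.00), (0.11, 0.01), (0.09, -0.01)]     # futures that agree
branching = [(0.80, 0.00), (-0.80, 0.00), (0.00, 0.80)]  # futures that diverge

print(hyperbolic_spread(tight) < hyperbolic_spread(branching))
```

Under the abstract's reading, the larger spread of the `branching` set would push the agent to act sooner, while the small spread of the `tight` set would permit a longer rest interval. How spread maps to an actual interval is learned by the policy and is not sketched here.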

Related papers