Fugu-MT Paper Translation (Abstract): When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning?

Paper overview: When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning?

arxiv url: http://arxiv.org/abs/2606.18531v1
Date: Tue, 16 Jun 2026 22:55:45 GMT
Status: Translation complete
System update date: 2026-06-18 17:16:50.926303
Title: When Does Trajectory-Level Supervision Permit Efficient Offline Reinforcement Learning?
Title (reference translation): Is Trajectory-Level Supervision Effective for Offline Reinforcement Learning?
Authors: Xuanfei Ren, Tengyang Xie,
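Abstract summary: We develop a statistical theory of offline policy optimization from outcome-level supervision. OPAC is a pessimistic actor-critic algorithm that learns a reward model and optimizes a policy from trajectory-level labels. We also study generalized outcome-based offline RL, in which the trajectory-level quantities are induced by a nonlinear aggregation of per-step rewards.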
Reference score (independently computed attention score): 10.71573326176278
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Offline reinforcement learning is typically analyzed under process-level reward supervision, yet many sequential decision datasets record only trajectory-level outcomes. We develop a statistical theory for offline policy optimization from such outcome-level supervision. We first study the canonical setting where the target remains the expected cumulative reward, but each offline trajectory provides only a scalar label whose conditional mean is the cumulative return. We propose OPAC, a pessimistic actor-critic algorithm that learns a latent reward model and optimizes a policy from trajectory-level labels. We prove a high-probability guarantee of order $\widetilde O(H^2\sqrt{C_{sa}(π^\star)/n})$ and a matching lower bound, characterizing the sharp statistical cost of replacing process-level rewards with one trajectory-level label. We then extend the principle to preference-based feedback, preserving the leading horizon and concentrability dependence up to preference-model constants. Finally, we study generalized outcome-based offline RL, where both the supervision and the objective are trajectory-level quantities induced by a nonlinear aggregation of latent per-step rewards. This problem is not learnable in general: for all-success objectives, any offline learner may require $Ω(2^H)$ trajectories even with deterministic transitions and constant concentrability. We then identify a tractable regime through two structural coefficients, $κ_μ(σ)$ and $χ_μ(σ)$, capturing information loss in outcome aggregation and generalized Bellman updates, under which generalized OPAC achieves polynomial sample complexity. Together, our results delineate when outcome-level supervision enables sample-efficient offline control and when missing process-level rewards create fundamental statistical barriers.
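Abstract (reference translation): Offline reinforcement learning is usually analyzed under process-level reward supervision, but many sequential decision datasets record only trajectory-level outcomes. We develop a statistical theory of offline policy optimization from such outcome-level supervision. We first consider the canonical setting in which the target is still the expected cumulative reward, but each offline trajectory provides only a scalar label whose conditional mean is the cumulative return. We propose OPAC, a pessimistic actor-critic algorithm that learns a latent reward model and optimizes a policy from trajectory-level labels. We prove a high-probability guarantee of order $\widetilde O(H^2\sqrt{C_{sa}(π^\star)/n})$ together with a matching lower bound, which characterizes the statistical cost of replacing process-level rewards with a single trajectory-level label. Next, we extend this principle to preference-based feedback, preserving the leading horizon and concentrability dependence up to preference-model constants. Finally, we consider generalized outcome-based offline RL, in which both the supervision and the objective are trajectory-level quantities induced by a nonlinear aggregation of latent per-step rewards. This problem is not learnable in general: for all-success objectives, any offline learner may need $Ω(2^H)$ trajectories, even with deterministic transitions and constant concentrability. We then identify a tractable regime via two structural coefficients, $κ_μ(σ)$ and $χ_μ(σ)$, which capture the information loss in outcome aggregation and in generalized Bellman updates, and under which generalized OPAC attains polynomial sample complexity. Our results show when outcome-level supervision enables sample-efficient offline control and when the absence of process-level rewards creates fundamental statistical barriers.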
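
The canonical setting in the abstract (one scalar label per trajectory whose conditional mean is the cumulative return) can be illustrated with a toy sketch. This is not OPAC: it has no pessimism and no actor-critic loop, and it only shows how a latent per-step reward might be recovered from trajectory-level labels by least squares. The tabular setup, the uniform sampling of state-action indices in place of real transition dynamics, and the names `simulate_labels` and `fit_latent_reward` are all assumptions made for illustration.

```python
import random


def simulate_labels(true_reward, n, horizon, noise_std=0.0, seed=0):
    """Sample n toy trajectories and one scalar label per trajectory.

    A trajectory is a list of `horizon` state-action indices drawn uniformly
    (an assumption; no transition dynamics are modeled). The label's
    conditional mean is the trajectory's cumulative reward.
    """
    rng = random.Random(seed)
    k = len(true_reward)
    data = []
    for _ in range(n):
        traj = [rng.randrange(k) for _ in range(horizon)]
        cumulative_return = sum(true_reward[i] for i in traj)
        label = cumulative_return + rng.gauss(0.0, noise_std)
        data.append((traj, label))
    return data


def fit_latent_reward(data, k, lr=0.05, steps=500):
    """Fit per-state-action rewards by gradient descent on the squared error
    between each trajectory's summed predicted rewards and its label."""
    r_hat = [0.0] * k
    n = len(data)
    for _ in range(steps):
        grad = [0.0] * k
        for traj, label in data:
            residual = sum(r_hat[i] for i in traj) - label
            # Each visit to index i contributes once to the gradient.
            for i in traj:
                grad[i] += 2.0 * residual / n
        r_hat = [r - lr * g for r, g in zip(r_hat, grad)]
    return r_hat


# Hypothetical latent rewards for four state-action pairs.
true_reward = [0.0, 0.25, 0.5, 1.0]
data = simulate_labels(true_reward, n=200, horizon=3)
r_hat = fit_latent_reward(data, k=len(true_reward))
max_error = max(abs(a - b) for a, b in zip(r_hat, true_reward))
```

With noiseless labels and enough varied trajectories, the per-step rewards are identifiable from the sums alone in this toy example. Setting `noise_std` above zero gives labels that match the return only in conditional mean, as in the setting the abstract describes.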

