Fugu-MT Paper Translation (Abstract): Behavior-Adaptive Q-Learning: A Unifying Framework for Offline-to-Online RL

Paper overview: Behavior-Adaptive Q-Learning: A Unifying Framework for Offline-to-Online RL

arxiv url: http://arxiv.org/abs/2511.03695v1
Date: Wed, 05 Nov 2025 18:20:23 GMT
Status: Translation complete
System update date: 2025-11-06 18:19:32.517782
Title: Behavior-Adaptive Q-Learning: A Unifying Framework for Offline-to-Online RL
Authors: Lipeng Zu, Hansong Zhou, Xiaonan Zhang
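Abstract summary: This paper introduces Behavior-Adaptive Q-Learning (BAQ), a framework that enables a smooth transition from offline to online RL. BAQ incorporates a dual-objective loss that (i) aligns the online policy toward the offline behavior when uncertainty is high, and (ii) gradually relaxes this constraint as more confident online experience is accumulated. Across standard benchmarks, BAQ consistently outperforms prior offline-to-online RL approaches, achieving faster recovery, improved robustness, and higher overall performance.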
Reference score (independently computed attention score): 3.2883573376133555
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Offline reinforcement learning (RL) enables training from fixed data without online interaction, but policies learned offline often struggle when deployed in dynamic environments due to distributional shift and unreliable value estimates on unseen state-action pairs. We introduce Behavior-Adaptive Q-Learning (BAQ), a framework designed to enable a smooth and reliable transition from offline to online RL. The key idea is to leverage an implicit behavioral model derived from offline data to provide a behavior-consistency signal during online fine-tuning. BAQ incorporates a dual-objective loss that (i) aligns the online policy toward the offline behavior when uncertainty is high, and (ii) gradually relaxes this constraint as more confident online experience is accumulated. This adaptive mechanism reduces error propagation from out-of-distribution estimates, stabilizes early online updates, and accelerates adaptation to new scenarios. Across standard benchmarks, BAQ consistently outperforms prior offline-to-online RL approaches, achieving faster recovery, improved robustness, and higher overall performance. Our results demonstrate that implicit behavior adaptation is a principled and practical solution for reliable real-world policy deployment.
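The abstract describes the dual-objective loss only in words, so the following is a minimal illustrative sketch of the general idea, not the paper's actual formulation. It assumes a squared TD error as the learning term, a squared distance between the policy action and an action from a behavior model as the behavior-consistency term, and a hypothetical weight that grows with uncertainty and decays exponentially with the number of online steps. The names `behavior_weight`, `dual_objective_loss`, and `decay_steps` are all invented for this example.

```python
import math


def behavior_weight(uncertainty, online_steps, decay_steps=10_000.0):
    """Hypothetical weight in [0, 1] for the behavior-consistency term.

    Assumption: the weight is high when uncertainty is high and shrinks
    as online experience accumulates. The abstract does not give a formula.
    """
    u = min(max(uncertainty, 0.0), 1.0)          # clamp uncertainty to [0, 1]
    decay = math.exp(-online_steps / decay_steps)  # relax constraint over time
    return u * decay


def dual_objective_loss(q_pred, td_target, policy_action, behavior_action,
                        uncertainty, online_steps):
    """Assumed loss: TD error plus adaptively weighted behavior consistency."""
    td_loss = (q_pred - td_target) ** 2
    # Squared distance between the online policy's action and the action
    # suggested by the (assumed) behavior model fitted on offline data.
    bc_loss = sum((a - b) ** 2 for a, b in zip(policy_action, behavior_action))
    return td_loss + behavior_weight(uncertainty, online_steps) * bc_loss


# Early in online fine-tuning, with high uncertainty, the behavior term counts.
early = dual_objective_loss(1.0, 1.5, [0.2, 0.4], [0.0, 0.0],
                            uncertainty=1.0, online_steps=0)
# After much online experience the constraint is effectively relaxed.
late = dual_objective_loss(1.0, 1.5, [0.2, 0.4], [0.0, 0.0],
                           uncertainty=1.0, online_steps=1_000_000_000)
```

Under these assumptions, `early` includes the full behavior-consistency penalty while `late` reduces to the TD term alone, which mirrors points (i) and (ii) of the abstract. How the paper estimates uncertainty and schedules the relaxation is not stated in the abstract.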

Related papers