Fugu-MT Paper Translation (Abstract): FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

Paper summary: FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

arxiv url: http://arxiv.org/abs/2604.04539v1
Date: Mon, 06 Apr 2026 09:03:41 GMT
Status: Translation complete
Last updated in system: 2026-04-07 15:49:19.154464
Title: FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
Authors: Donghu Kim, Youngdo Lee, Minho Park, Kinam Kim, I Made Aswin Nahendra, Takuma Seno, Sehee Min, Daniel Palenicek, Florian Vogt, Danica Kragic, Jan Peters, Jaegul Choo, Hojoon Lee
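Abstract summary: Reinforcement learning (RL) is a core approach for robot control when expert demonstrations are unavailable. We present FlashSAC, a fast and stable off-policy RL algorithm built on Soft Actor-Critic. Across over 60 tasks in 10 simulators, FlashSAC consistently outperforms PPO and strong off-policy baselines in both final performance and training efficiency.
Reference score (attention score computed independently by this site): 55.38832429564216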
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning (RL) is a core approach for robot control when expert demonstrations are unavailable. On-policy methods such as Proximal Policy Optimization (PPO) are widely used for their stability, but their reliance on narrowly distributed on-policy data limits accurate policy evaluation in high-dimensional state and action spaces. Off-policy methods can overcome this limitation by learning from a broader state-action distribution, yet suffer from slow convergence and instability, as fitting a value function over diverse data requires many gradient updates, causing critic errors to accumulate through bootstrapping. We present FlashSAC, a fast and stable off-policy RL algorithm built on Soft Actor-Critic. Motivated by scaling laws observed in supervised learning, FlashSAC sharply reduces gradient updates while compensating with larger models and higher data throughput. To maintain stability at increased scale, FlashSAC explicitly bounds weight, feature, and gradient norms, curbing critic error accumulation. Across over 60 tasks in 10 simulators, FlashSAC consistently outperforms PPO and strong off-policy baselines in both final performance and training efficiency, with the largest gains on high-dimensional tasks such as dexterous manipulation. In sim-to-real humanoid locomotion, FlashSAC reduces training time from hours to minutes, demonstrating the promise of off-policy RL for sim-to-real transfer.
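The abstract says FlashSAC bounds weight, feature, and gradient norms but does not say how. The following is a minimal pure-Python sketch of one generic way to bound each quantity, offered only as an illustration. Every function name, threshold, and the choice of L2 rescaling is an assumption, and nothing here is taken from the paper's implementation.

```python
import math


def l2_norm(vec):
    """Euclidean (L2) norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def clip_to_norm(vec, max_norm):
    """Rescale vec so its L2 norm is at most max_norm; leave it unchanged otherwise."""
    norm = l2_norm(vec)
    if norm <= max_norm or norm == 0.0:
        return list(vec)
    scale = max_norm / norm
    return [x * scale for x in vec]


def normalize_features(features, eps=1e-8):
    """Bound a feature vector by projecting it to (approximately) unit L2 norm.

    Assumption: unit-norm projection is one possible way to bound features.
    """
    norm = l2_norm(features)
    return [x / (norm + eps) for x in features]


def bounded_sgd_step(weights, grad, lr, max_grad_norm, max_weight_norm):
    """One plain SGD step with hypothetical gradient-norm and weight-norm bounds."""
    clipped_grad = clip_to_norm(grad, max_grad_norm)      # bound the gradient norm
    updated = [w - lr * g for w, g in zip(weights, clipped_grad)]
    return clip_to_norm(updated, max_weight_norm)         # bound the weight norm


if __name__ == "__main__":
    new_weights = bounded_sgd_step(
        weights=[1.0, 0.0],
        grad=[30.0, 40.0],
        lr=0.1,
        max_grad_norm=5.0,
        max_weight_norm=1.0,
    )
    print(new_weights, l2_norm(new_weights))
```

In this sketch the gradient is rescaled before the update and the weights are rescaled after it, so no single step can move the parameters outside a fixed ball. This is the general idea behind bounding norms to limit error growth, not a description of the specific mechanism in FlashSAC.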

Related papers