Fugu-MT Paper Translation (Abstract): Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty

Paper abstract: Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty

arxiv url: http://arxiv.org/abs/2602.18312v1
Date: Fri, 20 Feb 2026 16:11:19 GMT
Status: translation complete
System update date: 2026-03-23 08:17:41.598284
Title: Learning Smooth Time-Varying Linear Policies with an Action Jacobian Penalty
Authors: Zhaoming Xie, Kevin Karol, Jessica Hodgins
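Abstract summary: Reinforcement learning provides a framework for learning control policies that can reproduce diverse motions for simulated characters. Existing work addresses the unnatural high-frequency control signals such policies often produce by adding a reward term that penalizes large changes in actions over time. This paper proposes the action Jacobian penalty, which penalizes changes in action with respect to changes in the simulated state directly through automatic differentiation.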
Reference score (independently computed attention score): 1.8122712065585906
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning provides a framework for learning control policies that can reproduce diverse motions for simulated characters. However, such policies often exploit unnatural high-frequency signals that are unachievable by humans or physical robots, making them poor representations of real-world behaviors. Existing work addresses this issue by adding a reward term that penalizes a large change in actions over time. This term often requires substantial tuning efforts. We propose to use the action Jacobian penalty, which penalizes changes in action with respect to the changes in simulated state directly through automatic differentiation. This effectively eliminates unrealistic high-frequency control signals without task-specific tuning. While effective, the action Jacobian penalty introduces significant computational overhead when used with traditional fully connected neural network architectures. To mitigate this, we introduce a new architecture called a Linear Policy Net (LPN) that significantly reduces the computational burden for calculating the action Jacobian penalty during training. In addition, an LPN requires no parameter tuning, exhibits faster learning convergence compared to baseline methods, and can be more efficiently queried during inference time compared to a fully connected neural network. We demonstrate that a Linear Policy Net, combined with the action Jacobian penalty, is able to learn policies that generate smooth signals while solving a number of motion imitation tasks with different characteristics, including dynamic motions such as a backflip and various challenging parkour skills. Finally, we apply this approach to create policies for dynamic motions on a physical quadrupedal robot equipped with an arm.
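The action Jacobian penalty described in the abstract can be illustrated with a minimal JAX sketch. This is not the authors' implementation: the network sizes, the squared Frobenius norm as the penalty, the `weight` value, and the mean-squared-error stand-in for the RL objective are all assumptions made for illustration, and the function names (`init_mlp`, `mlp_policy`, `jacobian_penalty`, `loss`) are hypothetical.

```python
import jax
import jax.numpy as jnp


def init_mlp(key, sizes):
    """Random weights for a small fully connected policy (sizes are hypothetical)."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        w = jax.random.normal(sub, (n_in, n_out)) / jnp.sqrt(n_in)
        params.append((w, jnp.zeros(n_out)))
    return params


def mlp_policy(params, state):
    """Map one state vector to one action vector."""
    x = state
    for w, b in params[:-1]:
        x = jnp.tanh(x @ w + b)
    w, b = params[-1]
    return x @ w + b


def jacobian_penalty(params, states):
    """Mean squared Frobenius norm of d(action)/d(state) over a batch (assumed form)."""
    jac_fn = jax.jacrev(mlp_policy, argnums=1)  # Jacobian w.r.t. the state
    jacs = jax.vmap(jac_fn, in_axes=(None, 0))(params, states)  # (batch, act, obs)
    return jnp.mean(jnp.sum(jacs ** 2, axis=(1, 2)))


def loss(params, states, target_actions, weight=0.1):
    """Stand-in task loss (MSE, not an RL objective) plus the weighted penalty."""
    actions = jax.vmap(mlp_policy, in_axes=(None, 0))(params, states)
    task = jnp.mean((actions - target_actions) ** 2)
    return task + weight * jacobian_penalty(params, states)


params = init_mlp(jax.random.PRNGKey(0), [6, 32, 3])
states = jax.random.normal(jax.random.PRNGKey(1), (8, 6))
targets = jnp.zeros((8, 3))

penalty = jacobian_penalty(params, states)
grads = jax.grad(loss)(params, states, targets)  # differentiates through the Jacobian
```

Taking the gradient of `loss` differentiates through the Jacobian itself, which hints at the overhead the abstract attributes to fully connected networks. For a policy that is linear in the state, a = K s + k, the Jacobian with respect to s is simply K, so no extra differentiation pass is needed; that is presumably part of why a linear structure makes the penalty cheaper, although the abstract does not spell out how the LPN is built.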

Related papers