Fugu-MT paper translation (abstract): Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation

Paper abstract: Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation

arxiv url: http://arxiv.org/abs/2605.13554v1
Date: Wed, 13 May 2026 13:58:49 GMT
Status: translation complete
System update date: 2026-05-14 23:30:28.089968
Title: Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation
Title (reference translation): Self-supervised on-policy reinforcement learning via contrastive proximal policy optimisation
Authors: Asim Osman, Sasha Abramowitz, Mark Bergh, Ulrich Armel Mbou Sob, Ruan John de Kock, Omayma Mahjoub, Oussama Hidaoui, Noah De Nicola, Arnol Manuel Fokam, Felix Chalumeau, Daniel Rajaonarivonivelomanantsoa, Siddarth Singh, Refiloe Shabe, Juan Claude Formanek, Simon Verster Du Toit, Arnu Pretorius
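Abstract summary: We introduce Contrastive Proximal Policy Optimisation (CPPO). CPPO is an on-policy contrastive RL algorithm that derives policy advantages directly from contrastive Q-values. We evaluate CPPO across continuous and discrete, single-agent and cooperative multi-agent tasks.
Reference score (independently computed attention score): 3.8479372725359418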
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Contrastive reinforcement learning (CRL) learns goal-conditioned Q-values through a contrastive objective over state-action and goal representations, removing the need for hand-crafted reward functions. Despite impressive success in achieving viable self-supervised learning in RL, all existing CRL algorithms rely on off-policy optimisation and are mostly constrained to continuous action spaces, with little research invested in discrete environments. This leaves CRL disconnected from widely used and effective, modern on-policy training pipelines adopted across both single-agent and multi-agent RL in continuous and discrete environments. To establish a first connection, we introduce Contrastive Proximal Policy Optimisation (CPPO). CPPO is an on-policy contrastive RL algorithm that derives policy advantages directly from contrastive Q-values and optimises them via the standard PPO objective, without requiring a reward function or a replay buffer. We evaluate CPPO across continuous and discrete, single-agent and cooperative multi-agent tasks. Whilst the existence of an on-policy approach is inherently useful, we observe that CPPO not only significantly outperforms the previous CRL baselines in 14 out of 18 tasks, but also matches or exceeds PPO's performance, which uses hand-crafted dense rewards, in 12 out of the 18 tasks tested.
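Abstract (reference translation): Contrastive reinforcement learning (CRL) learns goal-conditioned Q-values through a contrastive objective over state-action and goal representations, which removes the need for hand-crafted reward functions. Although CRL has made self-supervised learning in RL viable, all existing CRL algorithms rely on off-policy optimisation and are mostly limited to continuous action spaces, with little work on discrete environments. As a result, CRL is cut off from the widely used, effective, modern on-policy training pipelines adopted in both single-agent and multi-agent RL across continuous and discrete environments. To establish a first connection, the authors introduce Contrastive Proximal Policy Optimisation (CPPO). CPPO is an on-policy contrastive RL algorithm that derives policy advantages directly from contrastive Q-values and optimises them with the standard PPO objective, without a reward function or a replay buffer. CPPO is evaluated on continuous and discrete, single-agent and cooperative multi-agent tasks. Beyond the inherent usefulness of an on-policy approach, CPPO significantly outperforms previous CRL baselines in 14 of 18 tasks and matches or exceeds the performance of PPO, which uses hand-crafted dense rewards, in 12 of the 18 tasks.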
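The abstract names three ingredients: a contrastive objective over state-action and goal representations, advantages derived from the resulting Q-values, and the standard PPO objective. The following is a minimal pure-Python sketch of how those pieces could fit together, not the authors' implementation. Several details are assumptions made for illustration and are not stated in the abstract: the inner-product critic, the InfoNCE form of the contrastive loss, the policy-weighted mean of Q-values used as the baseline for a discrete action set, and all function names.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def infonce_loss(sa_embs, goal_embs):
    """InfoNCE loss over a batch (assumed form of the contrastive objective).

    Row i of sa_embs is a state-action embedding and row i of goal_embs is
    the embedding of the goal it actually reached; all other goals in the
    batch act as negatives. The critic is the inner product (assumption).
    """
    total = 0.0
    for i, sa in enumerate(sa_embs):
        logits = [dot(sa, g) for g in goal_embs]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(sa_embs)


def advantages_from_q(q_values, policy_probs):
    """Advantage per discrete action: Q(s, a, g) minus the policy-weighted
    mean of Q over actions. This baseline is an illustrative assumption."""
    baseline = sum(p * q for p, q in zip(policy_probs, q_values))
    return [q - baseline for q in q_values]


def ppo_clip_objective(new_prob, old_prob, advantage, eps=0.2):
    """Standard PPO clipped surrogate for a single sample (to be maximised)."""
    ratio = new_prob / old_prob
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    # Toy numbers only; they do not come from the paper.
    q = [1.0, 3.0]                      # contrastive Q-values for two actions
    pi = [0.5, 0.5]                     # current policy probabilities
    adv = advantages_from_q(q, pi)
    print(adv)
    print(ppo_clip_objective(0.6, 0.3, adv[1]))
```

In this sketch no reward signal appears anywhere: the only learning signal for the policy is the critic's score for a state-action pair against a goal, which is the property the abstract emphasises.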

Related papers