Marco Pleines, Konstantin Ramthun, Yannik Wegener, Hendrik Meyer, Matthias Pallasch, Sebastian Prior, Jannik Drögemüller, Leon Büttinghaus, Thilo Röthemeyer, Alexander Kaschwig, Oliver Chmurzynski, Frederik Rohkrähmer, Roman Kalkreuth, Frank Zimmer∗, Mike Preuss†
∗Department of Communication and Environment, Rhine-Waal University of Applied Sciences, Kamp-Lintfort, Germany
Department of Computer Science, TU Dortmund University, Dortmund, Germany
†LIACS, Universiteit Leiden, Leiden, Netherlands
Abstract—Autonomously trained agents that are supposed to play video games reasonably well rely either on fast simulation speeds or heavy parallelization across thousands of machines running concurrently.
This work explores a third way that is established in robotics, namely sim-to-real transfer, or if the game is considered a simulation itself, sim-to-sim transfer.
In the case of Rocket League, we demonstrate that single behaviors of goalies and strikers can be successfully learned using Deep Reinforcement Learning in the simulation environment and transferred back to the original game.
Although the implemented training simulation is to some extent inaccurate, the goalkeeping agent saves nearly 100% of its faced shots once transferred, while the striking agent scores in about 75% of cases. Therefore, the trained agent is robust enough and able to generalize to the target domain of Rocket League.
Index Terms—Rocket League, sim-to-sim transfer, deep reinforcement learning, proximal policy optimization
I. INTRODUCTION
The spectacular successes of agents playing highly difficult games, such as StarCraft II [1] and DotA 2 [2], have been possible only because the employed algorithms were able to train on huge numbers of games, on the order of billions or more.
Unfortunately, and despite many improvements achieved in AI in recent years, the utilized Deep Learning methods are still relatively sample-inefficient.
To deal with this problem, fast-running environments or large amounts of computing resources are vital.
OpenAI Five for DotA 2 [2] is an example of the utilization of hundreds of thousands of computing cores in order to achieve high throughput in terms of played games.
However, this way is closed for games that run only on specific platforms and are thus very hard to parallelize.
Therefore it makes sense to look for alternative ways to tackle difficult problems.
Sim-to-real transfer offers such an alternative and is well established in robotics. It follows the general idea that robot behavior can be learned in a strongly simplified simulation environment, and that the trained agents can then be successfully transferred to the original environment.
If the target platform is a game as well, we may speak of sim-to-sim transfer, because the original game is also virtual, just computationally more demanding.
This approach is applicable to current games, even if they are not parallelizable, and makes them available for modern Deep Reinforcement Learning (DRL) methods.
There is of course a downside of this approach, namely that it may be difficult or even infeasible to establish a simulation that is similar enough to enable transfer later on, but still simple enough to speed up learning significantly.
Therefore, we aim to explore the possibilities of this direction in order to determine how simple the simulation can be and how well the transfer to the original game works.
The game we choose as a test case of the sim-to-sim approach is Rocket League (Figure 1), which basically resembles indoor football with cars and teams of 3.
Rocket League is freely available for Windows and Mac, possesses a bot API (RLBot [4]), and has a community of bot developers next to a large human player base.
As the three members of each team control car avatars with physical properties different from human runners, the overall tactic is one of rotation without fixed roles.
Beyond the basic abilities of shooting towards the goal and moving the goalie in order to prevent a goal, Rocket League is a minimal team AI setting [6] where layers of team tactics and strategy can be learned.
The first step of our work re-implements not all, but multiple physical gameplay mechanics of Rocket League using the game engine Unity, which results in a slightly inaccurate simulation.
The learned behaviors are then transferred to Rocket League for evaluation.
Even though the training simulation is imperfect, the transferred behaviors are robust enough to succeed at their tasks by generalizing to the domain of Rocket League.
Before concluding our work, a discussion is provided.
II. RELATED WORK

Sim-to-sim transfer on a popular multiplayer team video game mainly touches on two different areas, namely multi-agent learning and sim-to-real transfer.
As this work focuses on single-agent environments, namely the goalkeeper and the striker environment, related work on sim-to-real transfer is the focus of the remainder of this section.
The RoboCup competition [8] is known for the robot soccer cup, but also includes other challenges.
Reinforcement Learning (RL) has been successfully applied to simulated robot soccer in the past [9] and has been found to be a powerful method for tackling robot soccer.
A recent survey [10] provides insights into robot soccer and highlights significant trends, which briefly mention the transfer from simulation to the real world.
In general, sim-to-real transfer is a well-established method for robot learning and is widely used in combination with RL.
It allows the transition of an RL agent’s behavior, which has been trained in simulations, to real-world environments.
Sim-to-real transfer has been predominantly applied to RL-based robotics [11], where the robotic agent is trained in simulation before being deployed.
Popular applications for sim-to-real transfer in robotics have been autonomous racing [12], Robot Soccer [13], navigation [14], and control tasks [15].
To address the inability to exactly match the real-world environment, a challenge commonly known as the sim-to-real gap, steps have also been taken towards generalized sim-to-real transfer for robot learning [16], [17].
The translation of synthetic images to realistic ones is employed by a method called GraspGAN [18], which utilizes a generative adversarial network (GAN) [19].
GANs are able to generate synthetic data with good generalization ability.
This property can be used for image synthesis to model the transformation between simulated and real images.
GraspGAN provides a method called pixel-level domain adaptation, which translates synthetic images to realistic ones at the pixel level.
Another approach to narrowing the sim-to-real gap is domain randomization [20].
Its goal is to train the agent in plenty of randomized domains to generalize to the real domain.
By randomizing all physical properties and visual appearances during training in the simulation, a trained behavior was successfully transferred to the real world to solve the Rubik’s cube [21].
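As a rough sketch of this idea (not taken from the cited works), a simulation could re-sample its physical parameters at every episode reset; all parameter names and ranges below are purely illustrative.

```python
import random

def sample_domain():
    """Domain randomization sketch: draw a fresh set of physical parameters
    per episode so the learned policy cannot overfit a single domain.
    All names and ranges are illustrative placeholders."""
    return {
        "gravity": random.uniform(-700.0, -600.0),   # uu/s^2, hypothetical range
        "ball_radius": random.uniform(90.0, 96.0),   # uu, hypothetical range
        "friction": random.uniform(0.25, 0.40),
        "restitution": random.uniform(0.55, 0.70),
    }

# At every environment reset the simulation would be reconfigured with these
# values, e.g. simulation.configure(**sample_domain())  (hypothetical call).
params = sample_domain()
```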
This section starts out by providing an overview of vital components of Rocket League’s physical gameplay mechanics, which are implemented in the training simulation based on the game engine Unity and the ML-Agents Toolkit [22].
Afterward, the DRL environments, designated for training, and their properties are detailed.
The code is open source1.
A. Implementation of the Training Simulation
The implementation of the Unity simulation originates from the so-called RoboLeague repository [3].
As this version of the simulation is largely incomplete and inaccurate, multiple fundamental aspects and concepts are implemented, which are essentially based on the physical specifications of Rocket League.
Therefore, Table I provides an overview of everything that was considered while implementing the essential physical components, highlighting distinct adjustments that differ from the information provided by the references.
TABLE I (additional information and parameters that differ from the references): the car model Octane and its collision mesh are used; the radius of the ball is set to 93.15 uu (92.75 uu in Rocket League); the maximum angular velocity during a dodge is raised from 5.5 rad/s to 7.3 rad/s; the drag coefficients are adjusted to −4.75 for roll and −2.85 for pitch; the impulse of the Bullet engine replaces the Unity one and is used for the ball-to-car and car-to-car interaction; the Psyonix impulse is an additional impulse on the center of the ball, which allows a better prediction and control of collisions; an acceleration value (in uu/s²) is used that is reduced by more than half when the car is upside down; further components are implemented using the Bullet and Psyonix impulses, or implemented but not thoroughly tested and hence not considered in this paper.
Fig. 2. The physical maneuver of a dodge roll is executed to exemplarily show the alignment of the Unity simulation to the ground truth when using different maximum angular velocities.
Note that most measures are given in Unreal units (uu).
To convert them to Unity's scale, these values have to be divided by 100; the ball radius of 93.15 uu, for instance, corresponds to 0.9315 units in Unity.
Some adjustments are based on empirical findings by comparing the outcome of distinct physical maneuvers inside the implemented training simulation and the ground truth provided by Rocket League.
While the maneuver is executed in both simulations, multiple relevant game state variables such as positions, rotations, and velocities are monitored for later evaluation.
Figure 2 is an example where the physical maneuver orders the car to execute a dodge roll.
Whereas the original maximum angular velocity of 5.5 rad/s does not compare well to the ground truth, a more suitable value of 7.3 rad/s is found by analyzing the observed data.
The Unity simulation reaches a training speed of about 950 steps/second, while RLBot is constrained to real-time, where only 120 steps/second are possible.
The agent is provided with enough information to fully observe its environment, as illustrated by Figure 3.
The agent's action space is multi-discrete and contains the following 8 dimensions:
• Throttle (5 actions)
• Steer (5 actions)
• Yaw (5 actions)
• Pitch (5 actions)
Moreover, multi-discrete action spaces allow the execution of concurrent actions.
One discrete action dimension could achieve the same behavior.
This would require defining actions that feature every permutation of the available actions.
As a consequence, the already high-dimensional action space of Rocket League would be much larger and therefore harder to train.
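As a minimal sketch of such a multi-discrete action space, the snippet below uses the gymnasium API for illustration rather than the ML-Agents setup used in this work; the first four dimension sizes follow the listing above, while the remaining sizes (3, 2, 2, 2) follow Figure 4, and their interpretation as roll, boost, handbrake, and jump is an assumption.

```python
import numpy as np
from gymnasium.spaces import MultiDiscrete

# One categorical head per control dimension: throttle, steer, yaw, pitch (5 each),
# plus the remaining sizes (3, 2, 2, 2) from Figure 4 (assumed to be roll, boost,
# handbrake, and jump).
action_space = MultiDiscrete([5, 5, 5, 5, 3, 2, 2, 2])

# A sampled action is a tuple with one choice per head, so concurrent actions
# such as throttle + steer + boost are expressed directly.
action = action_space.sample()  # e.g. array([2, 4, 0, 1, 2, 1, 0, 1])

# Flattening the same controls into a single discrete dimension would need one
# action per permutation, i.e. a far larger action space.
flat_size = int(np.prod(action_space.nvec))  # 15000
```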
IV. DEEP REINFORCEMENT LEARNING
The actor-critic, on-policy algorithm PPO [7] and its clipped surrogate objective (Equation 1) are used to train the agent's policy π, with respect to its model parameters θ, inside the Unity simulation.
PPO, algorithmic details, and the model architecture are presented next.
A. Proximal Policy Optimization

$L^{C}_t(\theta)$ denotes the policy objective, which optimizes the probability ratio of the current policy $\pi_\theta$ and the old one $\pi_{\theta_{old}}$:
$$L^{C}_t(\theta) = \hat{\mathbb{E}}_t\left[\min\left(q_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(q_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\right)\right] \tag{1}$$

with the surrogate objective

$$q_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$$
$s_t$ is the environment's state at step $t$, and $a_t$ is an action tuple, which is executed by the agent while being in $s_t$. The clipping range is stated by $\epsilon$, and $\hat{A}_t$ is the advantage, which is computed using generalized advantage estimation [30].
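A minimal sketch of generalized advantage estimation over a single worker's trajectory, assuming NumPy arrays as storage; γ and λ are illustrative defaults, not the hyperparameters used in this work.

```python
import numpy as np

def gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Compute advantages A_t and sampled returns G_t = V_old(s_t) + A_t
    over one trajectory of length T (generalized advantage estimation [30])."""
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    next_value, running_advantage = last_value, 0.0
    for t in reversed(range(T)):
        non_terminal = 1.0 - dones[t]          # no bootstrapping across episode ends
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        running_advantage = delta + gamma * lam * non_terminal * running_advantage
        advantages[t] = running_advantage
        next_value = values[t]
    returns = values + advantages              # used as G_t in the value loss below
    return advantages, returns
```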
While computing the squared error loss $L^{V}_t$ of the value function, the maximum between the default and the clipped error loss is determined:

$$V^{C}_t = V_{\theta_{old}}(s_t) + \mathrm{clip}\left(V_\theta(s_t) - V_{\theta_{old}}(s_t),\,-\epsilon,\,\epsilon\right) \tag{2}$$

$$L^{V}_t = \max\left((V_\theta(s_t) - G_t)^2,\ (V^{C}_t - G_t)^2\right) \tag{3}$$

with the sampled return $G_t = V_{\theta_{old}}(s_t) + \hat{A}_t$.
The final objective $L^{CVH}_t(\theta)$ is established by:

$$L^{CVH}_t(\theta) = \hat{\mathbb{E}}_t\left[L^{C}_t(\theta) - c_1 L^{V}_t(\theta) + c_2 H[\pi_\theta](s_t)\right] \tag{4}$$

To encourage exploration, the entropy bonus $H[\pi_\theta](s_t)$ is added and weighted by the coefficient $c_2$. Weighting is also applied to the value loss using $c_1$.
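The three terms can be combined into a single loss to minimize, as in the following PyTorch-style sketch; the coefficients are illustrative defaults rather than this work's settings, and for the multi-discrete policy the log-probabilities and entropies are assumed to be summed over the action dimensions.

```python
import torch

def ppo_loss(new_log_prob, old_log_prob, advantages, value, old_value,
             returns, entropy, clip_eps=0.2, c1=0.5, c2=0.01):
    """PPO objective (Equations 1-4) written as a loss for gradient descent."""
    # Clipped surrogate policy objective (Eq. 1)
    ratio = torch.exp(new_log_prob - old_log_prob)  # q_t(theta)
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages)

    # Clipped value loss (Eqs. 2 and 3)
    value_clipped = old_value + torch.clamp(value - old_value, -clip_eps, clip_eps)
    value_loss = torch.max((value - returns) ** 2, (value_clipped - returns) ** 2)

    # Final objective (Eq. 4), negated because optimizers minimize
    objective = surrogate - c1 * value_loss + c2 * entropy
    return -objective.mean()
```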
Fig. 4. The policy and the value function share gradients and several parameters.
After feeding 23 game state variables as input to the model and processing a shared fully connected layer, the network is split into a policy stream and a value stream, each starting with its own fully connected layer.
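A sketch of this architecture with the layer sizes shown in Figure 4; the choice of ReLU activations is an assumption, as the activation function is not stated here.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared layer followed by a policy stream (multi-discrete heads) and a
    value stream, mirroring Figure 4: 23 inputs, 256 units per layer, and
    action heads of size (5, 5, 5, 5, 3, 2, 2, 2)."""

    def __init__(self, obs_size=23, hidden=256,
                 action_dims=(5, 5, 5, 5, 3, 2, 2, 2)):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_size, hidden), nn.ReLU())
        self.policy_stream = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.value_stream = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(hidden, n) for n in action_dims)
        self.value = nn.Linear(hidden, 1)

    def forward(self, obs):
        h = self.shared(obs)
        p = self.policy_stream(h)
        logits = [head(p) for head in self.heads]   # one set of logits per action dimension
        value = self.value(self.value_stream(h)).squeeze(-1)
        return logits, value
```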
PPO starts out by sampling multiple trajectories of experiences, which may contain multiple completed and truncated episodes, from a constant number of concurrent environments (i.e. workers).
The model parameters are then optimized by conducting stochastic gradient descent for several epochs of mini-batches, which are sampled from the collected data.
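A sketch of this optimization phase, reusing the ActorCritic and ppo_loss sketches above; the buffer layout, epoch count, and batch size are assumptions for illustration.

```python
import torch

def ppo_update(model, optimizer, buffer, epochs=4, batch_size=512):
    """Run several epochs of mini-batch stochastic gradient descent over the
    trajectories collected from all concurrent workers. `buffer` is assumed to
    hold flattened tensors: obs, actions (int64), old_log_prob, old_value,
    advantages, returns."""
    num_samples = buffer["obs"].shape[0]
    for _ in range(epochs):
        for batch in torch.randperm(num_samples).split(batch_size):
            logits, value = model(buffer["obs"][batch])
            dists = [torch.distributions.Categorical(logits=l) for l in logits]
            actions = buffer["actions"][batch]
            # Joint log-probability and entropy of the multi-discrete action tuple
            new_log_prob = sum(d.log_prob(a) for d, a in zip(dists, actions.T))
            entropy = sum(d.entropy() for d in dists)
            loss = ppo_loss(new_log_prob, buffer["old_log_prob"][batch],
                            buffer["advantages"][batch], value,
                            buffer["old_value"][batch], buffer["returns"][batch],
                            entropy)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```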
We further conduct an ablation study on the implemented physics where each experiment turns off one or all components.
Turning a component off may also mean falling back to the default physics of Unity.
If not stated otherwise, each training run is repeated 5 times and undergoes a thorough evaluation.
Each model checkpoint is evaluated in Unity and Rocket League by 10 training and 10 novel shots, which are repeated 3 times.
Therefore, each data point aggregates 150 episodes featuring one shot each (5 training runs × 10 shots × 3 repetitions).
[Fig. 4 diagram: Game State Variables (23) → Fully Connected (256), splitting into a Policy Stream (Fully Connected (256) → action dimensions (5)(5)(5)(5)(3)(2)(2)(2)) and a Value Stream (Fully Connected (256) → Value (1)).]
The following physical maneuvers are executed in both simulations:

1) Acceleration
• Car drives forward and steers left and right
• Car drives backward and steers left and right
• Car uses boost and steers left and right

2) Air Control
• Car starts up in the air, looks straight up, boosts shortly, and boosts while rolling in the air
• Car starts up in the air, has an angle of 45◦, boosts shortly, and boosts while rolling in the air
• Car starts up in the air, looks straight up, and concurrently boosts, yaws, and air rolls

3) Drift
• Car drives forward for a bit and then starts turning and drifting while moving forward
• Car drives backward for a bit and then starts turning and drifting while moving forward
• Car uses boost and then starts turning and drifting while using boost

4) Jump
• Car makes a short jump, then a long one, and at last a double jump
• Car makes a front flip, a back flip, and a dodge roll
• Car drives forward and does a diagonal front flip

5) Bounce
• Ball falls down with an initial force applied on its x-axis
• Ball falls down with an initial force applied on its x-axis and an angular velocity

6) Shot
• Car drives forward and hits the motionless ball
• Car drives forward and the ball rolls to the car
• Ball jumps, the car jumps while boosting and hits the ball
Afterward, the error for each data point between both simulations is measured.
The final results are described by Table II, which comprises the mean, max, and standard deviation (Std) error across each run scenario.
Letting the ball bounce for some time shows the least error, while a significant one is observed when examining the scenarios where the car shoots the ball.
The previously shown imperfections of the Unity simulation may lead to the impression that successfully transferring a trained behavior is rather unlikely.
Even though each experiment ablates all, single or no physical adaptations, the agent is still capable of saving nearly every ball once transferred to Rocket League.
Given this setting, two different policies were achieved.
One policy approaches the ball as fast as possible while using a diagonal dodge roll to make the final touch to score. However, this behavior fails a few shots.
The other emergent behavior can be considered the opposite.
Depending on the distance and the height of the ball, the agent waits some time or even backs up to ensure that it will hit the ball while being on the ground.
To train multiple cooperative and competitive agents, the first obstacle that comes to mind is the tremendously high computational complexity, which might be infeasible for smaller research groups.
[Result plots: IQM cumulative reward over training steps (in millions) for Unity Train, Unity Eval, and Rocket Eval under the ablation conditions All On (Baseline), Bullet Impulse Off, Custom Bounce Off, Ground Stabilization Off, Psyonix Impulse Off, Suspension Off, Wall Stabilization Off, and All Off.]
But before going this far, several aspects need to be considered that can be treated in isolation as well.
Finally, the difficulties of training the more difficult striker environment are discussed.
C. Difficulties of Training the Harder Striker Environment
While the goalie and the striker environments are relatively easy, the slightly harder striker environment poses a much greater challenge for multiple reasons:
First, the Unity simulation still lacks the implementation of physical concepts like the car-to-car interaction and suffers from the inaccuracies reported in Section V-A.
At the cost of more computational resources, domain randomization [20] could achieve a more robust agent, potentially comprising an improved ability to generalize to the domain of Rocket League.
Moreover, the Unity simulation still does not consider training under human conditions.
Notably, the current observation space provides perfect information on the current state of the environment, whereas players in Rocket League have to cope with imperfect information due to solely perceiving the rendered image of the game.
However, one critical concern is that the RLBot API does not reveal the rendered image of Rocket League and therefore makes a transfer impossible as of now.
The Unity simulation's aesthetics are very abstract, whereas Rocket League impresses with multiple arenas featuring many details concerning lighting, geometry, shaders, textures, particle effects, etc.
Moreover, the multi-discrete action space used in this paper is a simplification of the original action space that features concurrent continuous and discrete actions.
Initially, the training was done using the PPO implementation of the ML-Agents toolkit [22], which supports mixed (or hybrid) concurrent action spaces.
However, these experiments were quite unstable and hindered progress.
Therefore, Rocket League presents an interesting challenge for exploring such action spaces, of which other video games or applications are likely to take advantage.
For example, the agent could exploit such signals to cuddle with the ball at a close distance or to slowly approach the ball to maximize the cumulative return of the episode.
If those signals are turned off once the ball is touched, the value function might struggle to make further good estimates on the value of the current state of the environment, which ultimately may lead to misleading training experiences and hence an unstable learning process.
VII. CONCLUSION

Towards solving Rocket League by means of Deep Reinforcement Learning, a more sample-efficient simulation is crucial, because the original game can neither be sped up nor parallelized on Linux-based clusters.
Although the implemented simulation is not perfectly accurate, we demonstrate that transferring a trained behavior from Unity to Rocket League is remarkably robust and generalizes when dealing with the goalkeeper and striker tasks.
After all, Rocket League still poses further challenges when targeting a complete match under human circumstances.
Based on our findings, we believe that Rocket League and its Unity counterpart will be valuable to various research fields and aspects, including sim-to-sim transfer, partial observability, mixed action spaces, curriculum learning, and competitive and cooperative multi-agent settings.
REFERENCES

[1] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, Ç. Gülçehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, D. Wünsch, K. McKinney, O. Smith, T. Schaul, T. P. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, and D. Silver, "Grandmaster level in StarCraft II using multi-agent reinforcement learning," Nature, vol. 575, no. 7782, pp. 350–354, 2019.

[2] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, R. Józefowicz, S. Gray, C. Olsson, J. Pachocki, M. Petrov, H. P. de Oliveira Pinto, J. Raiman, T. Salimans, J. Schlatter, J. Schneider, S. Sidor, I. Sutskever, J. Tang, F. Wolski, and S. Zhang, "Dota 2 with large scale deep reinforcement learning," CoRR, vol. abs/1912.06680, 2019.

[3] RoboLeague, "Roboleague," 2021, available at https://github.com/

[5] Y. Verhoeven and M. Preuss, "On the potential of Rocket League for driving team AI development," in 2020 IEEE Symposium Series on Computational Intelligence (SSCI), 2020, pp. 2335–2342.

[6] M. Mozgovoy, M. Preuss, and R. Bidarra, "Guest editorial special issue on team AI in games," IEEE Trans. Games, vol. 13, no. 4, pp. 327–329, 2021.

[7] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," CoRR, vol. abs/1707.06347, 2017.

[8] H. Kitano, M. Asada, Y. Kuniyoshi, I. Noda, and E. Osawa, "RoboCup: The robot world cup initiative," in Proceedings of the First International Conference on Autonomous Agents, AGENTS 1997, Marina del Rey, California, USA, February 5-8, 1997, W. L. Johnson, Ed. ACM, 1997, pp. 340–347.

[9] M. J. Hausknecht and P. Stone, "Deep reinforcement learning in parameterized action space," in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2016.

[10] E. Antonioni, V. Suriani, F. Riccio, and D. Nardi, "Game strategies for physical robot soccer players: A survey," IEEE Trans. Games, vol. 13, no. 4, pp. 342–357, 2021.

[11] W. Zhao, J. P. Queralta, and T. Westerlund, "Sim-to-real transfer in deep reinforcement learning for robotics: A survey," in 2020 IEEE Symposium Series on Computational Intelligence, SSCI 2020, Canberra, Australia, December 1-4, 2020. IEEE, 2020, pp. 737–744.

[12] B. Balaji, S. Mallya, S. Genc, S. Gupta, L. Dirac, V. Khare, G. Roy, T. Sun, Y. Tao, B. Townsend, E. Calleja, S. Muralidhara, and D. Karuppasamy, "DeepRacer: Autonomous racing platform for experimentation with sim2real reinforcement learning," in 2020 IEEE International Conference on Robotics and Automation, ICRA 2020, Paris, France, May 31 - August 31, 2020. IEEE, 2020, pp. 2746–2754.

[13] J. Blumenkamp, A. Baude, and T. Laue, "Closing the reality gap with unsupervised sim-to-real image translation for semantic segmentation in robot soccer," CoRR, vol. abs/1911.01529, 2019.

[14] R. Traoré, H. Caselles-Dupré, T. Lesort, T. Sun, N. D. Rodríguez, and D. Filliat, "Continual reinforcement learning deployed in real-life using policy distillation and sim2real transfer," CoRR, vol. abs/1906.04452, 2019.

[15] O. Pedersen, E. Misimi, and F. Chaumette, "Grasping unknown objects by coupling deep reinforcement learning, generative adversarial networks, and visual servoing," in 2020 IEEE International Conference on Robotics and Automation, ICRA 2020, Paris, France, May 31 - August 31, 2020. IEEE, 2020, pp. 5655–5662.

[16] K. Rao, C. Harris, A. Irpan, S. Levine, J. Ibarz, and M. Khansari, "RL-CycleGAN: Reinforcement learning aware simulation-to-real," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. Computer Vision Foundation / IEEE, 2020, pp. 11154–11163.

[17] D. Ho, K. Rao, Z. Xu, E. Jang, M. Khansari, and Y. Bai, "RetinaGAN: An object-aware approach to sim-to-real transfer," in IEEE International Conference on Robotics and Automation, ICRA 2021, Xi'an, China, May 30 - June 5, 2021. IEEE, 2021, pp. 10920–10926.

[18] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige, S. Levine, and V. Vanhoucke, "Using simulation and domain adaptation to improve efficiency of deep robotic grasping," in 2018 IEEE International Conference on Robotics and Automation, ICRA 2018, Brisbane, Australia, May 21-25, 2018. IEEE, 2018, pp. 4243–4250.

[19] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds., 2014, pp. 2672–2680.

[20] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, "Domain randomization for transferring deep neural networks from simulation to the real world," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2017, Vancouver, BC, Canada, September 24-28, 2017. IEEE, 2017, pp. 23–30.

[21] OpenAI, I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, J. Schneider, N. Tezak, J. Tworek, P. Welinder, L. Weng, Q. Yuan, W. Zaremba, and L. Zhang, "Solving Rubik's cube with a robot hand," CoRR, vol. abs/1910.07113, 2019.

[22] A. Juliani, A. Khalifa, V. Berges, J. Harper, E. Teng, H. Henry, A. Crespi, J. Togelius, and D. Lange, "Obstacle tower: A generalization challenge in vision, control, and planning," in Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI 2019, 2019, pp. 2684–2691.

[23] S. Mish, "Rocket League notes," 2019, available at https://samuelpmish.github.io/notes/RocketLeague/, retrieved February 28, 2022.

[24] Timo Huth, "Dodges explained, power & more - Rocket Science #14," 2018, available at https://www.youtube.com/watch?v=pX950bhGhJE, retrieved February 28, 2022.

… "Rocket League" detailed," 2018, available at https://www.gdcvault.com/play/1024972/It-IS-Rocket-Science-The, retrieved February 28, 2022.

[29] M. Pleines, F. Zimmer, and V. Berges, "Action spaces in deep reinforcement learning to mimic human input devices," in IEEE Conference on Games, CoG 2019, London, United Kingdom, August 20-23, 2019. IEEE, 2019, pp. 1–8.

[30] J. Schulman, P. Moritz, S. Levine, M. I. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2016.

[31] R. Agarwal, M. Schwarzer, P. S. Castro, A. Courville, and M. G. Bellemare, "Deep reinforcement learning at the edge of the statistical precipice," in Thirty-Fifth Conference on Neural Information Processing Systems, 2021.

[32] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, "Curriculum learning," in Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, ser. ACM International Conference Proceeding Series, A. P. Danyluk, L. Bottou, and M. L. Littman, Eds., vol. 382. ACM, 2009, pp. 41–48.

[33] V. Gullapalli and A. Barto, "Shaping as a method for accelerating reinforcement learning," in Proceedings of the 1992 IEEE International Symposium on Intelligent Control, 1992, pp. 554–559.