Many hierarchical reinforcement learning algorithms utilise a series of independent skills as a basis to solve tasks at a higher level of reasoning. These algorithms do not consider the value of using skills that are cooperative instead of independent. This paper proposes the Cooperative Consecutive Policies (CCP) method of enabling consecutive agents to cooperatively solve long time horizon multi-stage tasks. This method is achieved by modifying the policy of each agent to maximise both the current and next agent's critic. Cooperatively maximising critics allows each agent to take actions that are beneficial for its task as well as subsequent tasks. Using this method in a multi-room maze domain and a peg in hole manipulation domain, the cooperative policies were able to outperform a set of naive policies, a single agent trained across the entire domain, as well as another sequential HRL algorithm.
Index Terms—Reinforcement Learning
I. INTRODUCTION
Many of the struggles that Reinforcement Learning (RL) faces have been addressed within the field of Hierarchical Reinforcement Learning (HRL) [1].
Complex domains such as ant [2], humanoid [3] and swimmer [4] have all had high levels of success with hierarchical methods while non-hierarchical methods have struggled to make progress.
Hierarchical methods take advantage of the ability of many tasks to be abstracted, solving the problem of high level reasoning separately from low level control [5].
For example, for the task of cutting a cake, a high level controller may employ the skill of grasping a knife, followed by manoeuvring the knife above the cake, and finally lowering the knife to cut the cake.
A skill trained to grasp a knife may choose to grasp the knife by the blade, a behaviour appropriate for other tasks, such as passing a knife safely to a human, but one that would prevent the knife from being used to cut the cake.
1 J. Erskine and C. Lehnert are with the Queensland University of Technology (QUT), Brisbane, Australia, and affiliated with the Queensland Centre of Robotics (QCR). jordan.erskine@hdr.qut.edu.au, c.lehnert@qut.edu.au
Fig. 1. A high-level view of how the Cooperative Consecutive Policies method works.
Our method enables consecutive agents to work together more cohesively to complete multi-stage tasks.
In our proposed approach, each agent is incentivised to cooperate with the next agent by training the agent’s policy network to produce actions that maximise both the current agent’s critic and the next agent’s critic, weighted by an introduced parameter, the cooperative ratio.
By incorporating the next agent’s critic, the current agent can continue to achieve its own goal while also producing a solution that is beneficial for the next agent.
The contributions produced by this paper are:
• A novel method, Cooperative Consecutive Policies (CCP), which enables agents to learn behaviours that maximise reward for their own task while accommodating subsequent tasks, improving performance for learning multi-stage tasks.
• Two case studies in a continuous state/action space maze domain and a robotic manipulation domain. In these experiments, CCP outperformed a set of naive policies (trained to greedily maximise subtask reward), a single agent trained end-to-end on the same task, as well as a sequential HRL baseline using the transition policies method.
• An ablation study on the effect of varying the cooperative ratio. This includes a study on the effect of the cooperative ratio on success rate, as well as the ability to use different cooperative ratios for different cooperative policies.
HRL methods capitalise on the inherent structure that is present in many tasks.
An important benefit of using HRL is that learning a series of smaller, simpler skills is easier and faster than learning to solve a single, more complex task [9] [10] [11].
Typically, these methods use a meta-controller in conjunction with a series of lower-level policies for each subtask [14] [15] [16] [17], which allows for more generalisation, as the series of subtasks can be combined in more versatile ways.
Separate agents are then trained to solve these subtasks independently.
The fact that these agents are not incentivised to assist in solving the overall task means that without careful engineering of subtasks, suboptimal solutions to tasks are likely [23].
Our method is designed to overcome the difficulties of engineering subtasks by enabling the agents to cooperate towards completing the task, loosening the requirements of careful design.
There are other methods that seek to solve a similar problem by learning transition policies between previously learnt subtasks [24] [25], but these transition policies may still struggle if the solution to one subtask is too far from an adequate starting position for the subsequent subtask.
III. METHOD
The experiments described in this paper were implemented using the SAC algorithm [26] [27].
A. Procedure
We consider a modified MDP formulation for solving the problem of picking optimal actions to solve a task.
We assume an environment that involves a task that is decomposed into a series of N subtasks.
At each timestep t an agent can take an action a_t from the current state s_t, which results in the environment evolving to the next state s_{t+1}, producing a transition signal U(s_t) ∈ [1, N] that determines which subtask is currently active, and produces a series of N reward signals r_{n,t}, n ∈ [1, N], that correspond to each subtask.
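As an illustration only, the following is a minimal sketch of this decomposed-task interface, assuming a gym-style base environment. The class name DecomposedTaskEnv, the abstract helpers, and the 4-tuple return shape are assumptions for the sketch, not the authors' implementation; only the structure (a transition signal U(s) in [1, N] plus N reward signals per step) follows the text.

```python
# Sketch of an environment exposing N subtask rewards and a transition signal.
import numpy as np


class DecomposedTaskEnv:
    def __init__(self, base_env, num_subtasks):
        self.env = base_env
        self.N = num_subtasks

    def reset(self):
        # Delegate to the underlying environment.
        return self.env.reset()

    def transition_signal(self, state):
        # U(s): index of the currently active subtask, in [1, N].
        raise NotImplementedError("domain-specific subtask membership")

    def subtask_reward(self, n, state, action):
        # r_n: reward signal for subtask n at this state/action.
        raise NotImplementedError("domain-specific subtask reward")

    def step(self, action):
        next_state, _, done, info = self.env.step(action)
        # One reward signal per subtask, r_1, ..., r_N, for the new state.
        rewards = np.array([self.subtask_reward(n, next_state, action)
                            for n in range(1, self.N + 1)])
        return next_state, rewards, self.transition_signal(next_state), done
```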
The CCP method requires creating N agents, one for each subtask, that each consist of a policy π_n, n ∈ [1, N], with parameters θ_n and a critic Q_n, n ∈ [1, N], with parameters β_n.
This method requires that the subsequent agent is known to the current agent.
Each critic maximises the discounted sum of future rewards r_n from its subtask.
Each policy, rather than maximising the critic that corresponds to their own subtask, instead maximises a convex sum of the current and subsequent critics.
Using both the subsequent and current critic allows a policy to act to solve the current subtask in a way that allows an effective solution in future subtasks.
The CCP method is not an algorithm in and of itself.
It is an algorithmic change that can be applied to any actor critic algorithm, assuming there are consecutive agents used to solve a task.
Where M is the number of samples in a batch b sampled from replay buffer B_n, α is the entropy maximisation term, and C is a convex combination of the current and subsequent critics, as defined by:

C(Q_n, Q_{n+1}) = η Q̂_n(s, π_n(s)) + (1 − η) Q̂_{n+1}(s, π_n(s))    (3)

where η is the cooperative ratio.
This ratio affects how much the current policy acts with respect to the subsequent critic, and is a number between 0 and 1.
Algorithm 1: Gathering Data
Environment with N subtasks and associated reward signals r_(1,...,N);
For each subtask initialise an agent A_n, including a policy π_n with parameters θ_n, a critic Q_n with parameters β_n, and a replay buffer B_n;
while timestep < maxTimestep do
    s, n ← reset environment;
    while not done do
        a ∼ π_n(s);
        s′, r_(1,...,N), done ← environment step with a;
        record (s, a, r_(1,...,N), s′, done) in B_n;
        s ← s′;
        n ← U(s′);
    end
end

Fig. 3. A 3 room example of the Maze domain.
The first agent begins in the starting area and produces actions to navigate the first room.
As the agent enters the next room, the next agent takes charge to navigate that room.
The agent’s scan range is shown in purple.
A cooperative ratio closer to 1 incentivises the policy to maximise the current critic's estimate, whereas a cooperative ratio closer to 0 incentivises it to maximise the subsequent critic's estimate.
Refer to appendix (Section VIII) for mathematical analysis of this method.
This implementation of CCP is designed using SAC as the base algorithm.
Applying CCP to other algorithms is done by modifying the algorithm's policy update to use C in place of the traditional critic evaluation, and ensuring each agent is updated using the correct buffer B, as shown in Figure 2.
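As a concrete illustration of this modification, the sketch below shows a SAC-style policy loss in PyTorch with the convex combination C in place of the single critic, mirroring the policy-update step of Algorithm 2. The interface names (policy_n.sample, the critic call signatures) are assumptions, and the batch-level min-max normalisation is only a stand-in for Equation 4, which is not reproduced in this extract.

```python
# Sketch of a CCP-modified SAC policy update (not the authors' implementation).
import torch


def normalise(q_values, eps=1e-8):
    # Min-max normalise critic estimates across the minibatch so the two
    # critics are combined on a comparable scale.
    q_min, q_max = q_values.min(), q_values.max()
    return (q_values - q_min) / (q_max - q_min + eps)


def ccp_policy_loss(policy_n, q_n, q_next, states, eta, alpha):
    # Sample actions (with log-probabilities) from the current agent's policy.
    actions, log_pi = policy_n.sample(states)
    # Evaluate both the current and the subsequent agent's critic at (s, a).
    q_hat_n = normalise(q_n(states, actions))
    q_hat_next = normalise(q_next(states, actions))
    # Convex combination of the two critics, weighted by the cooperative ratio.
    c = eta * q_hat_n + (1.0 - eta) * q_hat_next
    # Standard SAC-style policy objective with C in place of the single critic.
    return (alpha * log_pi - c).mean()
```

Note that setting η = 1 in this sketch recovers a policy that maximises only its own critic, which corresponds to the naive, independently trained agents used as a baseline later in the paper.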
Two domains were used to test the efficacy of the CCP method: the maze domain, and the peg in hole domain.
Algorithm 2: Cooperative Training
Set of N agents A_n, each with policy π_n with parameters θ_n, a critic Q_n with parameters β_n, and a replay buffer B_n; cooperative ratio η; discount factor γ; entropy maximisation term α;
for n in (1,...,N) do
    sample minibatch b of M samples from B_n → (s, a, r_(1,...,N), s′, d);
    for j in (n, n+1) do
        a′ ∼ π_n(s′);
        y(r_j, s′, d) = r_j + (1 − d) γ (Q_j(s′, a′) − α log π_n(a′|s′));
        ∇β_j = (1/M) Σ (Q_j(s, a) − y(r_j, s′, d))²;
    end
    a′ ∼ π_n(s);
    for j in (n, n+1) do
        calculate Q̂_j across minibatch b according to Equation 4;
    end
    C(Q_n, Q_{n+1}) = η Q̂_n(s, a′) + (1 − η) Q̂_{n+1}(s, a′);
    ∇θ_n = (1/M) Σ (α log π_n(a′|s) − C(Q_n, Q_{n+1}));
end

1) Maze Navigation:
In the maze domain, a series of consecutive rooms were created.
Each room had two paths to exit into the next room, one of which leads to a dead end.
Each room was considered a subtask with the goal being to exit the room, with a reward signal dependent on horizontal position, starting at 0 on the left of the room and linearly increasing to 1 at the right of the room.
The overall task in this domain was to travel through all the rooms to get from one side to the other.
The rooms are designed such that an optimal subtask solution would not lead to an optimal overall solution, as the reward signal for each task increases as the position moves to the right, not towards the correct path.
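For illustration, a toy version of this room-wise reward might look as follows; the exact scaling and clipping are assumptions, since the paper only states that the reward rises linearly from 0 at the left of the room to 1 at the right.

```python
# Toy room-wise subtask reward: 0 at the left wall, 1 at the right wall,
# increasing linearly with horizontal position regardless of which doorway
# actually leads onward (which is what makes the greedy solution suboptimal).
def room_reward(x, room_left, room_width):
    return min(max((x - room_left) / room_width, 0.0), 1.0)
```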
This domain allows for easy extension in terms of the number of subtasks that are required to be able to solve the overall task.
The agent in this domain uses two continuous actions: linear velocity and angular velocity.
The agent's observation of the environment includes a laser scan as well as its global position in the maze, all of which are continuous measurements.
The second subtask is to move the peg towards and then into the hole.
The reward for this subtask is also an exponentially decaying reward, this time based on the distance between the centre of the peg and the centre of the hole.
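A toy version of this insertion reward could be written as below; the exact form of the exponential and the decay constant k are assumptions, as the paper does not state them.

```python
# Toy insertion reward: exponentially decaying with the distance between the
# centre of the peg and the centre of the hole (decay constant k is assumed).
import numpy as np


def insertion_reward(peg_centre, hole_centre, k=5.0):
    distance = np.linalg.norm(np.asarray(peg_centre) - np.asarray(hole_centre))
    return float(np.exp(-k * distance))
```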
If the peg is grasped by the thin section, then the insertion subtask cannot be completed optimally, as the peg cannot be fully inserted into the hole.
The agent in this domain has two continuous actions: end effector velocity along the axis of actuation and gripper force.
The state that the agent observes is the positions, orientations and velocities of the peg, hole and end effector.
Both domains in this paper use relatively simple reward signals.
Though more complex, and potentially more informative, reward signals could be constructed, these simple reward signals were used on purpose to show that the shortcomings of an imperfect reward signal can be overcome using the CCP method.
Engineering a reward signal that can effectively avoid globally suboptimal behaviours is an expensive and sometimes intractable problem, and being able to solve a task without it is a valuable quality.
B. Algorithms
Four different methods were evaluated in these domains:
• CSAC: The CSAC method is an implementation of Soft Actor Critic (SAC) [26] [27] that utilises the CCP methodology of incentivising cooperative solving of sequential tasks, making a set of Cooperative SAC agents.
• Naive: Each agent is attempting to maximise solely its own reward signal. This method represents the naive approach of treating each agent independently.
• SAC: The SAC method involves using a single end-to-end SAC agent trained to perform across the whole domain, utilising a reward signal that is the combined reward signal from all the subtasks. This method represents the standard RL approach to solving a task.
• TP: The Transition Policies method [24] used as a baseline. This method trains primitive policies to complete each subtask, and then trains transition policies to move from subtask termination states to states that are good initialisations for the subsequent subtask.
Hyperparameters and further implementation details are listed in Appendix 1 (Section VII).
C. Studies
1) Optimal cooperative ratio: The first study tested the performance of the four different methods in both the maze domain and the peg in hole domain.
In the maze domain three different length mazes were tested using a sweep across the cooperative ratio parameter to determine its effects on learning performance, with a similar sweep conducted in the peg in hole domain.
The best results from these sweeps were then compared to the results from using the other methods.

TABLE I
SUCCESS RATE OF EACH METHOD IN MAZE AND PEG IN HOLE DOMAINS
This study also investigated the sensitivity of the method’s performance with respect to the cooperative ratio.
2) Independent cooperative ratios: The previous experiment used the same cooperative ratio for each agent within each experiment.
An investigation was conducted in the 3 room maze domain, in which there are two cooperative agents, to determine whether using independent cooperative ratios for each agent would improve performance.
A parameter sweep was conducted across the cooperative ratio for each cooperative agent to investigate its effect on learning performance.
V. RESULTS
A. Optimal cooperative ratio
1) Maze navigation: The results for the 2 room experiment within the maze domain are presented in Figure 4 and are summarised in Table I. This experiment shows that the cooperative and naive agents both learn a successful policy in a similar time period, whereas the single agent policy took more than 3 times as long to reach a similar performance.
The naive agents, though they reached a high level of success quickly, had a decaying performance.
This is due to the fact that each agent learned a solution to their domain quickly, which included travelling to the further door.
As each agent refined its solution, the shorter path, which is suboptimal overall, was used more frequently, reducing the performance of the overall task.
This experiment shows that decomposing a task into subtasks is beneficial in terms of training speed, shown by the relative training speed of the cooperative and naive policies compared to the single agent.
This experiment also shows that just decomposing a task into subtasks and then treating them as entirely separate problems can lead to suboptimal or decaying solutions.
Figure 4 and Table I show the success rates of the three different agent types in the 3 and 4 room mazes, where success is defined as reaching the end of the maze.
Using the individually tuned cooperative ratios, a higher level of performance was found in the 3 room domain compared to using the same cooperative ratio for both policies.
Using cooperative ratios of 0.1 and 1.0 for the first two policies respectively resulted in the highest level of success (78%), outperforming the policies trained using a shared cooperative ratio (54%). Some insights can be gathered from analysing the effect on success rate when changing the cooperative ratio by looking at Table I and Figure 6.
This demonstrates the value of splitting a task into subtasks and then solving them cooperatively.
Across all room configurations the TP baseline had a suboptimal level of performance.
This method learns a set of transition policies that attempts to manipulate the agent from the termination state of a subtask to a good starting state for the subsequent subtask.
Due to the way the doorways are arranged within the maze, a subtask that concludes at a dead end is too far from a good starting state for the next subtask for a transition policy to be able to rectify.
The low success rate in this domain is due to the difficulty of using contact physics.
If the agent grasps the peg in the wrong way or applies too much force, the peg can be pushed into a pose that is unreachable for the robot, making the episode unsolvable.
Future works will seek to address these issues and make this method more robust.
Fig. 6. The measured success rate in the 3 room maze domain with different cooperative ratios for each cooperative policy.
The success rate is an average success rate across the last 10 epochs of each of the 10 different randomly initialised iterations of each configuration.
Blue represents a low success rate and red represents a high success rate.
comparing the cooperative ratios for the first policy.
Using a cooperative ratio of 0.5 for either policy leads to lower performance as seen in the shared ratio experiments.
The independent ratio experiments also show that using a low cooperative ratio for the first policy has a higher success rate than using a high cooperative ratio, similar to the 2 room results in the shared ratio experiments.
VI. CONCLUSION
This paper introduces the CCP method for cooperatively solving multi-stage tasks.
This method was tested using the SAC algorithm (implementation called CSAC) in two different domains, the maze domain and the peg in hole domain, and was compared against three other methods: a SAC agent trained end-to-end across the whole domain, a set of naive agents trained to solve each subtask greedily, and a baseline HRL algorithm for sequential tasks, the Transition Policies algorithm.
The CCP method outperformed each of the other methods in the maze domain, as summarised in Table I and shown in Figure 4.
In the simplest domain (2 room maze), CSAC converged on a solution 4 times faster than the single agent and was able to maintain a high level of performance that the naive policies were not able to maintain, while the TP baseline was unable to solve the domain.
In the more complex domains (3 and 4 room mazes), the cooperative policies had a consistently higher level of performance than the naive policies and TP baseline, whereas the single agent was not able to find any solution to the task within 3 million training steps.
Similar results were found in the peg in hole domain (Figure 5), in which the algorithm using CSAC had a success rate approximately 30% higher than the other methods.
Additionally, the cooperative ratio variable is required to be tuned.
REFERENCES
[1] O. Nachum, H. Tang, X. Lu, S. Gu, H. Lee, and S. Levine, “Why does hierarchy (sometimes) work so well in reinforcement learning?” arXiv preprint arXiv:1909.10618, 2019.
[2] O. Nachum, S. Gu, H. Lee, and S. Levine, “Near-optimal representation learning for hierarchical reinforcement learning,” arXiv preprint arXiv:1810.01257, 2018.
[3] X. B. Peng, M. Chang, G. Zhang, P. Abbeel, and S. Levine, “Mcp: Learning composable hierarchical control with multiplicative compositional policies,” arXiv preprint arXiv:1905.09808, 2019.
[4] S. Li, R. Wang, M. Tang, and C. Zhang, “Hierarchical reinforcement learning with advantage-based auxiliary rewards,” arXiv preprint arXiv:1910.04450, 2019.
[5] M. Wulfmeier, D. Rao, R. Hafner, T. Lampe, A. Abdolmaleki, T. Hertweck, M. Neunert, D. Tirumala, N. Siegel, N. Heess et al., “Data-efficient hindsight off-policy option learning,” in International Conference on Machine Learning. PMLR, 2021, pp. 11340–11350.
[6] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu, “Feudal networks for hierarchical reinforcement learning,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 3540–3549.
[7] O. Nachum, S. S. Gu, H. Lee, and S. Levine, “Data-efficient hierarchical reinforcement learning,” in Advances in Neural Information Processing Systems, 2018, pp. 3303–3313.
[8] M. Al-Emran, “Hierarchical reinforcement learning: a survey,” International Journal of Computing and Digital Systems, vol. 4, no. 02, 2015.
[9] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.
[11] S. Iqbal and F. Sha, “Actor-attention-critic for multi-agent reinforcement learning,” in International Conference on Machine Learning. PMLR, 2019, pp. 2961–2970.
[12] J. Oh, S. Singh, H. Lee, and P. Kohli, “Zero-shot task generalization with multi-task deep reinforcement learning,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 2661–2670.
[13] R. T. Icarte, T. Klassen, R. Valenzano, and S. McIlraith, “Using reward machines for high-level task specification and decomposition in reinforcement learning,” in International Conference on Machine Learning, 2018, pp. 2107–2116.
[14] P.-L. Bacon, J. Harb, and D. Precup, “The option-critic architecture,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[15] A. Azarafrooz and J. Brock, “Hierarchical soft actor-critic: Adversarial exploration via mutual information optimization,” arXiv preprint arXiv:1906.07122, 2019.
[16] A. Gupta, V. Kumar, C. Lynch, S. Levine, and K. Hausman, “Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning,” arXiv preprint arXiv:1910.11956, 2019.
[17] C. Tessler, S. Givony, T. Zahavy, D. Mankowitz, and S. Mannor, “A deep hierarchical approach to lifelong learning in minecraft,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017.
[18] J. Andreas, D. Klein, and S. Levine, “Modular multitask reinforcement learning with policy sketches,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ser. ICML’17. JMLR.org, 2017, pp. 166–175.
[19] T. Haarnoja, K. Hartikainen, P. Abbeel, and S. Levine, “Latent space policies for hierarchical reinforcement learning,” arXiv preprint arXiv:1804.02808, 2018.
[20] D. Esteban, L. Rozo, and D. G. Caldwell, “Hierarchical reinforcement learning for concurrent discovery of compound and composable policies,” arXiv preprint arXiv:1905.09668, 2019.
[21] A. Sharma, S. Gu, S. Levine, V. Kumar, and K. Hausman, “Dynamics-aware unsupervised discovery of skills,” arXiv preprint arXiv:1907.01657, 2019.
[23] C. Florensa, Y. Duan, and P. Abbeel, “Stochastic neural networks for hierarchical reinforcement learning,” arXiv preprint arXiv:1704.03012, 2017.
[24] Y. Lee, S.-H. Sun, S. Somasundaram, E. S. Hu, and J. J. Lim, “Composing complex skills by learning transition policies,” in International Conference on Learning Representations, 2018.
[25] Y. Lee, J. J. Lim, A. Anandkumar, and Y. Zhu, “Adversarial skill chaining for long-horizon robot manipulation via terminal state regularization,” arXiv preprint arXiv:2111.07999, 2021.
[26] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” arXiv preprint arXiv:1801.01290, 2018.
[27] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel et al., “Soft actor-critic algorithms and applications,” arXiv preprint arXiv:1812.05905, 2018.
[28] Vitchyr, “vitchyr/rlkit.” [Online]. Available: https://github.com/vitchyr/rlkit
[29] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning. PMLR, 2015, pp. 1889–1897.
VII. APPENDIX 1
The implementation used in this paper was based on the RLkit implementation of SAC [28]. This implementation utilises an epoch based approach. All methods use the hyperparameters shown in Table II. These hyperparameters were adapted from the RLkit SAC implementation [28]. For each experiment, 5 random seeds were used per method and domain, with 3 used for the transition policies experiments. It is recommended that if this algorithm is recreated, the batch size is at least as large as in this paper, to ensure that the convex combination of critics is effective across each batch. It is also recommended that the discount factor γ is not increased above 0.95, to ensure the method does not become unstable when estimating too far into the future. The plots in this paper that show the performance of the TP algorithm only represent the training of the transition policies themselves. The primitive policies that learned to solve each subtask were trained separately, and then each run that the TP algorithm underwent used this same set of primitive policies.

VIII. APPENDIX 2
This section proves that using a convex combination of two critic functions can be used for the purposes of training a policy to solve a task.

A. Preliminaries
Consider a task that has an action space A, a state space S, and a reward signal R. This task can be decomposed into a series of N subtasks such that each subtask has an action space A_n ⊆ A, a state space S_n ⊆ S, and a reward signal R_n ∈ [0, 1]. Each subtask reward signal is defined by

R_n(s, a) =
    ∈ [0, 1]    if s ∈ S_n,
    0           if s ∈ S_m, m ∈ [1, n),
    1           if s ∈ S_m, m ∈ (n, N].

The reward signal R for the overall task is defined as

R(s, a) = Σ_{n=1}^{N} R_n(s, a).
The Q function is the discounted sum of future rewards,

Q(s_t, a_t) = Σ_{i=t} γ^{i−t} R(s_i, a_i),

where π is a policy chosen to maximise the Q function and the environment transitions deterministically.

B. Theorem
A convex combination of two normalised Q functions can be used in the place of the sum of two Q functions for the purposes of picking an optimal action.
The original maximisation objective can be recovered using the correct weighting.
A standard policy π selects actions that maximise a Q function such that

Q(s, π(s)) = max_{a∈A} Q(s, a),    (10)

This can be decomposed into two Q functions according to Lemma 1.
1) Lemma 1: A Q function for a task involving multiple sequential subtasks can be approximated as the sum of two Q functions, each corresponding to the current and subsequent subtasks.
The above formulation for a cooperative Q function uses the weighting variable η to represent both this natural weighting of time discounting of rewards, as well as the weighting that represents balancing between the current and subsequent subtasks.
To remove this natural time discounting, a normalised cooperative Q function ˆCn is introduced.
Ĉ_n = η Q̂_n(s, a) + (1 − η) Q̂_{n+1}(s, a),    (15)

where Q̂ is a normalisation such that
Q̂(s, a) = (Q(s, a) − min_{a∈A} Q(s, a)) / (max_{a∈A} Q(s, a) − min_{a∈A} Q(s, a))    (16)

To simplify this operation, the following functions are defined,
Ran(Q) = max_{a∈A} Q(s, a) − min_{a∈A} Q(s, a)    (17)

Min(Q) = min_{a∈A} Q(s, a)    (18)
Using these, it can be shown that using the correct η can lead to the same action.
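As a quick numeric illustration of this claim, the following sketch checks one natural choice of weighting, η = Ran(Q_n) / (Ran(Q_n) + Ran(Q_{n+1})); that this is the "correct" η is an assumption made for the sketch, consistent with the definitions above, under which the convex combination of the normalised critics equals (Q_n + Q_{n+1}) up to an additive constant and a positive scale, so the argmax action is unchanged.

```python
# Numeric sanity check: the weighted, normalised combination picks the same
# action as the sum of the two unnormalised critics.
import numpy as np

rng = np.random.default_rng(0)
q1 = rng.normal(size=10)   # Q_n(s, a) over 10 candidate actions
q2 = rng.normal(size=10)   # Q_{n+1}(s, a) over the same actions


def ran(q):  # Ran(Q), Equation 17
    return q.max() - q.min()


def normalise(q):  # Q_hat, Equation 16
    return (q - q.min()) / ran(q)


eta = ran(q1) / (ran(q1) + ran(q2))          # assumed "correct" weighting
c = eta * normalise(q1) + (1 - eta) * normalise(q2)

assert np.argmax(c) == np.argmax(q1 + q2)    # same optimal action
```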