Humans are capable of adjusting to changing environments flexibly and
quickly. Empirical evidence has revealed that representation learning plays a
crucial role in endowing humans with such a capability. Inspired by this
observation, we study representation learning in the sequential decision-making
scenario with contextual changes. We propose an online algorithm that is able
to learn and transfer context-dependent representations and show that it
significantly outperforms the existing ones that do not learn representations
adaptively. As a case study, we apply our algorithm to the Wisconsin Card
Sorting Task, a well-established test for the mental flexibility of humans in
sequential decision-making. By comparing our algorithm with the standard
Q-learning and Deep-Q learning algorithms, we demonstrate the benefits of
adaptive representation learning.
Real-world decision-making is challenging, since environments are often complex and rapidly changing.
Yet, human beings have shown the remarkable ability to make good decisions in such environments.
At the core of this ability is the flexibility to adapt their behavior to different situations [1].
Such adaptation is usually fast, since humans learn to abstract experiences into compact representations that support the efficient construction of new strategies [2].
Lacking the ability to adapt to new environments and abstract compressed information from experiences, existing learning techniques often struggle in complex scenarios that undergo contextual changes.
The Wisconsin Card Sorting Task (WCST) is one of the most frequently used neuropsychological tests to assess people's ability to abstract information and shift between contexts [3].
As illustrated in Fig. 1, participants are initially given four cards and are required to associate a sequence of stimulus cards with these four cards according to one of the sorting rules – number, color, or shape.
They receive feedback indicating whether their sorting action is correct or incorrect.
What makes the task more challenging is that the sorting rule changes every once in a while without informing the participants.
Thus, the participants need to learn the changes and adjust their strategy.
Y. Qin and F. Pasqualetti are with the Department of Mechanical Engineering ({yuzhenqin,fabiopas}@engr.ucr.edu), and S. Oymak is with the Department of Electrical and Computer Engineering (oymak@ece.ucr.edu), University of California, Riverside, CA, USA.
Here, we consider that participants receive reward 1 for a correct sorting action and 0 otherwise.
Fig. 1 caption: Each shaded area contains 20 realizations of the corresponding algorithm.
Healthy humans usually perform very well in the WCST.
Some neuroimaging studies have found that different brain regions, such as the dorsolateral prefrontal cortex and the anterior cingulate cortex, play crucial roles in context shifting, error detection, and abstraction, all of which are needed by the WCST [4].
By contrast, classical learning algorithms such as tabular-Q-learning and Deep-Q-learning struggle in the WCST, especially when the sorting rule changes rapidly.
It can be seen from Fig. 1 that standard reinforcement learning (RL) algorithms perform barely better than a strategy that takes random sorting actions at every round.
Motivated by these observations, we aim to develop decision-making strategies that have more human-like performance.
In this paper, we focus on demonstrating the benefits of the ability to abstract compact information (i.e., learn the representation) and adapt to changing contexts in the framework of a sequential decision-making model – linear multi-armed bandits.
Various generalizations of the classical bandit problem have been studied, in which nonstationary reward functions [7], [8], restless arms [9], satisficing reward objectives [10], risk-averse decision-makers [11], heavy-tailed reward distributions [12], and multiple players [13] are considered.
Representation learning has been applied to a wide range of practical problems including natural language processing, computer vision, and reinforcement learning [18].
Given a matrix $A \in \mathbb{R}^{m\times n}$, $\mathrm{span}(A)$ denotes its column space, $A_\perp$ denotes the matrix whose orthonormal columns form a basis of the perpendicular complement of $\mathrm{span}(A)$, $\|A\|_F$ denotes its Frobenius norm, and $[A]_i$ denotes its $i$th column. For any $x \in \mathbb{R}_+$, $\lceil x\rceil$ denotes the smallest integer larger than $x$. Given two functions $f, g : \mathbb{R}_+ \to \mathbb{R}_+$, we write $f(x) = O(g(x))$ if there exist $M_O > 0$ and $x_0 > 0$ such that $f(x) \le M_O\, g(x)$ for all $x \ge x_0$, and $f(x) = \tilde O(g(x))$ if $f(x) = O(g(x)\log^k(x))$. Also, we write $f(x) = \Omega(g(x))$ if there exist $M_\Omega > 0$ and $x_0 > 0$ such that $f(x) \ge M_\Omega\, g(x)$ for all $x \ge x_0$, and $f(x) = \Theta(g(x))$ if $f(x) = O(g(x))$ and $f(x) = \Omega(g(x))$.
II. PROBLEM SETUP

Motivated by real-world tasks like the WCST, we consider the following sequential decision-making model:
$$y_t = x_t^\top \theta_{\sigma(t)} + \eta_t, \qquad (1)$$
where $x_t \in \mathcal{A} \subseteq \mathbb{R}^d$ is the action taken from the action set $\mathcal{A}$ at round $t$, and $y_t \in \mathbb{R}$ is the reward received by the agent (i.e., the decision maker).
The reward depends on the action in a linear way determined by the unknown coefficient θσ(t), and is also affected by the 1-sub-Gaussian noise ηt that models the uncertainty.
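To fix ideas, the following is a minimal simulation sketch of model (1) in Python/NumPy. The Gaussian noise, the dimensions, and the way the task coefficient is drawn from a low-dimensional subspace are illustrative assumptions, not part of the formal setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 10, 2   # ambient dimension and representation dimension (illustrative values)

# A context is a low-dimensional subspace: theta = B @ alpha, with B having orthonormal columns.
B, _ = np.linalg.qr(rng.standard_normal((d, r)))   # hypothetical d x r representation
alpha = rng.standard_normal(r)
theta = B @ alpha                                  # task coefficient theta_{sigma(t)} for the current context

def play_round(x, theta, rng):
    """Model (1): y_t = x_t^T theta_{sigma(t)} + eta_t, here with standard Gaussian noise."""
    return float(x @ theta) + rng.standard_normal()

x = rng.standard_normal(d)
x /= np.linalg.norm(x)      # keep the action inside the unit-ball action set A
y = play_round(x, theta, rng)
print(y)
```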
In each $S_k$ there are $n_k$ tasks $\theta^k_1, \dots, \theta^k_{n_k}$ ($n_k$ can be infinite), and we assume that they share a common linear feature extractor.
Different sets have different feature extractors.
Specifically, there is a matrix $B_k \in \mathbb{R}^{d\times r_k}$ with orthonormal columns such that, for any $\theta^k_i$, there exists $\alpha^k_i \in \mathbb{R}^{r_k}$ with $\theta^k_i = B_k\alpha^k_i$ (see Fig. 2). For simplicity, we assume that all the extractors have the same dimension $r$, i.e., $r_k = r$ for all $k$.
Here, each of these mutually different matrices $B_1, \dots, B_m$ is also referred to as a linear representation [26] for the tasks in the respective set.
For different contexts, participants usually need to abstract distinct low-dimensional features.
Similar to the WCST in which participants do not know when the sorting rule changes, we further assume that the agent is not informed when θσ(t) starts to take values from a different task set.
The agent's objective is then equivalent to minimizing the regret $R_N^S$.
We next make some standard assumptions on the action set $\mathcal{A}$ and the task coefficients following existing studies (e.g., see [27], [28]), which are assumed to hold throughout the remainder of this paper.
(a) the action set $\mathcal{A}$ is the unit ball centered at the origin, i.e., $\mathcal{A} := \{x \in \mathbb{R}^d : \|x\| \le 1\}$; and
(b) there are positive constants $\varphi_{\min}$ and $\varphi_{\max}$ such that $\varphi_{\min} \le \|\theta_s\| \le \varphi_{\max}$ for all $s \in \{1, 2, \dots, S\}$.
Inspired by humans’ strategy, we seek to equip the agent with the ability to learn and exploit representations and to quickly adjust to contextual changes so that it can perform well even in complex environments with context changes.
Lemma 3.1 (Classical Lower Bound): Let P be the set of all policies, and I be the set of all the possible tasks.
Then, for any $d \in \mathbb{Z}_+$ and $N > d^2$, the regret $R_N$ for the task (2) satisfies $\inf_{\mathcal{P}} \sup_{\mathcal{I}} \mathbb{E}\,R_N = \Omega(d\sqrt{N})$. $\triangle$
This lemma indicates that there is a constant $c > 0$ such that the expected regret incurred by any policy is no less than $c\,d\sqrt{N}$ for any $d \in \mathbb{Z}_+$ and $N > d^2$. Next, we show how some additional information on $\theta$ affects this lower bound.
Lemma 3.2 (Lower Bound with a Representation): Suppose there is a known matrix B ∈ Rd×r with r < d such that θ = Bα for some α ∈ Rr.
Let P be the set of all policies, and I be the set of all the possible tasks.
Then, for any $d \in \mathbb{Z}_+$ and $N > d^2$, the regret $R_N$ for the task (2) satisfies $\inf_{\mathcal{P}} \sup_{\mathcal{I}} \mathbb{E}\,R_N = \Omega(r\sqrt{N})$. $\triangle$
Proof: Let $z_t = B^\top x_t$; then the model in (2) becomes $y_t = z_t^\top \alpha + \eta_t$.
As a consequence, the problem reduces to a task of dimension $r$ instead of $d$.
Following similar steps as in [27], it can be shown that the minimax lower bound for the regret is $\inf_{\mathcal{P}} \sup_{\mathcal{I}} \mathbb{E}\,R_N = \Omega(r\sqrt{N})$, which completes the proof.
Comparing Lemma 3.2 with Lemma 3.1, one finds that the regret lower bound decreases dramatically if $r \ll d$.
This is because, with the knowledge of the representation B ∈ Rd×r, one does not need to explore the entire Rd space to learn the task coefficient θ for decision-making.
Instead, one only needs to learn $\alpha$ by exploring the much lower-dimensional subspace $\mathrm{span}(B)$ and then estimate $\theta$ by $\hat\theta = B\hat\alpha$.
As a consequence, θ can be learned much more efficiently, which helps the agent make better decisions at earlier stages.
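As a hedged illustration of this reduction, the sketch below estimates $\alpha$ by least squares from the projected actions $z_t = B^\top x_t$ and lifts the estimate back via $\hat\theta = B\hat\alpha$; the data-generating choices (noise level, action distribution, sizes) are illustrative assumptions rather than part of the analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, T = 20, 3, 200   # illustrative sizes

B, _ = np.linalg.qr(rng.standard_normal((d, r)))   # known representation (the setting of Lemma 3.2)
alpha_true = rng.standard_normal(r)
theta_true = B @ alpha_true

X = rng.standard_normal((T, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # actions on the unit sphere
y = X @ theta_true + rng.standard_normal(T)        # noisy rewards from model (1)

# Learn only the r coordinates alpha in span(B), then lift back to R^d.
Z = X @ B                                          # z_t = B^T x_t
alpha_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
theta_hat = B @ alpha_hat                          # theta_hat = B alpha_hat

print(np.linalg.norm(theta_hat - theta_true))      # small, although only r directions were explored
```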
Yet, such a representation B is typically unknown beforehand.
The agent usually needs to estimate B from its experiences before utilizing it.
In the next subsection, we show how to explore and transfer the representation in the setting of sequential tasks.
B. Representation learning in sequential tasks
Representation learning in the setting of sequential tasks is challenging, particularly when the agent has no knowledge of the number of sequential tasks that share the same representation.
There is a trade-off between the need to explore more tasks to construct a more accurate estimate of the underlying representation and the incentive to exploit the learned representation for more efficient learning and higher instant rewards.
To investigate how to balance the trade-off, we consider that the agent plays τ tasks in sequence, i.e., T = {θ1, θ2, . . . , θτ}, without knowing the number of tasks τ.
Algorithm 1 Representation Exploration (RE)
Input: horizon $N$, exploration length $N_1 = \lceil d\sqrt{N}\rceil$;
for $t = 1 : N_1$ do
  take $x_t = a_i$, $i = (t - 1 \bmod d) + 1$, where $[a_1, \dots, a_d]$ is any orthonormal basis of $\mathbb{R}^d$;
compute $\hat\theta = (X_{re}X_{re}^\top)^{-1}X_{re}Y_{re}$, where $X_{re} = [x_1, \dots, x_{N_1}]$ and $Y_{re} = [y_1, \dots, y_{N_1}]^\top$;
for $t = N_1 + 1 : N$ do
  commit to actions based on $\hat\theta$.
We propose an algorithm, the sequential representation learning algorithm (SeqRepL, see Algorithm 3), that alternates between two sub-algorithms – representation exploration (RE) and representation transfer (RT) algorithms.
Let us first elaborate on each of these two sub-algorithms in turn.
The RE algorithm is an explore-then-commit (ETC) algorithm, which contains two phases: exploration and commitment, consisting of $N_1$ and $N - N_1$ rounds, respectively.
The central goal of RE is to construct an accurate ˆθi for each task so that the collection of ˆθi’s can recover an accurate representation ˆB (which will be shown soon).
Meanwhile, we want to ensure that the algorithm does not incur too much regret.
To strike this balance, we set the exploration length $N_1 = \lceil d\sqrt{N}\rceil$.
The exploration phase is accomplished on the entire Rd space, in which d linearly independent actions are repeatedly taken in sequence.
Then, $\theta$ is estimated by the least-squares regression $\hat\theta = (X_{re}X_{re}^\top)^{-1}X_{re}Y_{re}$, where $X_{re} = [x_1, \dots, x_{N_1}]$ and $Y_{re} = [y_1, \dots, y_{N_1}]^\top$.
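The following is a minimal sketch of the RE exploration phase and its least-squares estimate, assuming Gaussian noise, the canonical basis as the orthonormal action basis, and a toy task vector; it is not the paper's exact RE pseudocode.

```python
import numpy as np

def representation_exploration(theta, N, rng):
    """Sketch of the RE phase: cycle through an orthonormal basis for N1 = ceil(d*sqrt(N)) rounds,
    then estimate theta by least squares from the exploration data (X_re, Y_re)."""
    d = theta.shape[0]
    N1 = int(np.ceil(d * np.sqrt(N)))
    basis = np.eye(d)                              # any orthonormal basis of R^d works
    X_re, Y_re = [], []
    for t in range(N1):
        x = basis[t % d]                           # x_t = a_i with i = (t-1 mod d) + 1 (0-indexed here)
        y = x @ theta + rng.standard_normal()      # reward from model (1), Gaussian noise assumed
        X_re.append(x)
        Y_re.append(y)
    X_re = np.array(X_re).T                        # d x N1, columns are the exploration actions
    Y_re = np.array(Y_re)
    theta_hat = np.linalg.solve(X_re @ X_re.T, X_re @ Y_re)   # (X_re X_re^T)^{-1} X_re Y_re
    return theta_hat, N1

rng = np.random.default_rng(2)
d, N = 8, 10_000
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)
theta_hat, N1 = representation_exploration(theta, N, rng)
print(N1, np.linalg.norm(theta_hat - theta))       # exploration length and estimation error
```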
With a perfect estimate $\hat B = B$, the RT algorithm can achieve a regret upper bounded by $O(r\sqrt{N})$. This can be proven straightforwardly since the original model can be rewritten into an $r$-dimensional one, $y_t = z_t^\top\alpha + \eta_t$, by letting $z_t = \hat B^\top x_t$.
Yet, constructing a perfect ˆB is usually impossible given the noisy environment.
The next theorem provides an upper bound on the regret of RT when there is some error between $\hat B$ and $B$.
Theorem 3.3 (Upper Bound Given an Estimated Representation): Assume that an estimate $\hat B$ of the true representation $B$ satisfies $\|\hat B^\top B_\perp\|_F \le \varepsilon$.
This implies that knowledge of an imperfect estimate of the representation improves the performance as long as it is sufficiently accurate (i.e., small $\|\hat B^\top B_\perp\|_F$).
Proof of Theorem 3.3: Since $\theta = B\alpha$, the model becomes $y_t = x_t^\top B\alpha + \eta_t$, where $\eta_t$ is an independent random variable with zero mean.
It can be derived (more details can be found in the extended version of this paper [29]) that $\mathbb{E}\|s_1\|^2 \le 2c\varphi_{\max}^2\varepsilon^2$ for some constant $c$ and $\mathbb{E}\|s_2\|^2 \le r/\sqrt{N}$. Combining the bounds on $\mathbb{E}\|s_1\|^2$ and $\mathbb{E}\|s_2\|^2$ completes the proof.
Algorithm 3 Sequential Representation Learning (SeqRepL)
Initialize: $n = 1$;
repeat:
  RE phase: play the next $L$ tasks in the sequence using the RE algorithm, and update $\hat P = \hat P + \hat\theta_i\hat\theta_i^\top$ for each resulting estimate $\hat\theta_i$;
  $\hat B \leftarrow$ top-$r$ singular vectors of $\hat P$;
  RT phase: play the next $nL$ tasks in the sequence using the RT algorithm with the latest $\hat B$;
  update $n = n + 1$.
$\hat B$ is constructed by performing a singular value decomposition (SVD) of $\hat P$ in the following way:
$$\mathrm{SVD}: \ \hat P = [U_1, U_2]\,\Sigma V^\top \ \longrightarrow\ \hat B = U_1,$$
where the columns of $U_1 \in \mathbb{R}^{d\times r}$ are the singular vectors associated with the $r$ largest singular values of $\hat P$.
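A minimal NumPy sketch of this SVD step is shown below; the per-task estimates $\hat\theta_i$ are simulated here by adding noise to the true coefficients, which is only a stand-in for the RE estimates.

```python
import numpy as np

rng = np.random.default_rng(3)
d, r, n_tasks = 12, 2, 30                                 # illustrative sizes

B_true, _ = np.linalg.qr(rng.standard_normal((d, r)))     # unknown shared representation
thetas = (B_true @ rng.standard_normal((r, n_tasks))).T   # tasks theta_i = B alpha_i

# Accumulate P_hat = sum_i theta_hat_i theta_hat_i^T from per-task estimates.
P_hat = np.zeros((d, d))
for theta in thetas:
    theta_hat = theta + 0.05 * rng.standard_normal(d)     # stand-in for an RE estimate of theta_i
    P_hat += np.outer(theta_hat, theta_hat)

# SVD: P_hat = [U1, U2] Sigma V^T  ->  B_hat = U1 (top-r singular vectors).
U, _, _ = np.linalg.svd(P_hat)
B_hat = U[:, :r]

# Subspace error ||B_hat^T B_perp||_F, computed via the projector onto span(B_true).
err = np.linalg.norm(B_hat.T @ (np.eye(d) - B_true @ B_true.T), ord="fro")
print(err)
```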
In the RT phase, nL tasks are played using RT with the estimated ˆB.
Notice that L more tasks are played using RT in each cycle than the previous one.
This alternating scheme balances representation exploration and transfer well.
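To make the alternating schedule concrete, the following sketch (with illustrative values of $\tau$ and $L$) lays out the cycles: cycle $n$ explores $L$ tasks with RE and then transfers to $nL$ tasks with RT, so a sequence of $\tau$ tasks is covered within roughly $\lceil\sqrt{2\tau/L}\rceil$ cycles, as used in the proof of Theorem 3.5 below.

```python
def seqrepl_schedule(tau, L):
    """Sketch of the SeqRepL cycle structure: cycle n plays L tasks with RE and n*L tasks with RT."""
    schedule, played, n = [], 0, 1
    while played < tau:
        re_tasks = min(L, tau - played)
        played += re_tasks
        rt_tasks = min(n * L, tau - played)
        played += rt_tasks
        schedule.append((n, re_tasks, rt_tasks))
        n += 1
    return schedule

# Example with illustrative values: tau = 50 tasks, L = 4 tasks explored per cycle.
for n, re_tasks, rt_tasks in seqrepl_schedule(tau=50, L=4):
    print(f"cycle {n}: RE on {re_tasks} tasks, RT on {rt_tasks} tasks")
```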
Next, we make an assumption and provide an upper bound for SeqRepL.
Assumption 3.4: For the task sequence $\mathcal{T} = \{\theta_1, \dots, \theta_\tau\}$, suppose that there exists $L = c_1 r$ for some constant $c_1 > 0$ such that any subsequence of length $L$ in $\mathcal{T}$ satisfies $\sigma_r(W_s W_s^\top) \ge \nu > 0$ for any $s$, where $W_s = [\theta_{s+1}, \dots, \theta_{s+L}]$ and $\sigma_r(\cdot)$ denotes the $r$th largest singular value of a matrix. $\triangle$
This assumption states that the sequential tasks cover all the directions of the $r$-dimensional subspace $\mathrm{span}(B)$, which ensures that $B$ can be recovered in a sequential fashion.
Theorem 3.5: Suppose that Assumption 3.4 is satisfied. Then the regret of SeqRepL, denoted by $R_N^\tau$, satisfies $\mathbb{E}\,R_N^\tau = \tilde O\big(d\sqrt{\tau r N} + \tau r\sqrt{N}\big)$. $\triangle$
Note that if one uses a standard algorithm, e.g., a UCB algorithm [25] or a PEGE algorithm [27], to play the sequence of tasks without learning the representation, the optimal regret would be $\Theta(\tau d\sqrt{N})$.
This bound is always larger than both terms in our bound since $\tau > r^2$.
This indicates that our algorithm outperforms the standard algorithms that do not learn the representations.
Proof of Theorem 3.5: After the RE phase of the $n$th cycle of the SeqRepL algorithm, it can be derived (more details can be found in the extended version of this paper [29]) that the estimate $\hat B$ and the true representation $B$ satisfy $\|\hat B^\top B_\perp\|_F = \tilde O\big(\tfrac{d}{\nu}\sqrt{1/(nL\sqrt{N})}\big)$. The regret incurred in this phase of the $n$th cycle, denoted by $R_{\mathrm{RE}}(n)$, satisfies $R_{\mathrm{RE}}(n) = O(Ld\sqrt{N})$. Then, $nL$ tasks are played in sequence utilizing the RT algorithm with input $\hat B$. It follows from Theorem 3.3 that the regret in the RT phase of the $n$th cycle, denoted by $R_{\mathrm{RT}}(n)$, satisfies $\mathbb{E}\,R_{\mathrm{RT}}(n) = \tilde O\big(nLr\sqrt{N} + \tfrac{nLd\sqrt{N}}{\nu}\sqrt{1/(nL\sqrt{N})}\big)$.
Observe that there are at most $\bar L = \lceil\sqrt{2\tau/L}\,\rceil$ cycles in the sequence of length $\tau$, since $L\bar L + L\bar L(\bar L + 1)/2 \ge \tau$. Summing up the regret in Phases 1 and 2 (the RE and RT phases) of every cycle, we obtain the bound in the theorem, which completes the proof.
Algorithm 4 Outlier Detection (OD)
Input: $\hat B \in \mathbb{R}^{d\times r}$, $n_{od}$;
generate a random orthonormal matrix $Q \in \mathbb{R}^{(d-r)\times n_{od}}$, and let $M = \hat B_\perp Q$.
These actions are randomly generated in the perpendicular complement of span( ˆB).
Specifically, we generate a random orthonormal matrix Q ∈ R(d−r)×nod.
The probing actions are taken from the columns of the matrix $M = \delta\hat B_\perp Q$, where $\delta > 0$ ensures that the actions are within the action set $\mathcal{A}$. If the current task $\theta$ satisfies $\theta = \hat B\alpha$ for some $\alpha$, it holds that $y_t = x_t^\top\theta + \eta_t = \eta_t$ since $Q^\top\hat B_\perp^\top\hat B\alpha = 0$. Therefore, if the received rewards considerably deviate from the level of noise, the new task is an outlier to the current context (i.e., a task that does not lie in the subspace $\mathrm{span}(B)$) with high probability.
If the observed Yod is beyond Cnod, we decide that the new task is an outlier.
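A hedged sketch of the probing test is given below. The exact statistic $Y_{od}$ and threshold $C_{n_{od}}$ are not reproduced here; a simple average of squared probing rewards compared against an illustrative threshold stands in for them.

```python
import numpy as np

def detect_outlier(B_hat, theta, n_od, rng, delta=1.0, threshold=2.0):
    """Probe the current task with actions in the perpendicular complement of span(B_hat).
    If theta lies in span(B_hat), the probing rewards are pure noise; unusually large rewards
    flag an outlier.  The statistic and threshold are illustrative stand-ins for Y_od and C_{n_od}."""
    d, r = B_hat.shape
    # Orthonormal basis of the perpendicular complement of span(B_hat).
    B_perp = np.linalg.svd(np.eye(d) - B_hat @ B_hat.T)[0][:, : d - r]
    Q, _ = np.linalg.qr(rng.standard_normal((d - r, n_od)))
    M = delta * B_perp @ Q                       # probing actions (columns), orthogonal to span(B_hat)
    rewards = M.T @ theta + rng.standard_normal(n_od)
    return np.mean(rewards**2) > threshold       # True: likely an outlier to the current context

rng = np.random.default_rng(4)
d, r = 10, 2
B_hat, _ = np.linalg.qr(rng.standard_normal((d, r)))
theta_in = B_hat @ rng.standard_normal(r)        # task inside the current context
theta_out = 3.0 * rng.standard_normal(d)         # task from a different context (illustrative)
print(detect_outlier(B_hat, theta_in, n_od=5, rng=rng),
      detect_outlier(B_hat, theta_out, n_od=5, rng=rng))
```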
The main algorithm in this paper, which we call the Adaptive Representation Learning algorithm (AdaRepL), is provided in Algorithm 5; it invokes both the SeqRepL and OD sub-algorithms.
The former well balances representation exploration and transfer in the sequential setting, and the latter enables the algorithm to adapt to changing environments.
To make our algorithm robust to occasional outliers, we set a threshold kc so that the algorithm considers that a context switch has occurred only when kc outliers have been detected consecutively.
Fig. 3 (right): Performance comparison between our algorithm and standard RL algorithms in the WCST. Sorting rules change every 20 rounds. Dotted circles indicate that our algorithm is able to adapt to new contexts and learn new representations quickly.
It is worth mentioning that with the aid of the OD algorithm, the agent can detect context changes with high probability by properly selecting the detection threshold ξod and the length of probing actions nod.
Within each context, the regret of AdaRepL has an upper bound presented in Theorem 3.5.
Although context change detection incurs some regret, the overall performance will still surpass the standard algorithms that are unable to learn representations adaptively.
For tabular Q-learning, the problem is to construct the $4^3 \times 4$ Q table.
This is because there are $4^3 = 64$ possible stimulus cards (4 colors, 4 numbers, 4 shapes), each stimulus card can be taken as a state, and there are 4 sorting actions.
Also, being unaware of the sorting rule changes worsens the performance.
Next, we demonstrate that our proposed algorithm, which explores and exploits the representation in the WCST and detects sorting rule changes, achieves much better performance.
To do so, we model the WCST as a sequential decision-making model.
Specifically, we use a matrix At ∈ R4×3 to describe the stimulus card at round t.
The first, second, and third columns of At represent shape, number, and color, respectively, and they take values from the set {e1, e2, e3, e4}, with ei being the ith standard basis vector of R4.
In each column, ei indicates that this card has the same shape/number/color as the ith card on the table.
For example, the stimulus card (with two green circles) in Fig. 1 can be represented by the matrix A = [e1, e2, e3] (see Fig. 3).
Moreover, we use a standard unit vector Bσ, which takes values from {b1, b2, b3} with bi being the ith standard basis vector of R3, to describe the 3 sorting rules – shape, number, and color, respectively.
Here the unit vector Bσ can be taken as the current representation since the correct sorting action can always be computed by x∗t = AtBσ no matter what card the agent sees.
For instance, suppose the rule is number (i.e., Bσ = b2). If the agent sees the stimulus card with two green circles, i.e., A = [e1, e2, e3], then the correct sort is the second card on the table, since it can be computed that x∗t = AtBσ = [0, 1, 0, 0]^⊤.
The problem then reduces to learning the underlying representation Bσ, a task that is much easier than constructing the Q table or training the weights of a Deep-Q network.
Remarkably, one does not even need to learn the individual $\theta_t$'s. Instead, $B_\sigma$ can be recovered by $\big(\sum_{t} A_t^\top x_t x_t^\top A_t\big)^{-1}\sum_{t} A_t^\top x_t y_t$ immediately after $\sum_{t} A_t^\top x_t x_t^\top A_t$ becomes invertible.
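The sketch below makes this encoding concrete: it builds the matrix $A_t$ for a stimulus card, computes the correct sort $x^*_t = A_t B_\sigma$, and recovers $B_\sigma$ by the least-squares formula above once the Gram matrix becomes invertible. The random card generator, the random actions, and the noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
e = np.eye(4)    # e_1, ..., e_4: "matches table card i" indicators
b = np.eye(3)    # b_1, b_2, b_3: sorting rules shape / number / color

# Example from the text: two green circles -> A = [e1, e2, e3]; rule "number" -> B_sigma = b2.
A_example = np.column_stack([e[0], e[1], e[2]])
print(A_example @ b[1])          # x*_t = A_t B_sigma = [0, 1, 0, 0]: sort onto the second card

# Recover B_sigma via (sum_t A_t^T x_t x_t^T A_t)^{-1} sum_t A_t^T x_t y_t from noisy rounds.
B_sigma = b[2]                                   # true (unknown) rule: color
G = np.zeros((3, 3))                             # running sum_t A_t^T x_t x_t^T A_t
h = np.zeros(3)                                  # running sum_t A_t^T x_t y_t
while np.linalg.matrix_rank(G) < 3:
    A_t = np.column_stack([e[rng.integers(4)] for _ in range(3)])  # random stimulus card encoding
    x_t = e[rng.integers(4)]                                       # a (possibly wrong) sort action
    y_t = x_t @ A_t @ B_sigma + 0.1 * rng.standard_normal()        # reward ~ 1 if correct, 0 otherwise
    G += A_t.T @ np.outer(x_t, x_t) @ A_t
    h += A_t.T @ x_t * y_t
print(np.linalg.solve(G, h))     # estimate of B_sigma as soon as the Gram matrix is invertible
```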
This indicates that our idea in this paper can apply to more general situations.
It can be observed in Fig. 3 that our algorithm significantly outperforms the other two, which demonstrates the power of being able to abstract compact representations and adapt to new environments.
Inspired by strategies taken by humans, we propose an online algorithm that is able to learn and transfer representations under the sequential setting and has the ability to adapt to changing contexts.
REFERENCES
[1] A. Radulescu, Y. S. Shin, and Y. Niv, "Human representation learning," Annual Review of Neuroscience, vol. 44, no. 1, pp. 253–273, 2021.
[2] N. T. Franklin and M. J. Frank, "Generalizing to generalize: humans flexibly switch between compositional and conjunctive structures during reinforcement learning," PLoS Computational Biology.
[3] B. R. Buchsbaum, S. Greer, W.-L. Chang, and K. F. Berman, "Meta-analysis of neuroimaging studies of the Wisconsin card-sorting task and component processes," Human Brain Mapping, vol. 25, no. 1, pp. 35–45, 2005.
[4] C.-H. Lie, K. Specht, J. C. Marshall, and G. R. Fink, "Using fMRI to decompose the neural processes underlying the Wisconsin Card Sorting Test," NeuroImage, vol. 30, no. 3, pp. 1038–1049, 2006.
[5] P. Auer, "Using confidence bounds for exploitation-exploration tradeoffs," Journal of Machine Learning Research, vol. 3, pp. 397–422, 2002.
[6] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, "Improved algorithms for linear stochastic bandits," in Advances in Neural Information Processing Systems, 2011, pp. 2312–2320.
[7] Y. Russac, C. Vernade, and O. Cappé, "Weighted linear bandits for non-stationary environments," in Advances in Neural Information Processing Systems, vol. 32, 2019.
[8] L. Wei and V. Srivastava, "Nonstationary stochastic multiarmed bandits: UCB policies and minimax regret," arXiv preprint arXiv:2101.08980, 2021.
[9] T. Gafni and K. Cohen, "Learning in restless multiarmed bandits via adaptive arm sequencing rules," IEEE Transactions on Automatic Control, vol. 66, no. 10, pp. 5029–5036, 2021.
[10] P. Reverdy, V. Srivastava, and N. E. Leonard, "Satisficing in multiarmed bandit problems," IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3788–3803, 2016.
[11] M. Malekipirbazari and O. Cavus, "Risk-averse allocation indices for multi-armed bandit problem," IEEE Transactions on Automatic Control, 2021, in press.
[12] L. Wei and V. Srivastava, "Minimax policy for heavy-tailed bandits," IEEE Control Systems Letters, vol. 5, no. 4, pp. 1423–1428, 2020.
[13] M. K. Hanawal and S. Darak, "Multi-player bandits: A trekking approach," IEEE Transactions on Automatic Control, 2021, in press.
[14] D. Kalathil, N. Nayyar, and R. Jain, "Decentralized learning for multiplayer multiarmed bandits," IEEE Transactions on Information Theory, vol. 60, no. 4, pp. 2331–2345, 2014.
[15] P. Landgren, V. Srivastava, and N. E. Leonard, "Distributed cooperative decision making in multi-agent multi-armed bandits," Automatica, vol. 125, p. 109445, 2021.
[16] U. Madhushani and N. E. Leonard, "A dynamic observation strategy for multi-agent multi-armed bandit problem," in European Control Conference, 2020, pp. 1677–1682.
[17] J. Zhu and J. Liu, "A distributed algorithm for multi-armed bandit with homogeneous rewards over directed graphs," in American Control Conference, 2021, pp. 3038–3043.
[18] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
[19] N. Tripuraneni, C. Jin, and M. I. Jordan, "Provable meta-learning of linear representations," arXiv preprint arXiv:2002.11684, 2020.
[20] S. Lale, K. Azizzadenesheli, A. Anandkumar, and B. Hassibi, "Stochastic linear bandits with hidden low rank structure," arXiv preprint arXiv:1901.09490, 2019.
[21] K.-S. Jun, R. Willett, S. Wright, and R. Nowak, "Bilinear bandits with low-rank structure," in International Conference on Machine Learning, 2019, pp. 3163–3172.
[22] Y. Lu, A. Meisami, and A. Tewari, "Low-rank generalized linear bandit problems," arXiv preprint arXiv:2006.02948, 2020.
[23] J. Yang, W. Hu, J. D. Lee, and S. S. Du, "Impact of representation learning in linear bandits," in International Conference on Learning Representations, 2021.
[24] M. G. Azar, A. Lazaric, and E. Brunskill, "Sequential transfer in multi-armed bandit with finite set of models," in Advances in Neural Information Processing Systems, 2013, pp. 2220–2228.
[25] V. Dani, T. P. Hayes, and S. M. Kakade, "Stochastic linear optimization under bandit feedback," 2008.
[26] J. Hu, X. Chen, C. Jin, L. Li, and L. Wang, "Near-optimal representation learning for linear bandits and linear RL," arXiv preprint arXiv:2102.04132, 2021.
[27] P. Rusmevichientong and J. N. Tsitsiklis, "Linearly parameterized bandits," Mathematics of Operations Research, vol. 35, no. 2, pp. 395–411, 2010.
[28] Y. Li, Y. Wang, X. Chen, and Y. Zhou, "Tight regret bounds for infinite-armed linear contextual bandits," in International Conference on Artificial Intelligence and Statistics, 2021, pp. 370–378.
[29] Y. Qin, T. Menara, S. Oymak, S. Ching, and F. Pasqualetti, "Non-stationary representation learning in sequential linear bandits," arXiv preprint arXiv:2201.04805, 2022.