Fugu-MT Paper Translation (Abstract): Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models

Paper overview: Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models

arxiv url: http://arxiv.org/abs/2310.17086v1
Date: Thu, 26 Oct 2023 01:08:47 GMT
Status: Translation complete
Last updated in system: 2023-10-27 22:44:08.488467
Title: Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models
Authors: Deqing Fu, Tian-Qi Chen, Robin Jia, Vatsal Sharan
Abstract summary: We show that Transformers learn to implement higher-order optimization methods to perform in-context learning. Transformers can also learn in-context on ill-conditioned data, a setting where Gradient Descent struggles but Iterative Newton succeeds.
Reference score (independently computed attention score): 26.15757039132891
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Transformers are remarkably good at in-context learning (ICL) -- learning from demonstrations without parameter updates -- but how they perform ICL remains a mystery. Recent work suggests that Transformers may learn in-context by internally running Gradient Descent, a first-order optimization method. In this paper, we instead demonstrate that Transformers learn to implement higher-order optimization methods to perform ICL. Focusing on in-context linear regression, we show that Transformers learn to implement an algorithm very similar to Iterative Newton's Method, a higher-order optimization method, rather than Gradient Descent. Empirically, we show that predictions from successive Transformer layers closely match different iterations of Newton's Method linearly, with each middle layer roughly computing 3 iterations. In contrast, exponentially more Gradient Descent steps are needed to match an additional Transformer layer; this suggests that Transformers have a comparable rate of convergence with high-order methods such as Iterative Newton, which are exponentially faster than Gradient Descent. We also show that Transformers can learn in-context on ill-conditioned data, a setting where Gradient Descent struggles but Iterative Newton succeeds. Finally, we show theoretical results which support our empirical findings and have a close correspondence with them: we prove that Transformers can implement $k$ iterations of Newton's method with $\mathcal{O}(k)$ layers.
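The contrast the abstract draws between Gradient Descent and Iterative Newton on ill-conditioned linear regression can be illustrated with a small sketch. This is not the paper's code. It assumes that "Iterative Newton" can be modeled by a Newton-Schulz style iteration M <- 2M - M S M that approximates the inverse of S = X^T X, with an initialization M_0 = S / ||S||_inf^2 chosen here only because it guarantees convergence. All function names, the toy data, the step size, and the step count are hypothetical, and plain Python lists are used to keep the example dependency-free.

```python
def transpose(A):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, v):
    """Multiply a matrix by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def max_abs_error(w, w_ref):
    """Largest coordinate-wise gap between two weight vectors."""
    return max(abs(a - b) for a, b in zip(w, w_ref))

def newton_regression(X, y, steps):
    """Least squares via M <- 2M - M S M, which drives M toward inverse(S)."""
    Xt = transpose(X)
    S = matmul(Xt, X)
    # Assumed initialization: S is symmetric, so its largest eigenvalue is at
    # most its maximum absolute row sum; scaling by that norm squared keeps
    # every eigenvalue of M_0 S in (0, 1], which makes the iteration converge.
    norm_inf = max(sum(abs(v) for v in row) for row in S)
    M = [[v / norm_inf ** 2 for v in row] for row in S]
    for _ in range(steps):
        MSM = matmul(matmul(M, S), M)
        M = [[2 * m - c for m, c in zip(row_m, row_c)] for row_m, row_c in zip(M, MSM)]
    return matvec(M, matvec(Xt, y))

def gradient_descent_regression(X, y, steps):
    """Least squares via plain gradient descent on 0.5 * ||Xw - y||^2, from w = 0."""
    Xt = transpose(X)
    S = matmul(Xt, X)
    # Conservative step size: 1 / (upper bound on the largest eigenvalue of S).
    lr = 1.0 / max(sum(abs(v) for v in row) for row in S)
    w = [0.0] * len(S)
    for _ in range(steps):
        residual = [p - t for p, t in zip(matvec(X, w), y)]
        grad = matvec(Xt, residual)
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

# Toy problem (hypothetical): nearly collinear columns make X^T X ill-conditioned.
X = [[1.0, 1.0], [1.0, 1.1]]
w_true = [2.0, -1.0]
y = matvec(X, w_true)

steps = 30
newton_err = max_abs_error(newton_regression(X, y, steps), w_true)
gd_err = max_abs_error(gradient_descent_regression(X, y, steps), w_true)
print(f"Newton-style error after {steps} steps: {newton_err:.3e}")
print(f"Gradient descent error after {steps} steps: {gd_err:.3e}")
```

In the Newton-style iteration the residual I - S M is squared at every step, so the number of correct digits roughly doubles per iteration once the method is close. Gradient descent with a fixed step size only shrinks the error along the smallest eigen-direction by a constant factor close to 1 per step. This is one way to read the abstract's statement that exponentially more Gradient Descent steps are needed to match the same progress.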

Related papers