Fugu-MT Paper Translation (Abstract): What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains

Paper summary: What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains

arxiv url: http://arxiv.org/abs/2508.07208v1
Date: Sun, 10 Aug 2025 07:03:01 GMT
Status: Translation complete
Last updated in system: 2025-08-12 21:23:28.75403
Title: What One Cannot, Two Can: Two-Layer Transformers Provably Represent Induction Heads on Any-Order Markov Chains
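Title (reference translation): Two-layer transformers can represent induction heads on any Markov chain
Authors: Chanakya Ekbote, Marco Bondaschi, Nived Rajaraman, Jason D. Lee, Michael Gastpar, Ashok Vardhan Makkuva, Paul Pu Liang
Abstract summary: In-context learning (ICL) is a capability of transformers through which trained models learn to adapt to new tasks by leveraging information from the input context. The paper shows that a two-layer transformer with one head per layer can indeed represent any conditional k-gram.
Reference score (independently computed attention metric): 64.31313691823088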
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In-context learning (ICL) is a hallmark capability of transformers, through which trained models learn to adapt to new tasks by leveraging information from the input context. Prior work has shown that ICL emerges in transformers due to the presence of special circuits called induction heads. Given the equivalence between induction heads and conditional k-grams, a recent line of work modeling sequential inputs as Markov processes has revealed the fundamental impact of model depth on its ICL capabilities: while a two-layer transformer can efficiently represent a conditional 1-gram model, its single-layer counterpart cannot solve the task unless it is exponentially large. However, for higher order Markov sources, the best known constructions require at least three layers (each with a single attention head) - leaving open the question: can a two-layer single-head transformer represent any kth-order Markov process? In this paper, we precisely address this and theoretically show that a two-layer transformer with one head per layer can indeed represent any conditional k-gram. Thus, our result provides the tightest known characterization of the interplay between transformer depth and Markov order for ICL. Building on this, we further analyze the learning dynamics of our two-layer construction, focusing on a simplified variant for first-order Markov chains, illustrating how effective in-context representations emerge during training. Together, these results deepen our current understanding of transformer-based ICL and illustrate how even shallow architectures can surprisingly exhibit strong ICL capabilities on structured sequence modeling tasks.
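Abstract (reference translation): In-context learning (ICL) is a hallmark capability of transformers, through which trained models learn to adapt to new tasks by leveraging information from the input context. Prior work has shown that ICL emerges in transformers because of special circuits called induction heads. Given the equivalence between induction heads and conditional k-grams, a recent line of work that models sequential inputs as Markov processes has revealed the impact of model depth on ICL capability. However, for higher-order Markov sources, the best known constructions require at least three layers (each with a single attention head). This paper addresses this precisely and shows theoretically that a two-layer transformer with one head per layer can represent any conditional k-gram. The result provides the tightest known characterization of the interplay between transformer depth and Markov order for ICL. Building on this, the paper further analyzes the learning dynamics of the two-layer construction, focusing on a simplified variant for first-order Markov chains, and examines how effective in-context representations emerge during training. These results deepen the current understanding of transformer-based ICL and show that even shallow architectures can exhibit surprisingly strong ICL capabilities on structured sequence modeling tasks.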
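
The conditional k-gram that the abstract equates with an induction head can be illustrated with a minimal sketch. It estimates the next-symbol distribution by counting which symbols followed earlier occurrences of the last k symbols in the same sequence. The function name `conditional_kgram`, the integer symbol encoding, and the add-constant smoothing are assumptions made for illustration; this is a plain counting estimator, not the paper's transformer construction.

```python
from collections import Counter


def conditional_kgram(seq, k, vocab_size, smoothing=1.0):
    """Estimate P(next symbol | last k symbols) from the sequence itself.

    Hypothetical helper: symbols are assumed to be ints in [0, vocab_size),
    and `smoothing` is an assumed add-constant prior, not from the paper.
    """
    if k < 1 or len(seq) < k:
        raise ValueError("need k >= 1 and at least k symbols of context")

    context = tuple(seq[-k:])  # the k most recent symbols
    counts = Counter()

    # Look at every earlier position whose preceding k symbols match the
    # current context, and record the symbol that followed.
    for i in range(k, len(seq)):
        if tuple(seq[i - k:i]) == context:
            counts[seq[i]] += 1

    total = sum(counts.values()) + smoothing * vocab_size
    return [(counts[s] + smoothing) / total for s in range(vocab_size)]


# First-order example: every earlier 0 was followed by a 1.
print(conditional_kgram([0, 1, 0, 1, 0, 1, 0], k=1, vocab_size=2))
# prints [0.2, 0.8]
```

With k=1 this reduces to the first-order case discussed in the abstract; larger k conditions on longer contexts, which is the higher-order Markov setting the paper targets.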


Related papers