Fugu-MT Paper Translation (Summary): The Amplifying Mirror: Locating and Steering the Partisan Direction inside a Large Language Model

Paper summary: The Amplifying Mirror: Locating and Steering the Partisan Direction inside a Large Language Model

arxiv url: http://arxiv.org/abs/2606.08792v1
Date: Sun, 07 Jun 2026 19:17:02 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:06.450989
Title: The Amplifying Mirror: Locating and Steering the Partisan Direction inside a Large Language Model
Authors: Wendy K. Tam
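Abstract summary: We show that partisan political identity is encoded in the model's activation space. Using 190,491 tweets from sitting members of the U.S. Congress as labeled training data, we train linear probes on the hidden states of the Llama 3.1 8B Instruct model.
Reference score (attention metric computed independently by this site): 0.0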
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models are rapidly replacing search engines as the primary interface between people and information. Unlike search engines, which retrieve existing content, LLMs generate novel text shaped by internal representations learned during training. Here we show that partisan political identity is encoded in the model's activation space, and that this direction directly shapes generation. Using 190,491 tweets from sitting members of the U.S. Congress as labeled training data, we train linear probes on the hidden states of the Llama 3.1 8B Instruct model. We identify a single geometric axis at layer 18 that separates Republican from Democratic text with an AUC of 0.945 and a Cohen's d of 1.94, and use sparse autoencoders to decompose that axis into interpretable partisan features. Causally intervening along this axis, ablating or amplifying the partisan component mid-generation, produces systematic shifts in the model's output. We witness stance reversals, register shifting, and structured fabrications of authority. Our results demonstrate that partisan bias in language models is not a vague emergent property but a learned geometric feature that can be precisely located and steered. Partisan bias is not a bug to be patched, but a structural property of how these models encode information about their users. As LLMs displace search engines as the interface to knowledge, understanding that product design (and its consequences) will be essential for navigating the legal, social, and political transitions from an information ecosystem that is curated to one that is generated.
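
The pipeline the abstract describes (fit a linear direction on hidden states, score its separation with AUC and Cohen's d, then ablate or amplify the component along that direction) can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: synthetic Gaussian vectors stand in for layer-18 hidden states, a simple difference-of-means direction stands in for whatever probe the paper actually trains, and the function names (make_data, fit_direction, auc, cohens_d, steer) are hypothetical. It does not reproduce the paper's data, model, or results.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64                                  # stand-in for the hidden-state width
true_dir = rng.normal(size=DIM)
true_dir /= np.linalg.norm(true_dir)      # hypothetical "partisan" axis

def make_data(n_per_class, shift=2.0):
    """Synthetic 'hidden states': class 1 is shifted along true_dir."""
    X = rng.normal(size=(2 * n_per_class, DIM))
    y = np.repeat([0, 1], n_per_class)
    X[y == 1] += shift * true_dir
    return X, y

def fit_direction(X, y):
    """Difference-of-means probe: unit vector from class-0 mean to class-1 mean."""
    w = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return w / np.linalg.norm(w)

def auc(scores, y):
    """AUC via the Mann-Whitney rank statistic (assumes no tied scores)."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n1 = int((y == 1).sum())
    n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

def cohens_d(scores, y):
    """Standardized mean difference using the pooled standard deviation."""
    a, b = scores[y == 1], scores[y == 0]
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (
        len(a) + len(b) - 2
    )
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def steer(H, w, alpha):
    """Rescale the component of each row of H along unit vector w.

    alpha = 0 ablates the component, alpha = 1 leaves H unchanged,
    alpha > 1 amplifies it.
    """
    proj = H @ w
    return H + (alpha - 1.0) * np.outer(proj, w)

# Fit on one sample, evaluate on a held-out sample.
X_train, y_train = make_data(500)
X_test, y_test = make_data(500)
w = fit_direction(X_train, y_train)

test_scores = X_test @ w
probe_auc = auc(test_scores, y_test)
probe_d = cohens_d(test_scores, y_test)

ablated = steer(X_test, w, alpha=0.0)     # remove the component along w
amplified = steer(X_test, w, alpha=2.0)   # double the component along w

print(f"AUC = {probe_auc:.3f}, Cohen's d = {probe_d:.2f}")
```

In a real setting the rows of X would be hidden states extracted from the model, and steer would be applied to the residual stream during generation rather than to a fixed matrix; the sketch only shows the geometry of projecting onto, removing, and scaling a single direction.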
