Fugu-MT Paper Translation (Abstract): Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have)

Paper overview: Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have)

arxiv url: http://arxiv.org/abs/2605.22005v1
Date: Thu, 21 May 2026 05:02:51 GMT
Status: Translation complete
Last updated in system: 2026-05-22 16:35:42.101266
Title: Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have)
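Title (reference translation): Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned
Authors: Hisashi Miyashita
Abstract summary: We show that singular value decomposition of the lm_head weight matrix of a transformer-based large language model reveals interpretable semantic subspaces directly from the model weights. Analysing GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B, we find that singular value spectra and vocabulary cluster structures differ systematically across models.
Reference score (independently calculated attention level): 0.0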
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We show that singular value decomposition of the lm_head weight matrix of a transformer-based large language model -- requiring only five lines of PyTorch and no model inference -- reveals interpretable semantic subspaces directly from the model weights. Each left singular vector identifies the vocabulary tokens most readily selected when the hidden state aligns with the corresponding singular direction; inspecting these clusters exposes the model's training data composition and curation philosophy. Analysing GPT-OSS-120B, Gemma-2-2B, and Qwen2.5-1.5B, we find that singular value spectra and vocabulary cluster structures differ systematically across models: GPT exhibits a graduated hierarchy of functionally differentiated subspaces; Gemma is dominated by pre-nineteenth-century English orthography, forming a stepwise clustering structure that may contribute to high output controllability; and Qwen exhibits broad multilingual coverage alongside subspaces whose vocabulary the authors have determined to be ethically inappropriate for direct publication. Base-instruct comparison reveals that ethically concerning subspaces originate in pretraining and are not removed by post-training alignment. We introduce the Vocabulary Cluster Score (VCS) to quantify subspace coherence, and the Weighted Projection Score (WPS) as a static glitch token detector; applying WPS to GPT-OSS-120B recovers shokubutsu-hyakka-tsu (ID 137606), a well-known glitch token widely reported in the CJK language community, without any model inference. We propose a taxonomy of root causes for problematic vocabulary content and call for lm_head SVD analysis to be adopted as a standard pre-release safety auditing step. Our findings further suggest directions toward SVD-guided tokenizer optimisation and more controllable LLM design.
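Abstract (reference translation): Singular value decomposition of the lm_head weight matrix of a transformer-based large language model, which requires only five lines of PyTorch and no model inference, reveals interpretable semantic subspaces directly from the model weights. Each left singular vector identifies the vocabulary tokens most readily selected when the hidden state aligns with the corresponding singular direction, and inspecting these clusters exposes the model's training data composition and curation philosophy. GPT exhibits a graduated hierarchy of functionally differentiated subspaces; Gemma is dominated by pre-nineteenth-century English orthography, forming a stepwise clustering structure that may contribute to high output controllability; and Qwen exhibits broad multilingual coverage alongside subspaces whose vocabulary the authors judged ethically inappropriate for direct publication. Base-instruct comparison shows that the ethically concerning subspaces originate in pretraining and are not removed by post-training alignment. The paper introduces the Vocabulary Cluster Score (VCS) to quantify subspace coherence and the Weighted Projection Score (WPS) as a static glitch token detector; applying WPS to GPT-OSS-120B recovers shokubutsu-hyakka-tsu (ID 137606), a well-known glitch token widely reported in the CJK language community, without any model inference. It also proposes a taxonomy of root causes for problematic vocabulary content and recommends lm_head SVD analysis as a standard pre-release safety auditing step. The findings further suggest directions toward SVD-guided tokenizer optimisation and more controllable LLM design.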
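
The core idea in the abstract, decomposing the output projection and reading off the vocabulary entries that dominate each left singular vector, can be illustrated on a toy matrix. The following is only a minimal sketch under stated assumptions, not the paper's five-line PyTorch code: it uses NumPy instead of PyTorch, a hypothetical 12-token vocabulary, and a synthetic weight matrix with two planted token clusters in place of a real lm_head. Tokens are ranked by the magnitude of their entry because the sign of a singular vector is arbitrary; that ranking is a simplification chosen for this sketch. It does not implement VCS or WPS, whose definitions are not given in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary (12 tokens) and hidden size; a real lm_head
# would have shape (vocab_size, d_model) with far larger dimensions.
vocab = ["cat", "dog", "fox", "owl",
         "red", "blue", "green", "pink",
         "the", "of", "and", "to"]
d_model = 8

# Two orthonormal hidden-state directions, one per planted cluster.
a = np.zeros(d_model)
a[0] = 1.0
b = np.zeros(d_model)
b[1] = 1.0

# Indicator vectors marking which tokens belong to each planted cluster.
in_animals = np.array([1.0] * 4 + [0.0] * 8)
in_colours = np.array([0.0] * 4 + [1.0] * 4 + [0.0] * 4)

# Synthetic stand-in for the lm_head weight: two rank-one components
# of different strength plus small Gaussian noise.
W = (5.0 * np.outer(in_animals, a)
     + 3.0 * np.outer(in_colours, b)
     + 0.01 * rng.standard_normal((len(vocab), d_model)))

# Thin SVD: W = U @ diag(S) @ Vt. Columns of U live in vocabulary space,
# rows of Vt live in hidden-state space.
U, S, Vt = np.linalg.svd(W, full_matrices=False)


def top_tokens(direction, k=4):
    """Return the k tokens with the largest-magnitude entries in U[:, direction]."""
    idx = np.argsort(-np.abs(U[:, direction]))[:k]
    return [vocab[i] for i in idx]


animal_cluster = top_tokens(0)   # strongest singular direction
colour_cluster = top_tokens(1)   # second singular direction

# Normalised spectrum, a simple way to compare how concentrated the
# singular values are between different matrices.
spectrum = S / S.sum()

print("direction 0:", animal_cluster)
print("direction 1:", colour_cluster)
print("leading spectrum shares:", np.round(spectrum[:3], 3))
```

With a real model, the matrix W would instead be read from the checkpoint (in Hugging Face style models this is commonly exposed as `model.lm_head.weight`, which is an assumption about the loading code rather than something stated in the abstract), and the vocabulary would come from the tokenizer.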
