Fugu-MT Paper Translation (Summary): nnterp: A Standardized Interface for Mechanistic Interpretability of Transformers

Paper summary: nnterp: A Standardized Interface for Mechanistic Interpretability of Transformers

arxiv url: http://arxiv.org/abs/2511.14465v1
Date: Tue, 18 Nov 2025 13:05:02 GMT
Status: Translation complete
Last updated in system: 2025-11-19 16:23:53.124877
Title: nnterp: A Standardized Interface for Mechanistic Interpretability of Transformers
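Title (reference translation): nnterp: A Standardized Interface for Mechanistic Interpretability of Transformers
Authors: Clément Dumas
Abstract summary: nnterp is a lightweight wrapper around NNsight for transformer analysis. It provides a unified interface for transformer analysis while preserving the original HuggingFace implementations.
Reference score (independently computed attention level): 1.0152838128195467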
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Mechanistic interpretability research requires reliable tools for analyzing transformer internals across diverse architectures. Current approaches face a fundamental tradeoff: custom implementations like TransformerLens ensure consistent interfaces but require coding a manual adaptation for each architecture, introducing numerical mismatch with the original models, while direct HuggingFace access through NNsight preserves exact behavior but lacks standardization across models. To bridge this gap, we develop nnterp, a lightweight wrapper around NNsight that provides a unified interface for transformer analysis while preserving original HuggingFace implementations. Through automatic module renaming and comprehensive validation testing, nnterp enables researchers to write intervention code once and deploy it across 50+ model variants spanning 16 architecture families. The library includes built-in implementations of common interpretability methods (logit lens, patchscope, activation steering) and provides direct access to attention probabilities for models that support it. By packaging validation tests with the library, researchers can verify compatibility with custom models locally. nnterp bridges the gap between correctness and usability in mechanistic interpretability tooling.
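Abstract (reference translation): Mechanistic interpretability research requires reliable tools for analyzing transformer internals across a variety of architectures. Custom implementations such as TransformerLens guarantee a consistent interface but require hand-coding an adaptation for each architecture, which introduces numerical mismatch with the original models; direct HuggingFace access through NNsight preserves exact behavior but lacks standardization across models. To bridge this gap, we develop nnterp, a lightweight wrapper around NNsight that provides a unified interface for transformer analysis while preserving the original HuggingFace implementations. Through automatic module renaming and comprehensive validation testing, nnterp lets researchers write intervention code once and run it across more than 50 model variants spanning 16 architecture families. The library includes built-in implementations of common interpretability methods (logit lens, patchscope, activation steering) and provides direct access to attention probabilities for models that support it. Because the validation tests are packaged with the library, researchers can verify compatibility with custom models locally. nnterp bridges the gap between correctness and usability in mechanistic interpretability tooling.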
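The "automatic module renaming" idea in the abstract can be illustrated with a toy sketch. The code below does not use nnterp or NNsight and is not their API: the class name `StandardizedView`, the helper `resolve`, and the candidate paths in `LAYER_PATHS` are hypothetical names chosen for illustration. It only shows the general pattern of exposing differently named submodules under one standard name while leaving the wrapped objects untouched.

```python
from types import SimpleNamespace

# Hypothetical candidate locations of the transformer blocks in different
# architectures. Illustrative assumption, not nnterp's actual rename table.
LAYER_PATHS = ("model.layers", "transformer.h", "gpt_neox.layers")


def resolve(obj, dotted_path):
    """Follow a dotted attribute path; return None if any hop is missing."""
    for name in dotted_path.split("."):
        obj = getattr(obj, name, None)
        if obj is None:
            return None
    return obj


class StandardizedView:
    """Expose a model's blocks as `layers` without copying or re-implementing them."""

    def __init__(self, model):
        self.model = model
        for path in LAYER_PATHS:
            found = resolve(model, path)
            if found is not None:
                self.layers = found       # the very same object the model holds
                self.source_path = path   # remembered for debugging
                return
        raise ValueError("no known layer path matched this model")


def count_layers(view):
    """Analysis code written once against the standardized name."""
    return len(view.layers)


# Two toy "architectures" that store their blocks under different names.
llama_like = SimpleNamespace(model=SimpleNamespace(layers=["block0", "block1"]))
gpt2_like = SimpleNamespace(
    transformer=SimpleNamespace(h=["block0", "block1", "block2"])
)
```

Calling `count_layers(StandardizedView(llama_like))` and `count_layers(StandardizedView(gpt2_like))` runs the same function on both toy models. Because `layers` is an alias and not a copy, the wrapped model's own objects are what the analysis code touches, which mirrors the abstract's point about preserving the original implementation.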

