Fugu-MT Paper Translation (Summary): Unveiling Decision-Making in LLMs for Text Classification : Extraction of influential and interpretable concepts with Sparse Autoencoders

Paper summary: Unveiling Decision-Making in LLMs for Text Classification : Extraction of influential and interpretable concepts with Sparse Autoencoders

arxiv url: http://arxiv.org/abs/2506.23951v1
Date: Mon, 30 Jun 2025 15:18:50 GMT
Status: Translation complete
Last updated in system: 2025-07-01 21:27:54.119757
Title: Unveiling Decision-Making in LLMs for Text Classification : Extraction of influential and interpretable concepts with Sparse Autoencoders
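Title (reference translation): The Decision-Making Process of LLMs for Text Classification: Extracting Influential and Interpretable Concepts with Sparse Autoencoders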
Authors: Mathis Le Bail, Jérémie Dentan, Davide Buscaldi, Sonia Vanier
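Abstract summary: This paper proposes a novel SAE architecture tailored for text classification. We compare this architecture with established methods such as ConceptShap, Independent Component Analysis, and other SAE-based concept extraction techniques. Our architecture improves both the causality and interpretability of the extracted features.
Reference score (independently computed attention score): 0.0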
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sparse Autoencoders (SAEs) have been successfully used to probe Large Language Models (LLMs) and extract interpretable concepts from their internal representations. These concepts are linear combinations of neuron activations that correspond to human-interpretable features. In this paper, we investigate the effectiveness of SAE-based explainability approaches for sentence classification, a domain where such methods have not been extensively explored. We present a novel SAE-based architecture tailored for text classification, leveraging a specialized classifier head and incorporating an activation rate sparsity loss. We benchmark this architecture against established methods such as ConceptShap, Independent Component Analysis, and other SAE-based concept extraction techniques. Our evaluation covers two classification benchmarks and four fine-tuned LLMs from the Pythia family. We further enrich our analysis with two novel metrics for measuring the precision of concept-based explanations, using an external sentence encoder. Our empirical results show that our architecture improves both the causality and interpretability of the extracted features.
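Abstract (reference translation): Sparse autoencoders (SAEs) have been used successfully to probe Large Language Models (LLMs) and extract interpretable concepts from their internal representations. These concepts are linear combinations of neuron activations that correspond to human-interpretable features. This paper examines the effectiveness of SAE-based explainability methods for sentence classification. We propose a novel SAE-based architecture tailored for text classification. We compare this architecture with established methods such as ConceptShap, Independent Component Analysis, and other SAE-based concept extraction techniques. The evaluation covers two classification benchmarks and four fine-tuned LLMs from the Pythia family. We further enrich the analysis with two novel metrics, computed with an external sentence encoder, for measuring the precision of concept-based explanations. Our architecture improves both the causality and interpretability of the extracted features.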
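The abstract names three ingredients (an SAE over internal representations, a specialized classifier head, and an activation rate sparsity loss) but gives no formulas. As a rough illustration only, the NumPy sketch below shows one way such pieces could fit together. The function names, tensor shapes, ReLU encoder, hinge-style rate penalty, and `target_rate` are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, W_clf, b_clf):
    """Forward pass of a toy sparse autoencoder with a linear classifier head."""
    z = np.maximum(0.0, x @ W_enc + b_enc)  # non-negative concept activations
    x_hat = z @ W_dec + b_dec               # reconstruction of the hidden states
    logits = z @ W_clf + b_clf              # class scores computed from concepts only
    return z, x_hat, logits

def sae_losses(x, y, z, x_hat, logits, target_rate=0.1):
    """Return (reconstruction, activation-rate, classification) loss terms."""
    recon = np.mean((x - x_hat) ** 2)
    # Fraction of the batch on which each concept fires. A hard count like this
    # is not differentiable, so real training would need a smooth surrogate.
    rate = np.mean(z > 0, axis=0)
    # Assumed penalty: only concepts firing more often than target_rate are charged.
    rate_penalty = np.mean(np.maximum(0.0, rate - target_rate))
    # Softmax cross-entropy, shifted for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    clf = -np.mean(log_probs[np.arange(len(y)), y])
    return recon, rate_penalty, clf

# Random stand-ins for sentence-level hidden states and labels (hypothetical sizes).
rng = np.random.default_rng(0)
n, d, k, c = 32, 16, 64, 2  # batch, hidden size, number of concepts, classes
x = rng.normal(size=(n, d))
y = rng.integers(0, c, size=n)
W_enc, b_enc = rng.normal(scale=0.1, size=(d, k)), np.zeros(k)
W_dec, b_dec = rng.normal(scale=0.1, size=(k, d)), np.zeros(d)
W_clf, b_clf = rng.normal(scale=0.1, size=(k, c)), np.zeros(c)

z, x_hat, logits = sae_forward(x, W_enc, b_enc, W_dec, b_dec, W_clf, b_clf)
recon, rate_penalty, clf = sae_losses(x, y, z, x_hat, logits)
print(recon, rate_penalty, clf)
```

In this sketch each column of `W_enc` defines one concept as a linear combination of hidden-state dimensions, matching the abstract's description of concepts, and the classifier head reads only the concept activations `z`, so a prediction can be traced back to individual concepts.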

Related papers

Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders [50.52694757593443]
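Existing SAE training algorithms often lack rigorous mathematical guarantees and suffer from practical limitations. We first propose a statistical framework for the feature recovery problem that includes a new notion of feature identifiability. We then propose a new SAE training algorithm based on "bias adaptation," a technique that adaptively adjusts the neural network's bias parameters to ensure appropriate activation sparsity.
Paper reference translation (metadata) (2025-06-16T20:58:05Z)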
Decoding Dense Embeddings: Sparse Autoencoders for Interpreting and Discretizing Dense Retrieval [13.31210969917096]
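This paper proposes a novel interpretability framework for Dense Passage Retrieval (DPR) models. We generate natural-language descriptions for each latent concept, enabling human interpretation of both the DPR models' dense embeddings and the query-document similarity scores. Concept-Level Sparse Retrieval (CL-SR) achieves high index-space and computational efficiency while maintaining robust performance across vocabulary and semantic mismatches.
Paper reference translation (metadata) (2025-05-28T02:50:17Z)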
Unveiling Instruction-Specific Neurons & Experts: An Analytical Framework for LLM's Instruction-Following Capabilities [12.065600268467556]
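Fine-tuning of Large Language Models (LLMs) has significantly improved their instruction-following capabilities. This work examines how fine-tuning affects LLM computations by isolating and analyzing instruction-specific sparse components.
Paper reference translation (metadata) (2025-05-27T13:40:28Z)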
Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey [64.08485471150486]
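This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings. We systematically reviewed nearly 250 scholarly sources, capturing the state of the art from a variety of publication venues.
Paper reference translation (metadata) (2025-03-28T14:08:40Z)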
Integration of Explainable AI Techniques with Large Language Models for Enhanced Interpretability for Sentiment Analysis [0.5120567378386615]
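Interpretability is important in sentiment analysis with large language models (LLMs). This work proposes a technique that applies SHAP (Shapley Additive Explanations) by decomposing an LLM into components such as the embedding layer, encoder, decoder, and attention layer. The method is evaluated on the Stanford Sentiment Treebank (SST-2) dataset and shows how different sentences affect different layers.
Paper reference translation (metadata) (2025-03-15T01:37:54Z)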
Disentangling Dense Embeddings with Sparse Autoencoders [0.0]
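Sparse autoencoders (SAEs) have shown promise for extracting interpretable features from complex neural networks. We present one of the first applications of SAEs to dense text embeddings from large language models. The results show that they preserve semantic fidelity while offering interpretability.
Paper reference translation (metadata) (2024-08-01T15:46:22Z)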
Bidirectional Trained Tree-Structured Decoder for Handwritten Mathematical Expression Recognition [51.66383337087724]
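Handwritten Mathematical Expression Recognition (HMER) is an important branch of the OCR field. Recent studies have shown that incorporating bidirectional context information significantly improves the performance of HMER models. This paper proposes the MF-SLT and a bidirectional asynchronous training (BAT) structure.
Paper reference translation (metadata) (2023-12-31T09:24:21Z)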
A Recursive Bateson-Inspired Model for the Generation of Semantic Formal Concepts from Spatial Sensory Data [77.34726150561087]
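This paper proposes a symbolic-only method for generating hierarchical structures from complex sensory data. The approach is based on Bateson's notion of difference as the key to the genesis of an idea or a concept. The model can produce fairly rich yet human-readable conceptual representations without training.
Paper reference translation (metadata) (2023-07-16T15:59:13Z)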
Weakly-Supervised Aspect-Based Sentiment Analysis via Joint Aspect-Sentiment Topic Embedding [71.2260967797055]
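We propose a weakly-supervised approach for aspect-based sentiment analysis. We learn <sentiment, aspect> joint topic embeddings in the word embedding space. We then generalize the word-level discriminative information using a neural network.
Paper reference translation (metadata) (2020-10-13T21:33:24Z)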
A Diagnostic Study of Explainability Techniques for Text Classification [52.879658637466605]
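We develop a list of diagnostic properties for evaluating existing explainability techniques. We then compare the saliency scores assigned by the explainability techniques with human annotations of salient input regions, to find relations between a model's performance and the agreement of its rationales with human ones.
Paper reference translation (metadata) (2020-09-25T12:01:53Z)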

The related papers list is generated automatically from the titles and abstracts of papers on this site.
