Fugu-MT Paper Translation (Abstract): Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders

Paper summary: Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders

arxiv url: http://arxiv.org/abs/2512.08892v1
Date: Tue, 09 Dec 2025 18:33:22 GMT
Status: Translation complete
Last updated in system: 2025-12-10 22:28:08.091449
Title: Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders
Title (reference translation): Toward improved reliability using sparse autoencoders
Authors: Guangzhi Xiong, Zhenghao He, Bohan Liu, Sanchit Sinha, Aidong Zhang
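Abstract summary: Retrieval-Augmented Generation (RAG) improves the factuality of large language models (LLMs). Existing hallucination detection methods for RAG often rely on large-scale detector training. RAGLens is a lightweight hallucination detector that accurately flags unfaithful RAG outputs.
Reference score (attention score computed independently by this site): 39.5490415037017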
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Retrieval-Augmented Generation (RAG) improves the factuality of large language models (LLMs) by grounding outputs in retrieved evidence, but faithfulness failures, where generations contradict or extend beyond the provided sources, remain a critical challenge. Existing hallucination detection methods for RAG often rely either on large-scale detector training, which requires substantial annotated data, or on querying external LLM judges, which leads to high inference costs. Although some approaches attempt to leverage internal representations of LLMs for hallucination detection, their accuracy remains limited. Motivated by recent advances in mechanistic interpretability, we employ sparse autoencoders (SAEs) to disentangle internal activations, successfully identifying features that are specifically triggered during RAG hallucinations. Building on a systematic pipeline of information-based feature selection and additive feature modeling, we introduce RAGLens, a lightweight hallucination detector that accurately flags unfaithful RAG outputs using LLM internal representations. RAGLens not only achieves superior detection performance compared to existing methods, but also provides interpretable rationales for its decisions, enabling effective post-hoc mitigation of unfaithful RAG. Finally, we justify our design choices and reveal new insights into the distribution of hallucination-related signals within LLMs. The code is available at https://github.com/Teddy-XiongGZ/RAGLens.
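Abstract (reference translation): Retrieval-Augmented Generation (RAG) improves the factuality of large language models (LLMs) by grounding outputs in retrieved evidence, but faithfulness failures, in which generations contradict or go beyond the provided sources, remain an important challenge. Existing hallucination detection methods for RAG often rely either on large-scale detector training, which requires large amounts of annotated data, or on querying external LLM judges, which makes inference costly. Some approaches try to use the internal representations of LLMs for hallucination detection, but their accuracy is limited. Drawing on recent progress in mechanistic interpretability, the authors use sparse autoencoders (SAEs) to disentangle internal activations and succeed in identifying features that are triggered specifically during RAG hallucinations. Building on a systematic pipeline of information-based feature selection and additive feature modeling, they introduce RAGLens, a lightweight hallucination detector that accurately flags unfaithful RAG outputs using LLM internal representations. RAGLens not only achieves better detection performance than existing methods but also provides interpretable rationales for its decisions, enabling effective post-hoc mitigation of unfaithful RAG. Finally, the authors justify their design choices and report new insights into how hallucination-related signals are distributed within LLMs. The code is available at https://github.com/Teddy-XiongGZ/RAGLens.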
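The abstract names the detection pipeline only at a high level: information-based feature selection followed by additive feature modeling. The toy sketch below illustrates that general shape and is not the RAGLens implementation. It makes several assumptions: binary "feature active" indicators stand in for SAE activations, mutual information is used as the information-based criterion, and a naive-Bayes-style sum of per-feature log-odds stands in for the additive model. All function names and the synthetic data are hypothetical.

```python
import math
import random


def mutual_information(xs, ys):
    """Mutual information (in nats) between two binary sequences."""
    n = len(xs)
    mi = 0.0
    for xv in (0, 1):
        for yv in (0, 1):
            joint = sum(1 for x, y in zip(xs, ys) if x == xv and y == yv) / n
            px = sum(1 for x in xs if x == xv) / n
            py = sum(1 for y in ys if y == yv) / n
            if joint > 0:  # zero-probability cells contribute nothing
                mi += joint * math.log(joint / (px * py))
    return mi


def select_features(X, y, k):
    """Rank feature columns by mutual information with the label; keep the top k."""
    n_features = len(X[0])
    scores = [mutual_information([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


def fit_additive(X, y, features):
    """Fit one log-odds contribution per (feature, value), with Laplace smoothing."""
    pos = sum(y)
    neg = len(y) - pos
    bias = math.log((pos + 1) / (neg + 1))
    weights = {}
    for j in features:
        contrib = {}
        for v in (0, 1):
            n1 = sum(1 for row, lab in zip(X, y) if lab == 1 and row[j] == v)
            n0 = sum(1 for row, lab in zip(X, y) if lab == 0 and row[j] == v)
            contrib[v] = math.log(((n1 + 1) / (pos + 2)) / ((n0 + 1) / (neg + 2)))
        weights[j] = contrib
    return bias, weights


def score(row, bias, weights):
    """Additive score: bias plus per-feature terms. Positive means 'flag as unfaithful'."""
    return bias + sum(contrib[row[j]] for j, contrib in weights.items())


def make_toy_data(n=400, n_features=8, informative=(2, 5), seed=0):
    """Synthetic stand-in for SAE features: two columns track the label, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)  # 1 = hallucinated output (toy label)
        row = [1 if rng.random() < 0.1 else 0 for _ in range(n_features)]
        for j in informative:
            row[j] = 1 if rng.random() < (0.8 if label else 0.05) else 0
        X.append(row)
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    chosen = select_features(X, y, k=2)
    bias, weights = fit_additive(X, y, chosen)
    preds = [1 if score(row, bias, weights) > 0 else 0 for row in X]
    accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
    print("selected features:", sorted(chosen))
    print("training accuracy:", round(accuracy, 3))
    print("per-feature contributions:", weights)
```

Because the score is a plain sum of per-feature terms, each flagged output can be traced back to the features that pushed the score up, which is the kind of interpretable rationale the abstract attributes to an additive design.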

Related papers