Fugu-MT Paper Translation (Abstract): Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions, and Improve with Training

Paper abstract: Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions, and Improve with Training

arxiv url: http://arxiv.org/abs/2505.17120v1
Date: Wed, 21 May 2025 16:35:02 GMT
Status: Translation complete
Last updated in system: 2025-05-26 18:08:33.58373
Title: Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions, and Improve with Training
Title (reference translation): Self-Interpretability: LLMs can describe the complex internal processes that drive their decisions, and this improves with training
Authors: Dillon Plunkett, Adam Morris, Keerthi Reddy, Jorge Morales
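Abstract summary: We show that contemporary large language models (LLMs) can provide accurate, quantitative descriptions of their own internal processes. We fine-tuned GPT-4o and GPT-4o-mini to make decisions in a variety of complex contexts. We demonstrate that these LLMs can be fine-tuned to explain their decision-making more accurately.
Reference score (independently computed attention metric): 0.0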
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We have only limited understanding of how and why large language models (LLMs) respond in the ways that they do. Their neural networks have proven challenging to interpret, and we are only beginning to tease out the function of individual neurons and circuits within them. However, another path to understanding these systems is to investigate and develop their capacity to introspect and explain their own functioning. Here, we show that i) contemporary LLMs are capable of providing accurate, quantitative descriptions of their own internal processes during certain kinds of decision-making, ii) that it is possible to improve these capabilities through training, and iii) that this training generalizes to at least some degree. To do so, we fine-tuned GPT-4o and GPT-4o-mini to make decisions in a wide variety of complex contexts (e.g., choosing between condos, loans, vacations, etc.) according to randomly-generated, quantitative preferences about how to weigh different attributes during decision-making (e.g., the relative importance of natural light versus quiet surroundings for condos). We demonstrate that the LLMs can accurately report these preferences (i.e., the weights that they learned to give to different attributes during decision-making). Next, we demonstrate that these LLMs can be fine-tuned to explain their decision-making even more accurately. Finally, we demonstrate that this training generalizes: It improves the ability of the models to accurately explain what they are doing as they make other complex decisions, not just decisions they have learned to make via fine-tuning. This work is a step towards training LLMs to accurately and broadly report on their own internal processes -- a possibility that would yield substantial benefits for interpretability, control, and safety.
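Abstract (reference translation): We have only a limited understanding of how and why large language models (LLMs) respond the way they do. Their neural networks have proven difficult to interpret, and we have only begun to work out the function of the individual neurons and circuits inside them. However, another path to understanding these systems is to investigate and develop their capacity to introspect on and explain their own functioning. Here we show that (i) contemporary LLMs can give accurate, quantitative descriptions of their own internal processes during certain kinds of decision-making, (ii) these capabilities can be improved through training, and (iii) this training generalizes to at least some degree. To do so, we fine-tuned GPT-4o and GPT-4o-mini to make decisions in a variety of complex contexts (e.g., choosing between condos, loans, vacations) according to randomly generated, quantitative preferences about how to weigh different attributes during decision-making (e.g., the relative importance of natural light versus quiet surroundings for condos). We demonstrate that the LLMs can accurately report these preferences (i.e., the weights they gave to different attributes during decision-making). Next, we demonstrate that these LLMs can be fine-tuned to explain their decision-making more accurately. Finally, we demonstrate that this training generalizes: it improves the models' ability to accurately explain what they are doing when they make other complex decisions, not only the decisions they learned to make through fine-tuning. This work is a step towards training LLMs to report accurately and broadly on their own internal processes.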
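Illustrative sketch (not from the paper): the setup described in the abstract can be pictured as a weighted-sum choice rule plus a comparison between the target weights and the weights a model reports. Everything below is an assumption made for illustration: the attribute names, the weight range, the linear utility, the use of Pearson correlation as the accuracy measure, and all function names are hypothetical and are not taken from the authors' code.

```python
import math
import random


def random_weights(attributes, rng):
    """Draw one hypothetical preference weight per attribute (range is assumed)."""
    return {name: rng.uniform(-100.0, 100.0) for name in attributes}


def utility(option, weights):
    """Weighted sum of an option's attribute values (assumed linear rule)."""
    return sum(weights[name] * option[name] for name in weights)


def choose(option_a, option_b, weights):
    """Return 'A' or 'B', whichever option has the higher weighted utility."""
    return "A" if utility(option_a, weights) >= utility(option_b, weights) else "B"


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def report_accuracy(target_weights, reported_weights):
    """Score a self-report by correlating reported weights with target weights."""
    names = sorted(target_weights)
    return pearson([target_weights[n] for n in names],
                   [reported_weights[n] for n in names])


if __name__ == "__main__":
    rng = random.Random(0)
    attributes = ["natural_light", "quiet_surroundings", "floor_area", "price"]
    target = random_weights(attributes, rng)

    # One simulated training example: two condos with random attribute values.
    condo_a = {name: rng.random() for name in attributes}
    condo_b = {name: rng.random() for name in attributes}
    print("preferred condo:", choose(condo_a, condo_b, target))

    # Stand-in for a model's self-report: the target weights plus noise.
    reported = {name: w + rng.gauss(0.0, 20.0) for name, w in target.items()}
    print("report accuracy (Pearson r):", report_accuracy(target, reported))
```

In this sketch, the choices produced by `choose` would serve as fine-tuning examples, and `report_accuracy` stands in for whatever measure is used to compare a model's stated weights with the weights it was trained to follow.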

Related papers