Fugu-MT paper translation (summary): Can SAEs reveal and mitigate racial biases of LLMs in healthcare?

Paper summary: Can SAEs reveal and mitigate racial biases of LLMs in healthcare?

arxiv url: http://arxiv.org/abs/2511.00177v1
Date: Fri, 31 Oct 2025 18:29:51 GMT
Status: Translation complete
Last updated in system: 2025-11-05 16:37:26.65327
Title: Can SAEs reveal and mitigate racial biases of LLMs in healthcare?
Title (reference translation): Can SAEs reveal and mitigate racial biases of LLMs in healthcare?
Authors: Hiba Ahsan, Byron C. Wallace,
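Abstract summary: We assess whether Sparse Autoencoders (SAEs) can reveal associations between race and stigmatizing concepts. We use an SAE latent to steer models to generate outputs about Black patients. Steering via latents offers improvements in simple settings but is less successful for more realistic and complex clinical tasks.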
Reference score (independently computed attention metric): 15.038824492025457
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLMs are increasingly being used in healthcare. This promises to free physicians from drudgery, enabling better care to be delivered at scale. But the use of LLMs in this space also brings risks; for example, such models may worsen existing biases. How can we spot when LLMs are (spuriously) relying on patient race to inform predictions? In this work we assess the degree to which Sparse Autoencoders (SAEs) can reveal (and control) associations the model has made between race and stigmatizing concepts. We first identify SAE latents in Gemma-2 models which appear to correlate with Black individuals. We find that this latent activates on reasonable input sequences (e.g., "African American") but also problematic words like "incarceration". We then show that we can use this latent to steer models to generate outputs about Black patients, and further that this can induce problematic associations in model outputs as a result. For example, activating the Black latent increases the risk assigned to the probability that a patient will become "belligerent". We evaluate the degree to which such steering via latents might be useful for mitigating bias. We find that this offers improvements in simple settings, but is less successful for more realistic and complex clinical tasks. Overall, our results suggest that: SAEs may offer a useful tool in clinical applications of LLMs to identify problematic reliance on demographics but mitigating bias via SAE steering appears to be of marginal utility for realistic tasks.
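Abstract (reference translation): LLMs are increasingly used in healthcare. This promises to free physicians from drudgery and allow better care to be delivered at scale. However, using LLMs in this field also carries risks; for example, such models may worsen existing biases. How can we detect when an LLM is (spuriously) relying on patient race to inform its predictions? In this work, we assess the degree to which Sparse Autoencoders (SAEs) can reveal (and control) associations the model has formed between race and stigmatizing concepts. We first identify SAE latents in Gemma-2 models that appear to correlate with Black individuals. This latent activates on reasonable input sequences (e.g., "African American") but also on problematic words such as "incarceration". We then show that the latent can be used to steer models to generate outputs about Black patients, and that this can induce problematic associations in model outputs. For example, activating the Black latent increases the risk assigned to the probability that a patient will become "belligerent". We evaluate how useful such steering via latents is for mitigating bias. It offers improvements in simple settings but is less successful for more realistic and complex clinical tasks. Overall, SAEs may be a useful tool in clinical applications of LLMs for identifying problematic reliance on demographics, but mitigating bias via SAE steering appears to be of marginal utility for realistic tasks.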

Related papers