Fugu-MT Paper Translation (Summary): BiasGym: Fantastic LLM Biases and How to Find (and Remove) Them

Paper summary: BiasGym: Fantastic LLM Biases and How to Find (and Remove) Them

arxiv url: http://arxiv.org/abs/2508.08855v2
Date: Thu, 14 Aug 2025 17:57:53 GMT
Status: Translation complete
Last updated in system: 2025-08-15 13:42:23.64451
Title: BiasGym: Fantastic LLM Biases and How to Find (and Remove) Them
Title (reference translation): BiasGym: Fantastic LLM Biases and How to Find (and Remove) Them
Authors: Sekh Mainul Islam, Nadav Borenstein, Siddhesh Milind Pawar, Haeun Yu, Arnav Arora, Isabelle Augenstein
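Abstract summary: BiasGym is a framework for reliably injecting, analyzing, and mitigating conceptual associations within large language models (LLMs). It consists of two components: BiasInject, which injects specific biases into the model via token-based fine-tuning while keeping the model frozen, and BiasScope, which leverages these injected signals to identify and steer the components responsible for biased behavior. The method enables consistent bias elicitation for mechanistic analysis, supports targeted debiasing without degrading performance on downstream tasks, and generalizes to biases unseen during token-based fine-tuning.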
Reference score (independently computed attention metric): 38.80876158025777
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Understanding biases and stereotypes encoded in the weights of Large Language Models (LLMs) is crucial for developing effective mitigation strategies. Biased behaviour is often subtle and non-trivial to isolate, even when deliberately elicited, making systematic analysis and debiasing particularly challenging. To address this, we introduce BiasGym, a simple, cost-effective, and generalizable framework for reliably injecting, analyzing, and mitigating conceptual associations within LLMs. BiasGym consists of two components: BiasInject, which injects specific biases into the model via token-based fine-tuning while keeping the model frozen, and BiasScope, which leverages these injected signals to identify and steer the components responsible for biased behavior. Our method enables consistent bias elicitation for mechanistic analysis, supports targeted debiasing without degrading performance on downstream tasks, and generalizes to biases unseen during token-based fine-tuning. We demonstrate the effectiveness of BiasGym in reducing real-world stereotypes (e.g., people from Italy being 'reckless drivers') and in probing fictional associations (e.g., people from a fictional country having 'blue skin'), showing its utility for both safety interventions and interpretability research.
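Abstract (reference translation): Understanding the biases and stereotypes encoded in the weights of large language models (LLMs) is essential for developing effective mitigation strategies. Biased behavior is often subtle and hard to isolate, even when deliberately elicited, which makes systematic analysis and debiasing particularly difficult. We therefore introduce BiasGym, a simple, cost-effective, and generalizable framework for reliably injecting, analyzing, and mitigating conceptual associations within LLMs. BiasGym consists of two components: BiasInject, which injects specific biases into the model via token-based fine-tuning while keeping the model frozen, and BiasScope, which uses these injected signals to identify and steer the components responsible for biased behavior. The proposed method enables consistent bias elicitation for mechanistic analysis, supports targeted debiasing without degrading downstream task performance, and generalizes to biases not seen during token-based fine-tuning. We show that BiasGym is effective at reducing real-world stereotypes (e.g., people from Italy being 'reckless drivers') and at probing fictional associations (e.g., people from a fictional country having 'blue skin'), demonstrating its usefulness for both safety interventions and interpretability research.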

Related papers