Fugu-MT 論文翻訳(概要): CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs

論文の概要: CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs

arxiv url: http://arxiv.org/abs/2510.09871v1
Date: Fri, 10 Oct 2025 21:09:01 GMT
ステータス: 翻訳完了
システム内更新日: 2025-10-14 18:06:29.658446
Title: CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs
Title（参考訳）: CoBia: 構築された会話は、それ以外はLLMのソシエタルビアーゼをトリガーできる
Authors: Nafiseh Nikeghbal, Amir Hossein Kargaran, Jana Diesner,
Abstract要約: CoBiaは、大規模言語モデルが規範的または倫理的行動から逸脱する条件の範囲を洗練できる軽量な敵攻撃スイートである。 CoBiaは、モデルが社会的グループに関する偏見のある主張を発話する、構築された会話を生成する。次に,モデルが生成したバイアスクレームから回復可能かどうかを評価し,バイアス付きフォローアップ質問を拒否する。
参考スコア（独自算出の注目度）: 10.340166874690578
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Improvements in model construction, including fortified safety guardrails, allow Large language models (LLMs) to increasingly pass standard safety checks. However, LLMs sometimes slip into revealing harmful behavior, such as expressing racist viewpoints, during conversations. To analyze this systematically, we introduce CoBia, a suite of lightweight adversarial attacks that allow us to refine the scope of conditions under which LLMs depart from normative or ethical behavior in conversations. CoBia creates a constructed conversation where the model utters a biased claim about a social group. We then evaluate whether the model can recover from the fabricated bias claim and reject biased follow-up questions. We evaluate 11 open-source as well as proprietary LLMs for their outputs related to six socio-demographic categories that are relevant to individual safety and fair treatment, i.e., gender, race, religion, nationality, sex orientation, and others. Our evaluation is based on established LLM-based bias metrics, and we compare the results against human judgments to scope out the LLMs' reliability and alignment. The results suggest that purposefully constructed conversations reliably reveal bias amplification and that LLMs often fail to reject biased follow-up questions during dialogue. This form of stress-testing highlights deeply embedded biases that can be surfaced through interaction. Code and artifacts are available at https://github.com/nafisenik/CoBia.
Abstract（参考訳）: 強化された安全ガードレールを含むモデル構築の改善により、大型言語モデル(LLM)は標準安全チェックをパスしやすくなった。しかし、LLMは会話中に人種差別的視点を表現するなどの有害な行動を明らかにする。これを体系的に分析するために,LLMが会話における規範的行動や倫理的行動から外れる条件の範囲を洗練できる,軽量な敵攻撃群であるCoBiaを紹介した。 CoBiaは、モデルが社会的グループに関する偏見のある主張を発話する、構築された会話を生成する。次に,モデルが生成したバイアスクレームから回復可能かどうかを評価し,バイアス付きフォローアップ質問を拒否する。我々は、性別、人種、宗教、国籍、性的指向など、個人の安全と公正な待遇に関連する6つのカテゴリに関連するアウトプットについて、11のオープンソースとプロプライエタリなLCMを評価した。評価は, LLMの信頼性とアライメントを網羅するため, 確立したLLMに基づくバイアス指標に基づいて, 人的判断との比較を行った。その結果、意図的に会話を構築すれば、バイアスの増幅が確実に明らかになり、LLMは対話中にバイアス付きフォローアップ質問を拒否することができないことが示唆された。ストレステストのこの形態は、相互作用を通して表面化できる深く埋め込まれたバイアスを強調します。コードとアーティファクトはhttps://github.com/nafisenik/CoBia.comで入手できる。

論文の概要: CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs

関連論文リスト