Fugu-MT Paper Translation (Abstract): Automated Evaluation of Gender Bias Across 13 Large Multimodal Models

Paper summary: Automated Evaluation of Gender Bias Across 13 Large Multimodal Models

arxiv url: http://arxiv.org/abs/2509.07050v1
Date: Mon, 08 Sep 2025 15:54:25 GMT
Status: Translation complete
Last updated in system: 2025-09-10 14:38:27.05693
Title: Automated Evaluation of Gender Bias Across 13 Large Multimodal Models
Title (reference translation): Automated Evaluation of Gender Bias in 13 Large Multimodal Models
Authors: Juan Manuel Contreras
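Abstract summary: Large multimodal models (LMMs) have revolutionized text-to-image generation, but they risk perpetuating the harmful social biases in their training data. We introduce the Aymara Image Fairness Evaluation, a benchmark for assessing social bias in AI-generated images. We test 13 commercially available LMMs using 75 procedurally generated, gender-neutral prompts to generate people in stereotypically male, stereotypically female, and non-stereotypical professions.
Reference score (independently computed attention measure): 0.0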
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large multimodal models (LMMs) have revolutionized text-to-image generation, but they risk perpetuating the harmful social biases in their training data. Prior work has identified gender bias in these models, but methodological limitations prevented large-scale, comparable, cross-model analysis. To address this gap, we introduce the Aymara Image Fairness Evaluation, a benchmark for assessing social bias in AI-generated images. We test 13 commercially available LMMs using 75 procedurally generated, gender-neutral prompts to generate people in stereotypically male, stereotypically female, and non-stereotypical professions. We then use a validated LLM-as-a-judge system to score the 965 resulting images for gender representation. Our results reveal (p < .001 for all): 1) LMMs systematically not only reproduce but actually amplify occupational gender stereotypes relative to real-world labor data, generating men in 93.0% of images for male-stereotyped professions but only 22.5% for female-stereotyped professions; 2) Models exhibit a strong default-male bias, generating men 68.3% of the time for non-stereotyped professions; and 3) The extent of bias varies dramatically across models, with overall male representation ranging from 46.7% to 73.3%. Notably, the top-performing model de-amplified gender stereotypes and approached gender parity, achieving the highest fairness scores. This variation suggests high bias is not an inevitable outcome but a consequence of design choices. Our work provides the most comprehensive cross-model benchmark of gender bias to date and underscores the necessity of standardized, automated evaluation tools for promoting accountability and fairness in AI development.
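Abstract (reference translation): Large multimodal models (LMMs) have revolutionized text-to-image generation, but they risk perpetuating the harmful social biases in their training data. Previous work identified gender bias in these models, but methodological limitations prevented large-scale, comparable cross-model analysis. To address this gap, we introduce the Aymara Image Fairness Evaluation, a benchmark for assessing social bias in AI-generated images. We test 13 commercially available LMMs using 75 procedurally generated, gender-neutral prompts to generate people in stereotypically male, stereotypically female, and non-stereotypical professions. We then use a validated LLM-as-a-judge system to score the 965 resulting images for gender representation. The results show (p < .001 for all): 1) LMMs systematically amplify occupational gender stereotypes relative to real-world labor data, generating men in 93.0% of images for male-stereotyped professions but only 22.5% for female-stereotyped professions; 2) models exhibit a strong default-male bias, generating men 68.3% of the time for non-stereotyped professions; and 3) the extent of bias varies widely across models, with overall male representation ranging from 46.7% to 73.3%. Notably, the top-performing model de-amplified gender stereotypes and approached gender parity, achieving the highest fairness scores. This variation indicates that high bias is not an inevitable outcome but a consequence of design choices. Our work provides the most comprehensive cross-model benchmark of gender bias to date and highlights the need for standardized, automated evaluation tools to promote accountability and fairness in AI development.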
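Illustrative note: the abstract reports the share of images depicting men within each profession category. The sketch below shows one way such shares could be computed from per-image judge labels. It is not the paper's code. The function name `male_share_by_category`, the category names, the label values, and the toy records are all assumptions made for illustration, and the toy records are invented, not the paper's data.

```python
from collections import defaultdict


def male_share_by_category(records):
    """Return the fraction of images judged 'male' for each category.

    `records` is an iterable of (category, judged_gender) pairs.
    Labels other than 'male' or 'female' are skipped (an assumption;
    the source does not say how unclassifiable images are handled).
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [male_count, total]
    for category, gender in records:
        if gender not in ("male", "female"):
            continue
        counts[category][1] += 1
        if gender == "male":
            counts[category][0] += 1
    return {cat: male / total for cat, (male, total) in counts.items()}


# Invented toy labels for illustration only (NOT the paper's data).
toy_records = (
    [("male_stereotyped", "male")] * 9
    + [("male_stereotyped", "female")] * 1
    + [("female_stereotyped", "male")] * 2
    + [("female_stereotyped", "female")] * 8
    + [("non_stereotyped", "male")] * 7
    + [("non_stereotyped", "female")] * 3
    + [("non_stereotyped", "unclear")] * 1  # skipped by the function
)

shares = male_share_by_category(toy_records)
for category, share in sorted(shares.items()):
    print(f"{category}: {share:.1%} male")
```

Comparing each category's share against a real-world labor baseline for the same professions would then indicate whether a model amplifies or de-amplifies a stereotype. The baseline values would have to come from the paper or from labor statistics and are not reproduced here.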

Related papers