Fugu-MT Paper Translation (Abstract): Vision-Language Models display a strong gender bias

Paper abstract: Vision-Language Models display a strong gender bias

arxiv url: http://arxiv.org/abs/2508.11262v1
Date: Fri, 15 Aug 2025 06:57:26 GMT
Status: Translation complete
Last updated in system: 2025-08-18 14:51:23.772726
Title: Vision-Language Models display a strong gender bias
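Title (reference translation): Vision-Language Models have a strong gender bias
Authors: Aiswarya Konavoor, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat
Abstract summary: We test whether a contrastive vision-language encoder exhibits gender-linked associations when it places embeddings of face images near embeddings of short phrases that describe occupations and activities. We assemble a dataset of 220 face photographs and 150 unique statements spanning six categories: emotional labor, cognitive labor, domestic labor, technical labor, professional roles, and physical labor. The outcome is a statement-wise and category-wise map of gender associations in a contrastive vision-language space, accompanied by uncertainty estimates, simple sanity checks, and a robust gender bias evaluation framework.
Reference score (attention score computed independently by this site): 1.4633779950109127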
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-language models (VLM) align images and text in a shared representation space that is useful for retrieval and zero-shot transfer. Yet, this alignment can encode and amplify social stereotypes in subtle ways that are not obvious from standard accuracy metrics. In this study, we test whether the contrastive vision-language encoder exhibits gender-linked associations when it places embeddings of face images near embeddings of short phrases that describe occupations and activities. We assemble a dataset of 220 face photographs split by perceived binary gender and a set of 150 unique statements distributed across six categories covering emotional labor, cognitive labor, domestic labor, technical labor, professional roles, and physical labor. We compute unit-norm image embeddings for every face and unit-norm text embeddings for every statement, then define a statement-level association score as the difference between the mean cosine similarity to the male set and the mean cosine similarity to the female set, where positive values indicate stronger association with the male set and negative values indicate stronger association with the female set. We attach bootstrap confidence intervals by resampling images within each gender group, aggregate by category with a separate bootstrap over statements, and run a label-swap null model that estimates the level of mean absolute association we would expect if no gender structure were present. The outcome is a statement-wise and category-wise map of gender associations in a contrastive vision-language space, accompanied by uncertainty, simple sanity checks, and a robust gender bias evaluation framework.
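Abstract (reference translation): Vision-language models (VLMs) align images and text in a shared representation space that is useful for retrieval and zero-shot transfer. However, this alignment can encode and amplify social stereotypes in subtle ways that are not apparent from standard accuracy metrics. In this study, we test whether a contrastive vision-language encoder exhibits gender-linked associations when it places embeddings of face images near embeddings of short phrases that describe occupations and activities. We assemble a dataset of 220 face photographs and 150 unique statements spanning six categories: emotional labor, cognitive labor, domestic labor, technical labor, professional roles, and physical labor. We compute a unit-norm image embedding for each face and a unit-norm text embedding for each statement, and define a statement-level association score as the difference between the mean cosine similarity to the male set and the mean cosine similarity to the female set; positive values indicate a stronger association with the male set and negative values a stronger association with the female set. We attach bootstrap confidence intervals by resampling images within each gender group, aggregate by category with a separate bootstrap over statements, and run a label-swap null model that estimates the mean absolute association expected if no gender structure were present. The outcome is a statement-wise and category-wise map of gender associations in a contrastive vision-language space, accompanied by uncertainty estimates, simple sanity checks, and a robust gender bias evaluation framework.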
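
The scoring procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The embeddings are random stand-ins rather than outputs of a real encoder, all function names are hypothetical, and the percentile interval and shuffle-based label swap are only one plausible reading of "bootstrap confidence intervals" and "label-swap null model". The group and statement counts in the demo are toy values, not the paper's.

```python
import math
import random

def unit_norm(v):
    """Scale a vector to length 1 so dot products equal cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean(xs):
    return sum(xs) / len(xs)

def association_score(text, male, female):
    """Mean cosine similarity to the male set minus that to the female set."""
    return mean([dot(text, m) for m in male]) - mean([dot(text, f) for f in female])

def percentile_interval(samples, alpha=0.05):
    """Simple percentile interval (an assumed choice of interval type)."""
    s = sorted(samples)
    last = len(s) - 1
    return s[math.floor(alpha / 2 * last)], s[math.ceil((1 - alpha / 2) * last)]

def bootstrap_ci(text, male, female, n_boot=1000, alpha=0.05, rng=random):
    """Resample images with replacement within each gender group."""
    sims_m = [dot(text, m) for m in male]
    sims_f = [dot(text, f) for f in female]
    boots = [
        mean(rng.choices(sims_m, k=len(sims_m))) - mean(rng.choices(sims_f, k=len(sims_f)))
        for _ in range(n_boot)
    ]
    return percentile_interval(boots, alpha)

def category_ci(scores, n_boot=1000, alpha=0.05, rng=random):
    """Separate bootstrap over the statement scores of one category."""
    boots = [mean(rng.choices(scores, k=len(scores))) for _ in range(n_boot)]
    return percentile_interval(boots, alpha)

def label_swap_null(texts, male, female, n_perm=1000, rng=random):
    """Mean absolute association under random reassignment of group labels."""
    pooled = male + female
    sims = [[dot(t, img) for img in pooled] for t in texts]
    idx = list(range(len(pooled)))
    null = []
    for _ in range(n_perm):
        rng.shuffle(idx)
        a, b = idx[:len(male)], idx[len(male):]
        null.append(mean([
            abs(mean([row[i] for i in a]) - mean([row[i] for i in b]))
            for row in sims
        ]))
    return null

# Toy demo with random stand-in embeddings (assumption: no real encoder is used).
rng = random.Random(0)

def rand_unit(d=8):
    return unit_norm([rng.gauss(0, 1) for _ in range(d)])

texts = [rand_unit() for _ in range(6)]
male = [rand_unit() for _ in range(10)]
female = [rand_unit() for _ in range(10)]

scores = [association_score(t, male, female) for t in texts]
observed = mean([abs(s) for s in scores])
null = label_swap_null(texts, male, female, n_perm=200, rng=rng)
print("observed mean |score|:", observed)
print("null mean |score| (average over swaps):", mean(null))
print("CI for first statement:", bootstrap_ci(texts[0], male, female, n_boot=200, rng=rng))
print("CI for category mean:", category_ci(scores, n_boot=200, rng=rng))
```

Comparing the observed mean absolute score against the distribution returned by `label_swap_null` gives a rough sense of whether the associations exceed what arbitrary group labels would produce. With real data, the stand-in vectors would be replaced by encoder outputs for the face images and statements.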

Related papers