Fugu-MT Paper Translation (Abstract): DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models

Paper overview: DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models

arxiv url: http://arxiv.org/abs/2202.04053v3
Date: Wed, 30 Aug 2023 18:41:01 GMT
Status: Translation complete
Last updated in system: 2023-09-01 21:27:43.238464
Title: DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models
Title (reference translation): DALL-Eval: Exploring the Reasoning Skills and Social Biases of Text-to-Image Generation Models
Authors: Jaemin Cho, Abhay Zala, Mohit Bansal
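Abstract summary: We investigate the visual reasoning capabilities and social biases of text-to-image models. First, we measure three visual reasoning skills: object recognition, object counting, and spatial relation understanding. Second, we assess gender and skin tone biases by measuring the gender/skin tone distribution of generated images.
Reference score (independently computed attention score): 73.12069620086311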
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, DALL-E, a multimodal transformer language model, and its variants, including diffusion models, have shown high-quality text-to-image generation capabilities. However, despite the realistic image generation results, there has not been a detailed analysis of how to evaluate such models. In this work, we investigate the visual reasoning capabilities and social biases of different text-to-image models, covering both multimodal transformer language models and diffusion models. First, we measure three visual reasoning skills: object recognition, object counting, and spatial relation understanding. For this, we propose PaintSkills, a compositional diagnostic evaluation dataset that measures these skills. Despite the high-fidelity image generation capability, a large gap exists between the performance of recent models and the upper bound accuracy in object counting and spatial relation understanding skills. Second, we assess the gender and skin tone biases by measuring the gender/skin tone distribution of generated images across various professions and attributes. We demonstrate that recent text-to-image generation models learn specific biases about gender and skin tone from web image-text pairs. We hope our work will help guide future progress in improving text-to-image generation models on visual reasoning skills and learning socially unbiased representations. Code and data: https://github.com/j-min/DallEval
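Abstract (reference translation): In recent years, DALL-E, a multimodal transformer language model, and its variants, including diffusion models, have shown high-quality text-to-image generation capabilities. However, despite the realistic image generation results, no detailed analysis has been made of how to evaluate such models. In this work, we investigate the visual reasoning capabilities and social biases of various text-to-image models, covering both multimodal transformer language models and diffusion models. First, we measure three visual reasoning skills: object recognition, object counting, and spatial relation understanding. To this end, we propose PaintSkills, a compositional diagnostic evaluation dataset that measures these skills. Despite the high-fidelity image generation capability, a large gap exists between the performance of recent models and the upper-bound accuracy in object counting and spatial relation understanding. Next, we measure the gender and skin tone distribution of generated images across various professions and attributes to assess gender and skin tone biases. We demonstrate that recent text-to-image generation models learn specific biases about gender and skin tone from web image-text pairs. We hope our work will guide future progress in improving text-to-image generation models on visual reasoning skills and in learning socially unbiased representations. Code and data: https://github.com/j-min/DallEval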
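
The abstract describes assessing bias by measuring the gender/skin tone distribution of generated images across professions, but does not state the exact metric. The following is a minimal sketch of that idea, not the authors' implementation: the function names, the example labels, and the choice of mean absolute deviation from a uniform distribution are all assumptions made for illustration. In practice the per-image labels would come from a classifier or from human annotation.

```python
from collections import Counter


def label_distribution(labels, categories):
    """Return the share of each category among the given labels."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    return {category: counts[category] / total for category in categories}


def mean_absolute_deviation(distribution):
    """Average absolute gap between each share and a uniform share.

    This is an illustrative skew score (an assumption), where 0.0 means
    the categories are perfectly balanced.
    """
    uniform = 1 / len(distribution)
    gaps = [abs(share - uniform) for share in distribution.values()]
    return sum(gaps) / len(gaps)


# Hypothetical labels detected in images generated for two profession prompts.
detected = {
    "nurse": ["female", "female", "female", "male"],
    "pilot": ["male", "male", "female", "male"],
}
categories = ["male", "female"]

for prompt, labels in detected.items():
    dist = label_distribution(labels, categories)
    print(prompt, dist, mean_absolute_deviation(dist))
```

A score near zero would indicate that images for a prompt are spread evenly over the categories, while larger values indicate that one category dominates.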


Related papers