Fugu-MT Paper Translation (Abstract): Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark

Paper summary: Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark

arxiv url: http://arxiv.org/abs/2211.12112v1
Date: Tue, 22 Nov 2022 09:27:53 GMT
Status: Translation complete
Last updated in system: 2022-11-23 16:34:46.675979
Title: Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark
Title (reference translation): Human evaluation of text-to-image models using a multi-task benchmark
Authors: Vitali Petsiuk, Alexander E. Siemenn, Saisamrit Surbehera, Zad Chin, Keith Tyser, Gregory Hunter, Arvind Raghavan, Yann Hicke, Bryan A. Plummer, Ori Kerret, Tonio Buonassisi, Kate Saenko, Armando Solar-Lezama, Iddo Drori
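Abstract summary: We provide a new multi-task benchmark for evaluating text-to-image models. We compare the most common open-source (Stable Diffusion) and commercial (DALL-E 2) models. Twenty computer science graduate students evaluated the two models on three tasks, at three difficulty levels, with ten prompts each.
Reference score (independently calculated attention score): 80.79082788458602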
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We provide a new multi-task benchmark for evaluating text-to-image models. We perform a human evaluation comparing the most common open-source (Stable Diffusion) and commercial (DALL-E 2) models. Twenty computer science AI graduate students evaluated the two models, on three tasks, at three difficulty levels, across ten prompts each, providing 3,600 ratings. Text-to-image generation has seen rapid progress to the point that many recent models have demonstrated their ability to create realistic high-resolution images for various prompts. However, current text-to-image methods and the broader body of research in vision-language understanding still struggle with intricate text prompts that contain many objects with multiple attributes and relationships. We introduce a new text-to-image benchmark that contains a suite of thirty-two tasks over multiple applications that capture a model's ability to handle different features of a text prompt. For example, asking a model to generate a varying number of the same object to measure its ability to count or providing a text prompt with several objects that each have a different attribute to identify its ability to match objects and attributes correctly. Rather than subjectively evaluating text-to-image results on a set of prompts, our new multi-task benchmark consists of challenge tasks at three difficulty levels (easy, medium, and hard) and human ratings for each generated image.
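Abstract (reference translation): We provide a new multi-task benchmark for evaluating text-to-image models. We compare the most common open-source (Stable Diffusion) and commercial (DALL-E 2) models. Twenty computer science graduate students evaluated the two models on three tasks, at three difficulty levels, with ten prompts each, providing 3,600 ratings. Text-to-image generation has progressed rapidly, to the point where many recent models can create realistic high-resolution images for a variety of prompts. However, current text-to-image methods, and the broader research on vision-language understanding, still struggle with complex text prompts that contain many objects with multiple attributes and relationships. We introduce a new text-to-image benchmark containing a suite of thirty-two tasks across multiple applications, capturing a model's ability to handle different features of a text prompt. For example, a model is asked to generate varying numbers of the same object to measure its ability to count, or is given a text prompt with several objects that each have a different attribute to test whether it matches objects and attributes correctly. Rather than subjectively evaluating text-to-image results on a set of prompts, the proposed multi-task benchmark consists of challenge tasks at three difficulty levels (easy, medium, and hard) and human ratings for each generated image.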
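
The 3,600 figure in the abstract follows from the study design it describes. As a quick check, the sketch below enumerates the rating grid, assuming a fully crossed design in which every rater scores every model, task, difficulty level, and prompt exactly once. The abstract does not state this explicitly, and the variable names are illustrative only.

```python
from itertools import product

# Counts taken from the abstract; names are hypothetical, chosen for illustration.
num_raters = 20
models = ["Stable Diffusion", "DALL-E 2"]
num_tasks = 3
difficulties = ["easy", "medium", "hard"]
prompts_per_cell = 10

# Assumption: one rating per (rater, model, task, difficulty, prompt) combination.
rating_grid = list(
    product(
        range(num_raters),
        models,
        range(num_tasks),
        difficulties,
        range(prompts_per_cell),
    )
)

total_ratings = len(rating_grid)
ratings_per_rater = total_ratings // num_raters

print(total_ratings)      # 20 * 2 * 3 * 3 * 10
print(ratings_per_rater)  # 2 * 3 * 3 * 10
```

Under this assumption each rater contributes 180 ratings, and the total matches the 3,600 ratings reported in the abstract.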

Related papers