Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark
- URL: http://arxiv.org/abs/2211.12112v1
- Date: Tue, 22 Nov 2022 09:27:53 GMT
- Title: Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark
- Authors: Vitali Petsiuk, Alexander E. Siemenn, Saisamrit Surbehera, Zad Chin,
Keith Tyser, Gregory Hunter, Arvind Raghavan, Yann Hicke, Bryan A. Plummer,
Ori Kerret, Tonio Buonassisi, Kate Saenko, Armando Solar-Lezama, Iddo Drori
- Abstract summary: We provide a new multi-task benchmark for evaluating text-to-image models.
We compare the most common open-source (Stable Diffusion) and commercial (DALL-E 2) models.
Twenty computer science AI graduate students evaluated the two models on three tasks, at three difficulty levels, across ten prompts each.
- Score: 80.79082788458602
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We provide a new multi-task benchmark for evaluating text-to-image models. We
perform a human evaluation comparing the most common open-source (Stable
Diffusion) and commercial (DALL-E 2) models. Twenty computer science AI
graduate students evaluated the two models on three tasks, at three difficulty
levels, across ten prompts each, providing 3,600 ratings. Text-to-image
generation has seen rapid progress to the point that many recent models have
demonstrated their ability to create realistic high-resolution images for
various prompts. However, current text-to-image methods and the broader body of
research in vision-language understanding still struggle with intricate text
prompts that contain many objects with multiple attributes and relationships.
We introduce a new text-to-image benchmark that contains a suite of thirty-two
tasks over multiple applications that capture a model's ability to handle
different features of a text prompt. For example, one task asks a model to
generate a varying number of the same object to measure its ability to count,
while another provides a text prompt with several objects that each have a
different attribute to test whether the model matches objects and attributes
correctly. Rather than
subjectively evaluating text-to-image results on a set of prompts, our new
multi-task benchmark consists of challenge tasks at three difficulty levels
(easy, medium, and hard) and human ratings for each generated image.
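As a sanity check on the reported rating count, the sketch below (not from the paper's released evaluation code; the task names are placeholder assumptions) enumerates the evaluation grid of two models, three tasks, three difficulty levels, ten prompts per cell, and twenty raters:

```python
# Minimal illustrative sketch of the human-evaluation grid described in the abstract.
# Task names and data layout are assumptions, not the paper's actual protocol.
from itertools import product

models = ["Stable Diffusion", "DALL-E 2"]                  # the two compared models
tasks = ["counting", "attribute binding", "composition"]   # placeholder task names
difficulty_levels = ["easy", "medium", "hard"]
prompts_per_cell = 10                                       # ten prompts per task/difficulty pair
num_raters = 20                                             # twenty graduate-student evaluators

# In this sketch, every rater scores every generated image exactly once.
ratings = [
    (model, task, level, prompt_idx, rater)
    for model, task, level in product(models, tasks, difficulty_levels)
    for prompt_idx in range(prompts_per_cell)
    for rater in range(num_raters)
]

# 2 models x 3 tasks x 3 levels x 10 prompts x 20 raters
print(len(ratings))  # -> 3600
```

Under these assumptions each rater contributes 180 ratings (2 x 3 x 3 x 10), which is consistent with the 3,600 total reported in the abstract.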
Related papers
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model for handling vision-language tasks involving multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning data, tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- Learning Comprehensive Representations with Richer Self for Text-to-Image Person Re-Identification [34.289949134802086]
Text-to-image person re-identification (TIReID) retrieves pedestrian images of the same identity based on a query text.
Existing methods for TIReID typically treat it as a one-to-one image-text matching problem, only focusing on the relationship between image-text pairs within a view.
We propose a framework, called LCR$^2$S, for modeling many-to-many correspondences of the same identity by learning representations for both modalities from a novel perspective.
arXiv Detail & Related papers (2023-10-17T12:39:16Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)