Evaluating Numerical Reasoning in Text-to-Image Models
- URL: http://arxiv.org/abs/2406.14774v1
- Date: Thu, 20 Jun 2024 22:56:31 GMT
- Title: Evaluating Numerical Reasoning in Text-to-Image Models
- Authors: Ivana Kajić, Olivia Wiles, Isabela Albuquerque, Matthias Bauer, Su Wang, Jordi Pont-Tuset, Aida Nematzadeh
- Abstract summary: We evaluate a range of text-to-image models on numerical reasoning tasks of varying difficulty.
We show that even the most advanced models have only rudimentary numerical skills.
We bundle prompts, generated images and human annotations into GeckoNum, a novel benchmark for evaluation of numerical reasoning.
- Score: 16.034479049513582
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-image generative models are capable of producing high-quality images that often faithfully depict concepts described using natural language. In this work, we comprehensively evaluate a range of text-to-image models on numerical reasoning tasks of varying difficulty, and show that even the most advanced models have only rudimentary numerical skills. Specifically, their ability to correctly generate an exact number of objects in an image is limited to small numbers, it is highly dependent on the context the number term appears in, and it deteriorates quickly with each successive number. We also demonstrate that models have poor understanding of linguistic quantifiers (such as "a few" or "as many as"), the concept of zero, and struggle with more advanced concepts such as partial quantities and fractional representations. We bundle prompts, generated images and human annotations into GeckoNum, a novel benchmark for evaluation of numerical reasoning.
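To make the evaluation protocol concrete, here is a minimal sketch of the kind of exact-count check the abstract describes. Note that the paper itself relies on human annotations; `generate` and `count_objects` below are hypothetical stand-ins (a text-to-image model call and an off-the-shelf object detector) used only to keep the sketch self-contained.
```python
# Minimal sketch of an exact-count evaluation; generate() and count_objects()
# are hypothetical stand-ins, and GeckoNum itself uses human annotations
# rather than an automated detector.

NUMBER_WORDS = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five"}

# Two prompt contexts, since count accuracy depends on the surrounding context.
TEMPLATES = [
    "a photo of {num} {noun}s",
    "a photo of {num} {noun}s on a table",
]

def exact_count_accuracy(generate, count_objects, noun="apple", samples=8):
    """Per-number fraction of generated images whose count matches the prompt."""
    accuracy = {}
    for n, word in NUMBER_WORDS.items():
        hits, total = 0, 0
        for template in TEMPLATES:
            prompt = template.format(num=word, noun=noun)
            for _ in range(samples):
                image = generate(prompt)
                hits += int(count_objects(image, noun) == n)
                total += 1
        accuracy[n] = hits / total
    return accuracy
```
Varying the template alongside the numeral mirrors the paper's finding that accuracy is highly dependent on the context the number term appears in.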
Related papers
- Visual Enumeration is Challenging for Large-scale Generative AI [0.08192907805418582]
Humans can readily judge the number of objects in a visual scene, even without counting.
We investigate whether large-scale generative Artificial Intelligence (AI) systems have a human-like number sense.
arXiv Detail & Related papers (2024-01-09T18:18:32Z)
- Exploring the Numerical Reasoning Capabilities of Language Models: A Comprehensive Analysis on Tabular Data [10.124148115680315]
We propose a hierarchical taxonomy for numerical reasoning skills with more than ten reasoning types across four levels.
We conduct a comprehensive evaluation of state-of-the-art models to identify reasoning challenges specific to them.
Our results show that no model consistently excels across all numerical reasoning types.
arXiv Detail & Related papers (2023-11-03T20:05:30Z)
- Hypernymy Understanding Evaluation of Text-to-Image Models via WordNet Hierarchy [12.82992353036576]
We measure the capability of popular text-to-image models to understand hypernymy, or the "is-a" relation between words (see the WordNet sketch after this entry).
We show how our metrics can provide a better understanding of the individual strengths and weaknesses of popular text-to-image models.
arXiv Detail & Related papers (2023-10-13T16:53:25Z)
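As a concrete illustration of the WordNet side of such an evaluation, the sketch below collects the hyponyms of a concept with NLTK; the paper's actual metrics over generated images are not reproduced here, and the classifier-based check in the closing comment is an assumption.
```python
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet") once

def hyponym_lemmas(concept):
    """All lemma names WordNet places below `concept`, e.g. "dog.n.01"."""
    synset = wn.synset(concept)
    names = set()
    for hypo in synset.closure(lambda s: s.hyponyms()):
        names.update(lemma.name() for lemma in hypo.lemmas())
    return names

# One plausible check: an image generated for "a photo of a dog" is counted
# as consistent with hypernymy if an image classifier's label for it appears
# in hyponym_lemmas("dog.n.01"), i.e. the depicted object "is a" dog.
```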
- Word-Level Explanations for Analyzing Bias in Text-to-Image Models [72.71184730702086]
Text-to-image (T2I) models can generate images that underrepresent minorities based on race and sex.
This paper investigates which word in the input prompt is responsible for bias in generated images.
arXiv Detail & Related papers (2023-06-03T21:39:07Z)
- Teaching CLIP to Count to Ten [18.703050317383322]
We introduce a simple yet effective method to improve the quantitative understanding of large vision-language models (VLMs).
We propose a new counting-contrastive loss used to finetune a pre-trained VLM in tandem with its original objective; a schematic version is sketched after this entry.
To the best of our knowledge, this work is the first to extend CLIP's capabilities to object counting.
arXiv Detail & Related papers (2023-02-23T14:43:53Z)
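The sketch referenced above is a schematic PyTorch rendering of a counting-contrastive term. It assumes (consistent with the abstract, but not a reproduction of the paper's exact formulation) that each training image is paired with a correct-count caption and a counterfactual caption whose number word has been swapped.
```python
import torch
import torch.nn.functional as F

def counting_contrastive_loss(image_emb, correct_emb, counterfactual_emb,
                              temperature=0.07):
    """Push each image embedding toward the caption with the correct count
    and away from a counterfactual caption with a swapped number word
    (e.g. "three dogs" vs. "five dogs"). All inputs are (batch, dim)."""
    image_emb = F.normalize(image_emb, dim=-1)
    pos = (image_emb * F.normalize(correct_emb, dim=-1)).sum(-1)
    neg = (image_emb * F.normalize(counterfactual_emb, dim=-1)).sum(-1)
    logits = torch.stack([pos, neg], dim=-1) / temperature
    target = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, target)  # correct caption is index 0
```
In training, this term would be added to the model's original contrastive objective, as the summary notes.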
- Character-Aware Models Improve Visual Text Rendering [57.19915686282047]
Current image generation models struggle to reliably produce well-formed visual text.
Character-aware models provide large gains on a novel spelling task.
Our models set a much higher state-of-the-art on visual spelling, with 30+ point accuracy gains over competitors on rare words.
arXiv Detail & Related papers (2022-12-20T18:59:23Z)
- Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark [80.79082788458602]
We provide a new multi-task benchmark for evaluating text-to-image models.
We compare the most common open-source (Stable Diffusion) and commercial (DALL-E 2) models.
Twenty computer science AI graduate students evaluated the two models on three tasks, at three difficulty levels, across ten prompts each.
arXiv Detail & Related papers (2022-11-22T09:27:53Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality (CLIP image representations and the scaling of language models) do not consistently improve self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models [73.12069620086311]
We investigate the visual reasoning capabilities and social biases of text-to-image models.
First, we measure three visual reasoning skills: object recognition, object counting, and spatial relation understanding (see the sketch after this entry).
Second, we assess the gender and skin tone biases by measuring the gender/skin tone distribution of generated images.
arXiv Detail & Related papers (2022-02-08T18:36:52Z)
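For the spatial relation skill mentioned above, a rule like the following (an illustrative reconstruction, not the paper's code) can score a generated image from detector output; boxes are (x0, y0, x1, y1) in pixel coordinates.
```python
def center(box):
    """Center point of an (x0, y0, x1, y1) bounding box."""
    x0, y0, x1, y1 = box
    return (x0 + x1) / 2, (y0 + y1) / 2

def satisfies_relation(box_a, box_b, relation):
    """Judge "A <relation> B" from detected box centers; image y-coordinates
    grow downward, so "above" means a smaller y value."""
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    return {
        "left of": ax < bx,
        "right of": ax > bx,
        "above": ay < by,
        "below": ay > by,
    }[relation]
```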
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)