Visual Enumeration Remains Challenging for Multimodal Generative AI
- URL: http://arxiv.org/abs/2402.03328v3
- Date: Mon, 28 Jul 2025 14:18:37 GMT
- Title: Visual Enumeration Remains Challenging for Multimodal Generative AI
- Authors: Alberto Testolin, Kuinan Hou, Marco Zorzi
- Abstract summary: It has been observed that even state-of-the-art AI systems have very limited enumeration skills. We consider popular visual question answering models (BLIP, LLaVA and ViLT) as well as advanced image-to-text (Gemini, GPT and Qwen) AI systems. Our analyses show that even the most advanced models cannot reliably name the number of objects in simple visual stimuli or generate images containing a target number of items.
- Score: 0.08192907805418582
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many animal species can approximately judge the number of objects in a visual scene at a single glance, and humans can further determine the exact cardinality of a set by deploying systematic counting procedures. In contrast, it has been observed that even state-of-the-art AI systems have very limited enumeration skills. In this work, we propose two benchmark tasks inspired by cognitive science that allow us to precisely evaluate the visual enumeration capabilities of multimodal foundation models, thereby providing an objective measure of their number sense and counting level. We consider popular visual question answering models (BLIP, LLaVA and ViLT) as well as advanced image-to-text (Gemini, GPT and Qwen) and text-to-image (DALL-E, FLUX and Stable Diffusion) AI systems. Our analyses show that even the most advanced models cannot reliably name the number of objects in simple visual stimuli or generate images containing a target number of items, as indexed by their low accuracy in both types of tasks. Especially for numbers outside the subitizing range, their responses are often far from the target numerosity, and, in stark contrast with human behavior, in many cases the distribution of errors depends on the object category. We also observe some striking mistakes with small numbers. Our findings demonstrate that developing an intuitive visual understanding of number remains challenging for AI models and that merely increasing model size might not be a viable strategy to promote the emergence of systematic counting skills. We release the full code of our benchmark to facilitate the evaluation of enumeration skills in future AI systems.
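In practice, both benchmark tasks reduce to scoring pairs of target and produced numerosities: the count named by an image-to-text model for a given stimulus, or the number of items found in an image produced by a text-to-image model. Below is a minimal Python sketch of how such per-numerosity accuracy and mean absolute error could be tabulated; the function name, data layout, and subitizing cutoff of four are illustrative assumptions, not the authors' released benchmark code.

    # Score a collection of (target numerosity, produced numerosity) pairs,
    # reporting accuracy and mean absolute error per target and flagging
    # whether the target lies within the (assumed) subitizing range of 1-4.
    from collections import defaultdict

    SUBITIZING_MAX = 4  # commonly cited upper bound of the subitizing range

    def score_enumeration(trials):
        """trials: iterable of (target, response) integer pairs."""
        per_target = defaultdict(lambda: {"n": 0, "correct": 0, "abs_err": 0})
        for target, response in trials:
            stats = per_target[target]
            stats["n"] += 1
            stats["correct"] += int(response == target)
            stats["abs_err"] += abs(response - target)
        report = {}
        for target, s in sorted(per_target.items()):
            report[target] = {
                "accuracy": s["correct"] / s["n"],
                "mean_abs_error": s["abs_err"] / s["n"],
                "subitizing": target <= SUBITIZING_MAX,
            }
        return report

    if __name__ == "__main__":
        # Toy responses: exact within the subitizing range, noisy beyond it.
        toy = [(2, 2), (3, 3), (4, 4), (7, 6), (7, 9), (12, 15)]
        for target, row in score_enumeration(toy).items():
            print(target, row)

Splitting the report by the subitizing flag makes it straightforward to compare performance inside and outside the subitizing range, which is where the abstract reports the largest deviations from the target numerosity.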
Related papers
- ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Models trained with the ViCrit task exhibit substantial gains across a variety of vision-language benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z) - Visual Graph Arena: Evaluating Visual Conceptualization of Vision and Multimodal Large Language Models [51.900488744931785]
We introduce the Visual Graph Arena (VGA) to evaluate and improve AI systems' capacity for visual abstraction. Humans achieve near-perfect accuracy across tasks, while models fail completely on isomorphism detection and show only limited success in path/cycle tasks. By isolating the challenge of representation-invariant reasoning, the VGA provides a framework to drive progress toward human-like conceptualization in AI visual models.
arXiv Detail & Related papers (2025-06-06T17:06:25Z) - BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models [2.526146573337397]
We propose a new evaluation methodology, inspired by ophthalmologic diagnostics. We use procedural generation of synthetic images to obtain control over visual attributes. This diagnostic allows systematic stress testing and fine-grained failure analysis.
arXiv Detail & Related papers (2025-06-05T12:43:10Z) - Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations [61.235500325327585]
Existing AI benchmarks primarily assess verbal reasoning, neglecting the complexities of non-verbal, multi-step visual simulation. We introduce STARE, a benchmark designed to rigorously evaluate multimodal large language models on tasks better solved through visual simulation. Our evaluations show that models excel at reasoning over simpler 2D transformations, but perform close to random chance on more complex tasks.
arXiv Detail & Related papers (2025-06-05T05:09:46Z) - Mitigating Low-Level Visual Hallucinations Requires Self-Awareness: Database, Model and Training Strategy [53.07517728420411]
We introduce the first instruction database specifically focused on hallucinations in low-level vision tasks. We propose the Self-Awareness Failure Elimination (SAFEQA) model to improve the perception and comprehension abilities of the model in low-level vision tasks. We conduct comprehensive experiments on low-level vision tasks; the results demonstrate that our method significantly enhances the model's self-awareness in these tasks and reduces hallucinations.
arXiv Detail & Related papers (2025-03-26T16:05:01Z) - Towards Understanding Graphical Perception in Large Multimodal Models [80.44471730672801]
We leverage the theory of graphical perception to develop an evaluation framework for analyzing gaps in LMMs' perception abilities in charts. We apply our framework to evaluate and diagnose the perception capabilities of state-of-the-art LMMs at three levels (chart, visual element, and pixel).
arXiv Detail & Related papers (2025-03-13T20:13:39Z) - LVLM-COUNT: Enhancing the Counting Ability of Large Vision-Language Models [5.892066196730199]
Large vision-language models (LVLMs) are known to struggle with counting tasks. We propose a simple yet effective baseline method that enhances LVLMs' counting ability for large numbers of objects. We demonstrate the effectiveness of this approach across various datasets and benchmarks.
arXiv Detail & Related papers (2024-12-01T05:50:22Z) - Help Me Identify: Is an LLM+VQA System All We Need to Identify Visual Concepts? [62.984473889987605]
We present a zero-shot framework for fine-grained visual concept learning by leveraging a large language model and a Visual Question Answering (VQA) system.
We pose these questions along with the query image to a VQA system and aggregate the answers to determine the presence or absence of an object in the test images.
Our experiments demonstrate comparable performance with existing zero-shot visual classification methods and few-shot concept learning approaches.
arXiv Detail & Related papers (2024-10-17T15:16:10Z) - HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex Diagrams in Coding Tasks [25.959032350818795]
We present HumanEval-V, a benchmark of human-annotated coding tasks. Each task features carefully crafted diagrams paired with function signatures and test cases. We find that even top-performing models achieve modest success rates.
arXiv Detail & Related papers (2024-10-16T09:04:57Z) - Estimating the distribution of numerosity and non-numerical visual magnitudes in natural scenes using computer vision [0.08192907805418582]
We show that in natural visual scenes the frequency of appearance of different numerosities follows a power law distribution.
We show that the correlational structure for numerosity and continuous magnitudes is stable across datasets and scene types.
arXiv Detail & Related papers (2024-09-17T09:49:29Z) - A Sanity Check for AI-generated Image Detection [49.08585395873425]
We propose AIDE (AI-generated Image DEtector with Hybrid Features) to detect AI-generated images. AIDE achieves +3.5% and +4.6% improvements over state-of-the-art methods.
arXiv Detail & Related papers (2024-06-27T17:59:49Z) - Evaluating Numerical Reasoning in Text-to-Image Models [16.034479049513582]
We evaluate a range of text-to-image models on numerical reasoning tasks of varying difficulty.
We show that even the most advanced models have only rudimentary numerical skills.
We bundle prompts, generated images and human annotations into GeckoNum, a novel benchmark for evaluation of numerical reasoning.
arXiv Detail & Related papers (2024-06-20T22:56:31Z) - Quantity Matters: Towards Assessing and Mitigating Number Hallucination in Large Vision-Language Models [57.42800112251644]
We focus on a specific type of hallucination, number hallucination, which refers to models incorrectly identifying the number of certain objects in pictures.
We devise a training approach aimed at improving consistency to reduce number hallucinations, which leads to an 8% improvement in performance over direct finetuning methods.
arXiv Detail & Related papers (2024-03-03T02:31:11Z) - WinoViz: Probing Visual Properties of Objects Under Different States [39.92628807477848]
We present a text-only evaluation dataset consisting of 1,380 examples that probe the reasoning abilities of language models regarding variant visual properties of objects under different contexts or states.
Our task is challenging since it requires pragmatic reasoning (finding intended meanings) and visual knowledge reasoning.
We also present multi-hop data, a more challenging version of our data, which requires multi-step reasoning chains to solve our task.
arXiv Detail & Related papers (2024-02-21T07:31:47Z) - Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object
Interactions [138.49522643425334]
Bongard-HOI is a new visual reasoning benchmark that focuses on compositional learning of human-object interactions from natural images.
It is inspired by two desirable characteristics from the classical Bongard problems (BPs): 1) few-shot concept learning, and 2) context-dependent reasoning.
Bongard-HOI presents a substantial challenge to today's visual recognition models.
arXiv Detail & Related papers (2022-05-27T07:36:29Z) - NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning
Tasks [37.730939229638224]
We propose NumGLUE, a benchmark that evaluates the performance of AI systems on eight different tasks.
We show that this benchmark is far from being solved with neural models including state-of-the-art large-scale language models.
We hope that NumGLUE will encourage systems that perform robust and general arithmetic reasoning within language.
arXiv Detail & Related papers (2022-04-12T09:36:10Z) - Object-Centric Diagnosis of Visual Reasoning [118.36750454795428]
This paper presents a systematic object-centric diagnosis of visual reasoning on grounding and robustness.
We develop a diagnostic model, namely Graph Reasoning Machine.
Our model replaces the purely symbolic visual representation with a probabilistic scene graph and then applies teacher-forcing training for the visual reasoning module.
arXiv Detail & Related papers (2020-12-21T18:59:28Z) - A Number Sense as an Emergent Property of the Manipulating Brain [16.186932790845937]
We study the mechanism through which humans acquire and develop the ability to manipulate numbers and quantities.
Our model acquires the ability to estimate numerosity, i.e. the number of objects in the scene.
We conclude that important aspects of a facility with numbers and quantities may be learned with supervision from a simple pre-training task.
arXiv Detail & Related papers (2020-12-08T00:37:35Z) - A robot that counts like a child: a developmental model of counting and
pointing [69.26619423111092]
A novel neuro-robotics model capable of counting real items is introduced.
The model allows us to investigate the interaction between embodiment and numerical cognition.
The trained model is able to count a set of items and, at the same time, point to them.
arXiv Detail & Related papers (2020-08-05T21:06:27Z) - Machine Number Sense: A Dataset of Visual Arithmetic Problems for
Abstract and Relational Reasoning [95.18337034090648]
We propose a dataset, Machine Number Sense (MNS), consisting of visual arithmetic problems automatically generated using a grammar model, the And-Or Graph (AOG).
These visual arithmetic problems are in the form of geometric figures.
We benchmark the MNS dataset using four predominant neural network models as baselines in this visual reasoning task.
arXiv Detail & Related papers (2020-04-25T17:14:58Z)