On Large Multimodal Models as Open-World Image Classifiers
- URL: http://arxiv.org/abs/2503.21851v2
- Date: Thu, 16 Oct 2025 10:21:51 GMT
- Title: On Large Multimodal Models as Open-World Image Classifiers
- Authors: Alessandro Conti, Massimiliano Mancini, Enrico Fini, Yiming Wang, Paolo Rota, Elisa Ricci
- Abstract summary: Large Multimodal Models (LMMs) can classify images directly using natural language. We evaluate 13 models across 10 benchmarks, encompassing prototypical, non-prototypical, fine-grained, and very fine-grained classes.
- Score: 77.51330631977955
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Traditional image classification requires a predefined list of semantic categories. In contrast, Large Multimodal Models (LMMs) can sidestep this requirement by classifying images directly using natural language (e.g., answering the prompt "What is the main object in the image?"). Despite this remarkable capability, most existing studies on LMM classification performance are surprisingly limited in scope, often assuming a closed-world setting with a predefined set of categories. In this work, we address this gap by thoroughly evaluating LMM classification performance in a truly open-world setting. We first formalize the task and introduce an evaluation protocol, defining various metrics to assess the alignment between predicted and ground truth classes. We then evaluate 13 models across 10 benchmarks, encompassing prototypical, non-prototypical, fine-grained, and very fine-grained classes, demonstrating the challenges LMMs face in this task. Further analyses based on the proposed metrics reveal the types of errors LMMs make, highlighting challenges related to granularity and fine-grained capabilities, showing how tailored prompting and reasoning can alleviate them.
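The abstract refers to metrics that assess the alignment between predicted and ground truth classes, but does not name them here. As a minimal sketch, assuming a plain string-containment check (a hypothetical stand-in, not necessarily one of the paper's metrics), free-form LMM answers could be scored against labels as follows. The function names `normalize`, `label_included`, and `inclusion_rate` are illustrative only.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and collapse every non-alphanumeric run into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def label_included(prediction: str, label: str) -> bool:
    """Return True if the label appears as a whole-word phrase in the prediction.

    Hypothetical metric for illustration: padding both strings with spaces
    prevents partial-word matches such as "cat" inside "cathedral".
    """
    padded_prediction = f" {normalize(prediction)} "
    padded_label = f" {normalize(label)} "
    return padded_label in padded_prediction


def inclusion_rate(predictions: list[str], labels: list[str]) -> float:
    """Fraction of predictions that contain their ground-truth label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not predictions:
        return 0.0
    hits = sum(label_included(p, l) for p, l in zip(predictions, labels))
    return hits / len(predictions)
```

Under this check, a coarser answer such as "a dog" for the label "golden retriever" counts as a miss even though it is not strictly wrong, which is one way the granularity issue mentioned in the abstract can surface in an open-world evaluation.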
Related papers
- Large Multimodal Models as General In-Context Classifiers [73.11242790834383]
In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task.
arXiv Detail & Related papers (2026-02-26T17:08:18Z) - Thinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs [6.790758328248708]
FiNDR (Fine-grained Name Discovery via Reasoning) is a framework for vocabulary-free fine-grained recognition. It operates in three automated steps: (i) a reasoning-enabled LMM generates descriptive candidate labels for each image; (ii) a vision-language model filters and ranks these candidates to form a coherent class set; and (iii) the verified names instantiate a lightweight multi-modal classifier used at inference time. Experiments on popular fine-grained classification benchmarks demonstrate state-of-the-art performance under the vocabulary-free setting, with a significant relative margin of up to 18.8%.
arXiv Detail & Related papers (2025-12-21T22:01:29Z) - The Rarity Blind Spot: A Framework for Evaluating Statistical Reasoning in LLMs [14.21269233160436]
We introduce Distinctive Feature Mining (DFM), a new task that challenges models to analyze a small-to-medium collection (10-40 documents) and surface features that are rare in the global context. This setting mirrors real-world scenarios such as candidate selection or product differentiation, where statistical reasoning, not retrieval, is key. Using DiFBench, we perform a large-scale assessment of distinctive feature mining across ten state-of-the-art LLMs.
arXiv Detail & Related papers (2025-08-29T21:23:48Z) - Large Language Models For Text Classification: Case Study And Comprehensive Review [0.3428444467046467]
We evaluate the performance of different Large Language Models (LLMs) in comparison with state-of-the-art deep-learning and machine-learning models.
Our work reveals significant variations in model responses based on the prompting strategies.
arXiv Detail & Related papers (2025-01-14T22:02:38Z) - Ranked from Within: Ranking Large Multimodal Models for Visual Question Answering Without Labels [64.94853276821992]
Large multimodal models (LMMs) are increasingly deployed across diverse applications. Traditional evaluation methods are largely dataset-centric, relying on fixed, labeled datasets and supervised metrics. We explore unsupervised model ranking for LMMs by leveraging their uncertainty signals, such as softmax probabilities.
arXiv Detail & Related papers (2024-12-09T13:05:43Z) - ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model [75.750699619993]
We propose ROSE, a Revolutionary Open-set dense SEgmentation LMM, which enables dense mask prediction and open-category generation. Our method treats each image patch as an independent region of interest candidate, enabling the model to predict both dense and sparse masks simultaneously.
arXiv Detail & Related papers (2024-11-29T07:00:18Z) - An Actionable Framework for Assessing Bias and Fairness in Large Language Model Use Cases [0.0]
Large language models (LLMs) can exhibit bias in a variety of ways.
We propose a decision framework that allows practitioners to determine which bias and fairness metrics to use for a specific use case.
arXiv Detail & Related papers (2024-07-15T16:04:44Z) - RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z) - Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined
Levels [95.44077384918725]
We propose to teach large multi-modality models (LMMs) with text-defined rating levels instead of scores.
The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA) and video quality assessment (VQA) tasks.
arXiv Detail & Related papers (2023-12-28T16:10:25Z) - See, Say, and Segment: Teaching LMMs to Overcome False Premises [67.36381001664635]
We propose a cascading and joint training approach for LMMs to solve this task.
Our resulting model can "see" by detecting whether objects are present in an image, "say" by telling the user if they are not, and finally "segment" by outputting the mask of the desired objects if they exist.
arXiv Detail & Related papers (2023-12-13T18:58:04Z) - Compositional Chain-of-Thought Prompting for Large Multimodal Models [46.721769077885966]
Compositional Chain-of-Thought (CCoT) is a novel zero-shot Chain-of-Thought prompting method.
We first generate a scene graph (SG) using the Large Language Model (LLM) and then use that SG in the prompt to produce a response.
We find that the proposed CCoT approach not only improves LMM performance but also improves the performance of several popular LMMs on general multimodal benchmarks.
arXiv Detail & Related papers (2023-11-27T22:23:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.