Large Multimodal Models as General In-Context Classifiers
- URL: http://arxiv.org/abs/2602.23229v1
- Date: Thu, 26 Feb 2026 17:08:18 GMT
- Title: Large Multimodal Models as General In-Context Classifiers
- Authors: Marco Garosi, Matteo Farina, Alessandro Conti, Massimiliano Mancini, Elisa Ricci
- Abstract summary: In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task.
- Score: 73.11242790834383
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification. In contrast, Large Multimodal Models (LMMs) are more suitable for complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters, their "in-context" equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task. In this challenging scenario, LMMs struggle whenever provided with imperfect context information. To address this issue, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples, iteratively refining them with the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, surpassing VLM counterparts and highlighting the potential of LMMs to serve as unified classifiers and as a flexible alternative to specialized models.
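The abstract describes CIRCLE only at a high level. A minimal sketch of the loop it outlines, assuming a hypothetical `lmm_classify(image, labels, context)` wrapper around any instruction-following LMM (the number of refinement rounds and the leave-one-out context are also assumptions), might look like this:

```python
# Minimal sketch of the iterative pseudo-labeling loop described in the
# abstract. `lmm_classify` is a hypothetical wrapper that prompts an LMM
# with a query image, a candidate label set, and labeled context examples,
# returning a predicted label; the refinement schedule is an assumption.

def circle_pseudo_label(unlabeled_images, label_space, lmm_classify, n_rounds=3):
    # Round 0: zero-shot pseudo-labels, since no context is available yet.
    pseudo_labels = {img: lmm_classify(img, label_space, [])
                     for img in unlabeled_images}

    # Later rounds: re-classify each image using the other images and
    # their current pseudo-labels as in-context examples.
    for _ in range(n_rounds):
        context = list(pseudo_labels.items())
        pseudo_labels = {
            img: lmm_classify(img, label_space,
                              [(x, y) for x, y in context if x is not img])
            for img in unlabeled_images
        }
    return pseudo_labels
```

The property matching the abstract is that refinement uses only the available context itself: no training, and no labels beyond the model's own earlier predictions.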
Related papers
- FewMMBench: A Benchmark for Multimodal Few-Shot Learning [17.747746608503114]
FewMMBench is a comprehensive benchmark designed to evaluate multimodal large language models (MLLMs) under few-shot conditions. We evaluate 26 open-weight MLLMs from six model families across zero-shot, few-shot, and CoT-augmented few-shot settings. Our findings reveal that instruction-tuned models exhibit strong zero-shot performance but benefit minimally, or even regress, with additional demonstrations or CoT reasoning.
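For illustration, the three evaluation settings named in the summary differ only in how the prompt is assembled. The sketch below uses an assumed OpenAI-style chat-message schema, not FewMMBench's actual harness:

```python
# Illustrative prompt builder for the three settings the summary names:
# zero-shot (no demos), few-shot (demos), and CoT-augmented few-shot
# (demos + a step-by-step suffix). The chat-message schema is an assumed
# OpenAI-style format, not FewMMBench's actual evaluation harness.

def build_messages(question, image_url, demos=(), use_cot=False):
    messages = []
    for demo_q, demo_img, demo_a in demos:  # few-shot demonstrations
        messages.append({"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": demo_img}},
            {"type": "text", "text": demo_q},
        ]})
        messages.append({"role": "assistant", "content": demo_a})
    suffix = "\nThink step by step before answering." if use_cot else ""
    messages.append({"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": image_url}},
        {"type": "text", "text": question + suffix},
    ]})
    return messages
```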
arXiv Detail & Related papers (2026-02-25T12:30:18Z)
- Group Fairness Meets the Black Box: Enabling Fair Algorithms on Closed LLMs via Post-Processing [14.622788745587815]
We propose a framework for deriving fair classifiers from closed-weight LLMs via prompting. Our framework is data-efficient and outperforms fair classifiers trained on LLM embeddings.
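The summary leaves the post-processing step abstract. A common instantiation of group-fairness post-processing on black-box scores is per-group thresholding, sketched below; this illustrates the general idea, not the paper's specific framework:

```python
import numpy as np

# Generic per-group threshold post-processing toward demographic parity on
# black-box scores (e.g., an LLM's probability of the positive class).
# Illustrates the general idea of fairness post-processing only; the
# paper's actual method may differ.

def fit_group_thresholds(scores, groups, target_rate):
    # Pick, per group, the score cutoff whose positive rate hits the target.
    thresholds = {}
    for g in np.unique(groups):
        s = np.sort(scores[groups == g])
        k = int((1 - target_rate) * len(s))  # index of the cutoff score
        thresholds[g] = s[min(k, len(s) - 1)]
    return thresholds

def predict(scores, groups, thresholds):
    return np.array([s >= thresholds[g] for s, g in zip(scores, groups)])
```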
arXiv Detail & Related papers (2025-08-15T06:50:29Z)
- On Large Multimodal Models as Open-World Image Classifiers [77.51330631977955]
Large Multimodal Models (LMMs) can classify images directly using natural language. We evaluate 13 models across 10 benchmarks, encompassing prototypical, non-prototypical, fine-grained, and very fine-grained classes.
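Scoring free-form class names against ground truth is the crux of this kind of evaluation. A minimal sketch using embedding similarity follows, where `ask_lmm`, the encoder checkpoint, and the threshold are assumptions rather than the paper's protocol:

```python
# Minimal sketch of open-world classification with an LMM: the model names
# the object in free-form text, and the answer is scored against the
# ground-truth label via embedding similarity. `ask_lmm`, the checkpoint,
# and the threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def open_world_correct(image, gold_label, ask_lmm, threshold=0.7):
    answer = ask_lmm(image, "What is the main object in this image? "
                            "Answer with a short class name.")
    sim = util.cos_sim(encoder.encode(answer, convert_to_tensor=True),
                       encoder.encode(gold_label, convert_to_tensor=True))
    return float(sim) >= threshold
```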
arXiv Detail & Related papers (2025-03-27T17:03:18Z)
- Ranked from Within: Ranking Large Multimodal Models Without Labels [73.96543593298426]
We show that uncertainty scores derived from softmax distributions provide a robust basis for ranking models across various tasks. This facilitates the ranking of LMMs on unlabeled data, providing a practical approach for selecting models for diverse target domains without requiring manual annotation.
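The summary does not fix the exact uncertainty score; mean maximum softmax probability is one common choice, sketched here:

```python
import numpy as np

# Rank models on unlabeled data by a confidence score derived from their
# softmax outputs. Mean maximum softmax probability (higher = more
# confident) is one common choice; the paper may use a different score.

def confidence_score(logits):
    # logits: (n_examples, n_classes) array of one model's outputs.
    z = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs.max(axis=1).mean()

def rank_models(model_logits):
    # model_logits: {model_name: (n_examples, n_classes) logits array}.
    return sorted(model_logits,
                  key=lambda m: confidence_score(model_logits[m]),
                  reverse=True)
```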
arXiv Detail & Related papers (2024-12-09T13:05:43Z)
- ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model [75.750699619993]
We propose ROSE, a Revolutionary Open-set dense SEgmentation LMM, which enables dense mask prediction and open-category generation. Our method treats each image patch as an independent region of interest candidate, enabling the model to predict both dense and sparse masks simultaneously.
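The patch-as-candidate idea can be pictured at the tensor level as unfolding the image into a grid of independent region proposals. This toy sketch shows the data layout only, not ROSE's actual architecture:

```python
import torch

# Toy illustration of the patch-as-candidate idea: unfold an image into a
# grid of patches, each treated as an independent region-of-interest
# candidate. Data layout only; not ROSE's architecture.

def image_to_patch_candidates(image, patch=16):
    # image: (C, H, W); returns (num_patches, C, patch, patch).
    c, h, w = image.shape
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c, patch, patch)

candidates = image_to_patch_candidates(torch.randn(3, 224, 224))
print(candidates.shape)  # torch.Size([196, 3, 16, 16])
```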
arXiv Detail & Related papers (2024-11-29T07:00:18Z)
- Enhancing Few-Shot Vision-Language Classification with Large Multimodal Model Features [79.45405711339322]
Generative Large Multimodal Models (LMMs) excel at a wide variety of vision-language (VL) tasks. Despite strong performance, LMMs' generative outputs are not specialized for vision-language classification tasks. We propose an approach that leverages multimodal feature extraction from the LMM's latent space.
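A sketch of the general recipe (pool a latent feature per input, then fit a lightweight probe on the few labeled examples), where `extract_hidden` is a hypothetical helper; the paper's exact layer and pooling choices may differ:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of using LMM latent features for few-shot classification: pool a
# hidden state per image-text input, then fit a lightweight probe on the
# few labeled examples. `extract_hidden` (which layer, which pooling) is a
# hypothetical helper, not the paper's specific feature extractor.

def few_shot_probe(support_inputs, support_labels, query_inputs, extract_hidden):
    X_support = np.stack([extract_hidden(x) for x in support_inputs])
    X_query = np.stack([extract_hidden(x) for x in query_inputs])
    probe = LogisticRegression(max_iter=1000).fit(X_support, support_labels)
    return probe.predict(X_query)
```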
arXiv Detail & Related papers (2024-11-28T18:55:41Z)
- Compositional Chain-of-Thought Prompting for Large Multimodal Models [46.721769077885966]
Compositional Chain-of-Thought (CCoT) is a novel zero-shot Chain-of-Thought prompting method.
We first generate a scene graph (SG) using the LMM and then use that SG in the prompt to produce a response.
We find that the proposed CCoT approach not only improves LMM performance on compositional benchmarks but also improves the performance of several popular LMMs on general multimodal benchmarks.
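The two-stage procedure can be sketched as two chat turns, where `chat(image, text)` is a hypothetical single-turn LMM interface and the exact prompt wording is an assumption:

```python
# Sketch of the two-stage CCoT procedure: first prompt the model to produce
# a scene graph (SG) for the image, then include that SG in a second prompt
# to answer the actual question. `chat(image, text)` is a hypothetical
# single-turn LMM interface; the prompt wording is an assumption.

def ccot_answer(image, question, chat):
    sg = chat(image,
              "Describe this image as a scene graph in JSON, listing the "
              "objects, their attributes, and the relationships between them.")
    return chat(image,
                f"Scene graph:\n{sg}\n\nUsing the scene graph as context, "
                f"answer the question: {question}")
```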
arXiv Detail & Related papers (2023-11-27T22:23:27Z)