Ranked from Within: Ranking Large Multimodal Models for Visual Question Answering Without Labels
- URL: http://arxiv.org/abs/2412.06461v1
- Date: Mon, 09 Dec 2024 13:05:43 GMT
- Title: Ranked from Within: Ranking Large Multimodal Models for Visual Question Answering Without Labels
- Authors: Weijie Tu, Weijian Deng, Dylan Campbell, Yu Yao, Jiyang Zheng, Tom Gedeon, Tongliang Liu
- Abstract summary: Large multimodal models (LMMs) are increasingly deployed across diverse applications. Traditional evaluation methods are largely dataset-centric, relying on fixed, labeled datasets and supervised metrics. We explore unsupervised model ranking for LMMs by leveraging their uncertainty signals, such as softmax probabilities.
- Score: 64.94853276821992
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large multimodal models (LMMs) are increasingly deployed across diverse applications, the need for adaptable, real-world model ranking has become paramount. Traditional evaluation methods are largely dataset-centric, relying on fixed, labeled datasets and supervised metrics, which are resource-intensive and may lack generalizability to novel scenarios, highlighting the importance of unsupervised ranking. In this work, we explore unsupervised model ranking for LMMs by leveraging their uncertainty signals, such as softmax probabilities. We evaluate state-of-the-art LMMs (e.g., LLaVA) across visual question answering benchmarks, analyzing how uncertainty-based metrics can reflect model performance. Our findings show that uncertainty scores derived from softmax distributions provide a robust, consistent basis for ranking models across varied tasks. This finding enables the ranking of LMMs on real-world, unlabeled data for visual question answering, providing a practical approach for selecting models across diverse domains without requiring manual annotation.
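As an illustration of the general idea (not the paper's exact procedure), the sketch below ranks candidate models on an unlabeled VQA set using a simple confidence score derived from each model's softmax distribution over answer candidates; the scoring function, aggregation, and all names are assumptions for exposition.

```python
# Illustrative sketch only: rank models on unlabeled questions by average
# maximum softmax probability. The paper's exact uncertainty signal and
# aggregation may differ; all names here are hypothetical.
import numpy as np

def confidence_score(softmax_probs: np.ndarray) -> float:
    """Mean maximum softmax probability over a set of questions.

    softmax_probs has shape (num_questions, num_answer_candidates);
    each row is a model's softmax distribution over its answer candidates.
    """
    return float(softmax_probs.max(axis=1).mean())

def rank_models(model_outputs: dict[str, np.ndarray]) -> list[tuple[str, float]]:
    """Rank models by descending confidence on the same unlabeled questions."""
    scores = {name: confidence_score(p) for name, p in model_outputs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Fake softmax outputs for two hypothetical models on 100 questions.
    fake = {"model_a": rng.dirichlet(np.full(4, 0.3), size=100),
            "model_b": rng.dirichlet(np.full(4, 1.5), size=100)}
    for name, score in rank_models(fake):
        print(f"{name}: {score:.3f}")
```

Other label-free uncertainty signals (e.g., negative predictive entropy) could be substituted for the maximum softmax probability; the key property is that only the models' own output distributions on unlabeled data are required.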
Related papers
- On Large Multimodal Models as Open-World Image Classifiers [71.78089106671581]
Large Multimodal Models (LMMs) can classify images directly using natural language.
We evaluate 13 models across 10 benchmarks, encompassing prototypical, non-prototypical, fine-grained, and very fine-grained classes.
arXiv Detail & Related papers (2025-03-27T17:03:18Z) - SCORE: Systematic COnsistency and Robustness Evaluation for Large Language Models [4.875712300661656]
We present SCORE (Systematic COnsistency and Robustness Evaluation), a comprehensive framework for non-adversarial evaluation of Large Language Models.
The SCORE framework evaluates models by repeatedly testing them on the same benchmarks in various setups to give a realistic estimate of their accuracy and consistency.
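As a rough, hypothetical illustration of this kind of repeated evaluation (not SCORE's actual protocol), one could report the accuracy spread across setups together with the rate at which a model answers the same question identically under different setups:

```python
# Hypothetical sketch of repeated-setup evaluation: accuracy range across
# setups plus pairwise answer agreement between setups. Not SCORE's API.
from itertools import combinations

def evaluate_consistency(answers_by_setup: dict[str, list[str]],
                         gold: list[str]) -> dict[str, float]:
    accs = [sum(a == g for a, g in zip(answers, gold)) / len(gold)
            for answers in answers_by_setup.values()]
    agreements = [sum(x == y for x, y in zip(a1, a2)) / len(a1)
                  for (_, a1), (_, a2) in combinations(answers_by_setup.items(), 2)]
    return {"mean_accuracy": sum(accs) / len(accs),
            "accuracy_range": max(accs) - min(accs),
            "consistency": sum(agreements) / len(agreements)}

print(evaluate_consistency(
    {"prompt_v1": ["A", "B", "C"], "prompt_v2": ["A", "B", "D"]},
    gold=["A", "B", "C"]))
```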
arXiv Detail & Related papers (2025-02-28T19:27:29Z) - PairBench: A Systematic Framework for Selecting Reliable Judge VLMs [16.49586486795478]
We present PairBench, a framework that systematically evaluates large vision language models (VLMs) as customizable similarity tools.
Through PairBench, we introduce four metrics that represent key desiderata of similarity scores.
Our analysis demonstrates that no model, whether closed- or open-source, is superior on all metrics.
arXiv Detail & Related papers (2025-02-21T04:53:11Z) - REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark [16.55516587540082]
We introduce REAL-MM-RAG, an automatically generated benchmark designed to address four key properties essential for real-world retrieval.
We propose a multi-difficulty-level scheme based on query rephrasing to evaluate models' semantic understanding beyond keyword matching.
Our benchmark reveals significant model weaknesses, particularly in handling table-heavy documents and robustness to query rephrasing.
arXiv Detail & Related papers (2025-02-17T22:10:47Z) - LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
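One simple explanation-based confidence estimate in this spirit (a hedged sketch, not necessarily the paper's estimator) samples several explanation-answer pairs and uses the agreement among the resulting answers as the confidence:

```python
# Hypothetical sketch: confidence from agreement among answers produced under
# independently sampled explanations. `generate_explanation_and_answer` is a
# stand-in for an LLM call returning an (explanation, answer) pair; it is not
# a real API.
from collections import Counter

def explanation_agreement_confidence(question, generate_explanation_and_answer,
                                     num_samples=8):
    answers = [generate_explanation_and_answer(question)[1]
               for _ in range(num_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / num_samples  # majority answer and its support
```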
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs).
We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence.
Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z) - Learning Rich Rankings [7.940293148084844]
We develop a contextual repeated selection (CRS) model to bring a natural multimodality and richness to the rankings space.
We provide theoretical guarantees for maximum likelihood estimation under the model through structure-dependent tail risk and expected risk bounds.
We also furnish the first tight bounds on the expected risk of maximum likelihood estimators for the multinomial logit (MNL) choice model and the Plackett-Luce (PL) ranking model.
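For reference, the Plackett-Luce model mentioned here assigns a probability to a complete ranking by choosing items one at a time in proportion to their exponentiated scores (standard formulation, stated for context):

```latex
% Plackett-Luce probability of a ranking \sigma = (\sigma_1, \dots, \sigma_n)
% of n items with scores \theta_1, \dots, \theta_n.
\[
  P(\sigma \mid \theta) \;=\; \prod_{i=1}^{n}
    \frac{\exp(\theta_{\sigma_i})}{\sum_{j=i}^{n} \exp(\theta_{\sigma_j})}
\]
```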
arXiv Detail & Related papers (2023-12-22T21:40:57Z) - MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [153.37868034779385]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z) - Rethinking Uncertainly Missing and Ambiguous Visual Modality in Multi-Modal Entity Alignment [38.574204922793626]
We present a further analysis of visual modality incompleteness, benchmarking the latest MMEA models on our proposed dataset MMEA-UMVM.
Our research indicates that, in the face of modality incompleteness, models succumb to overfitting the modality noise, and exhibit performance oscillations or declines at high rates of missing modality.
We introduce UMAEA, a robust multi-modal entity alignment approach designed to tackle uncertainly missing and ambiguous visual modalities.
arXiv Detail & Related papers (2023-07-30T12:16:49Z) - Feature Likelihood Divergence: Evaluating the Generalization of Generative Models Using Samples [25.657798631897908]
Feature Likelihood Divergence provides a comprehensive trichotomic evaluation of generative models.
We empirically demonstrate the ability of FLD to identify overfitting problem cases, even when previously proposed metrics fail.
arXiv Detail & Related papers (2023-02-09T04:57:27Z)