Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles
- URL: http://arxiv.org/abs/2509.08777v1
- Date: Wed, 10 Sep 2025 17:06:47 GMT
- Title: Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles
- Authors: Eric Slyman, Mehrab Tanjim, Kushal Kafle, Stefan Lee
- Abstract summary: Multimodal large language models (MLLMs) are increasingly used to evaluate text-to-image (TTI) generation systems. These "judge" models often suffer from biases, overconfidence, and inconsistent performance across diverse image domains. We propose a new multimodal-aware method called Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB).
- Score: 20.7718577645105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal large language models (MLLMs) are increasingly used to evaluate text-to-image (TTI) generation systems, providing automated judgments based on visual and textual context. However, these "judge" models often suffer from biases, overconfidence, and inconsistent performance across diverse image domains. While prompt ensembling has shown promise for mitigating these issues in unimodal, text-only settings, our experiments reveal that standard ensembling methods fail to generalize effectively for TTI tasks. To address these limitations, we propose a new multimodal-aware method called Multimodal Mixture-of-Bayesian Prompt Ensembles (MMB). Our method uses a Bayesian prompt ensemble approach augmented by image clustering, allowing the judge to dynamically assign prompt weights based on the visual characteristics of each sample. We show that MMB improves accuracy in pairwise preference judgments and greatly enhances calibration, making it easier to gauge the judge's true uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB outperforms existing baselines in alignment with human annotations and calibration across varied image content. Our findings highlight the importance of multimodal-specific strategies for judge calibration and suggest a promising path forward for reliable large-scale TTI evaluation.
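The abstract describes the mechanism only at a high level: prompts act as a Bayesian ensemble, images are clustered, and the judge weights prompts per cluster so its confidence tracks visual content. The sketch below illustrates one way such a judge could be organized; the class and method names, the k-means clustering over image embeddings, and the likelihood-based weight update are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a multimodal mixture-of-Bayesian-prompt-ensembles judge.
# The class/method names, the k-means clustering step, and the likelihood-based
# per-cluster weight update are assumptions for exposition, not the paper's code.
import numpy as np
from sklearn.cluster import KMeans


class MultimodalBayesianPromptEnsemble:
    def __init__(self, prompts, embed_image, judge_fn, n_clusters=8):
        self.prompts = prompts          # candidate judge prompts
        self.embed_image = embed_image  # sample -> image feature vector (e.g., CLIP)
        self.judge_fn = judge_fn        # (prompt, sample) -> P(candidate A preferred)
        self.kmeans = KMeans(n_clusters=n_clusters, n_init=10)
        self.log_w = None               # per-(cluster, prompt) unnormalized log-weights

    def fit(self, samples, labels):
        """Estimate per-cluster prompt weights from a labeled calibration set.

        labels[i] is 1 if humans preferred candidate A for samples[i], else 0.
        """
        feats = np.stack([self.embed_image(s) for s in samples])
        clusters = self.kmeans.fit_predict(feats)
        self.log_w = np.zeros((self.kmeans.n_clusters, len(self.prompts)))
        for c in range(self.kmeans.n_clusters):
            idx = np.where(clusters == c)[0]
            for j, prompt in enumerate(self.prompts):
                # Bayesian-style weight: log-likelihood of the human labels under
                # this prompt, restricted to samples falling in cluster c.
                ll = 0.0
                for i in idx:
                    p = np.clip(self.judge_fn(prompt, samples[i]), 1e-6, 1 - 1e-6)
                    ll += labels[i] * np.log(p) + (1 - labels[i]) * np.log(1 - p)
                self.log_w[c, j] = ll
        return self

    def judge_pair(self, sample):
        """Return a calibrated probability that candidate A is preferred."""
        c = int(self.kmeans.predict(self.embed_image(sample)[None, :])[0])
        w = np.exp(self.log_w[c] - self.log_w[c].max())
        w /= w.sum()                    # posterior-style weights for this image cluster
        probs = np.array([self.judge_fn(p, sample) for p in self.prompts])
        return float(w @ probs)         # weight-averaged pairwise judgment
```

Under this reading, calibration comes from estimating prompt weights per image cluster rather than globally, so a prompt that is reliable for one visual domain (e.g., photorealistic portraits) but not another (e.g., abstract art) does not dominate every judgment.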
Related papers
- Human-Aligned MLLM Judges for Fine-Grained Image Editing Evaluation: A Benchmark, Framework, and Analysis [95.89328387635176]
We introduce a fine-grained Multimodal Large Language Model (MLLM)-as-a-Judge framework for image editing. We present a new human-validated benchmark that integrates human judgments, MLLM-based evaluations, model outputs, and traditional metrics.
arXiv Detail & Related papers (2026-02-13T15:34:32Z) - Bi-Level Prompt Optimization for Multimodal LLM-as-a-Judge [21.61898421774144]
Large language models (LLMs) have become widely adopted as automated judges for evaluating AI-generated content. Despite their success, aligning LLM-based evaluations with human judgments remains challenging. We propose BLPO, a bi-level prompt optimization framework that converts images into textual representations while preserving evaluation-relevant visual cues.
arXiv Detail & Related papers (2026-02-11T20:22:13Z) - More Images, More Problems? A Controlled Analysis of VLM Failure Modes [80.64323947730905]
Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities, yet their proficiency in understanding and reasoning over multiple images remains largely unexplored. We introduce MIMIC, a new benchmark designed to rigorously evaluate the multi-image capabilities of LVLMs.
arXiv Detail & Related papers (2026-01-12T18:45:13Z) - Importance Sampling for Multi-Negative Multimodal Direct Preference Optimization [68.64764778089229]
We propose MISP-DPO, the first framework to incorporate multiple, semantically diverse negative images in multimodal DPO. Our method embeds prompts and candidate images in CLIP space and applies a sparse autoencoder to decompose semantic deviations into interpretable factors. Experiments across five benchmarks demonstrate that MISP-DPO consistently improves multimodal alignment over prior methods.
arXiv Detail & Related papers (2025-09-30T03:24:09Z) - Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates [37.65554922794508]
We introduce Multimodal Adversarial Compositionality (MAC) to generate deceptive text samples. We evaluate them through both sample-wise attack success rate and group-wise entropy-based diversity. Using smaller language models like Llama-3.1-8B, our approach demonstrates superior performance in revealing compositional vulnerabilities.
arXiv Detail & Related papers (2025-05-28T23:45:55Z) - Calibrating Uncertainty Quantification of Multi-Modal LLMs using Grounding [48.92310906093414]
We introduce a novel approach for calibrating uncertainty quantification (UQ) tailored for multi-modal large language models (LLMs). We leverage cross-modal consistency in addition to self-consistency to improve the calibration of multi-modal models (a minimal consistency-scoring sketch appears after this list). We evaluate the proposed approach across multiple multi-modal tasks, such as medical question answering (Slake) and visual question answering (VQAv2), considering multi-modal models such as LLaVA-Med and LLaVA.
arXiv Detail & Related papers (2025-04-30T19:19:21Z) - M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image Quality Assessment [65.3860007085689]
M3-AGIQA is a comprehensive framework that enables more human-aligned, holistic evaluation of AI-generated images.<n>By aligning model outputs more closely with human judgment, M3-AGIQA delivers robust and interpretable quality scores.
arXiv Detail & Related papers (2025-02-21T03:05:45Z) - Modality-Fair Preference Optimization for Trustworthy MLLM Alignment [22.093944381988496]
Multimodal large language models (MLLMs) have achieved remarkable success across various tasks. However, separate training of visual and textual encoders often results in a misalignment between the modalities, and the resulting inaccuracies severely undermine the trustworthiness of MLLMs in real-world applications.
arXiv Detail & Related papers (2024-10-20T08:56:52Z) - MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation? [59.7772329962047]
We introduce MJ-Bench, a novel benchmark which incorporates a comprehensive preference dataset to evaluate multimodal judges.
Specifically, we evaluate a large variety of multimodal judges including smaller-sized CLIP-based scoring models, open-source VLMs, and closed-source VLMs.
Experiments reveal that closed-source VLMs generally provide better feedback, with GPT-4o outperforming other judges on average.
arXiv Detail & Related papers (2024-07-05T20:03:16Z) - EVALALIGN: Supervised Fine-Tuning Multimodal LLMs with Human-Aligned Data for Evaluating Text-to-Image Models [16.18275805302776]
We propose EvalAlign, a metric characterized by its accuracy, stability, and fine granularity.
We develop evaluation protocols that focus on two key dimensions: image faithfulness and text-image alignment.
EvalAlign aligns more closely with human preferences than existing metrics, confirming its effectiveness and utility in model assessment.
arXiv Detail & Related papers (2024-06-24T11:56:15Z) - Debiasing Multimodal Large Language Models via Penalization of Language Priors [38.97645845493758]
Multimodal Large Language Models (MLLMs) have become indispensable tools in computer vision and natural language processing. Despite their advancements, our investigation reveals a noteworthy bias: the generated content is often driven more by the inherent priors of the underlying Large Language Models (LLMs) than by the input image. We propose two simple, training-free strategies to rectify these biases and redirect the model's focus toward visual information.
arXiv Detail & Related papers (2024-03-08T12:35:07Z) - Learning from Multi-Perception Features for Real-Word Image Super-resolution [87.71135803794519]
We propose a novel SR method called MPF-Net that leverages multiple perceptual features of input images.
Our method incorporates a Multi-Perception Feature Extraction (MPFE) module to extract diverse perceptual information.
We also introduce a contrastive regularization term (CR) that improves the model's learning capability.
arXiv Detail & Related papers (2023-05-26T07:35:49Z) - Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation [72.6667341525552]
We present a new MMT approach based on a strong text-only MT model, which uses neural adapters and a novel guided self-attention mechanism.
We also introduce CoMMuTE, a Contrastive Multimodal Translation Evaluation set of ambiguous sentences and their possible translations.
Our approach obtains competitive results compared to strong text-only models on standard English-to-French, English-to-German and English-to-Czech benchmarks.
arXiv Detail & Related papers (2022-12-20T10:18:18Z)
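The uncertainty-calibration entry above ("Calibrating Uncertainty Quantification of Multi-Modal LLMs using Grounding") is summarized in only one line; the sketch below illustrates the general idea of combining self-consistency with cross-modal consistency into a single confidence score. The `ask` and `describe` callables, the agreement-based scoring, and the mixing weight `alpha` are assumptions made for illustration, not the paper's exact recipe.

```python
# Illustrative sketch: blend self-consistency and cross-modal consistency into a
# confidence score for a multimodal QA model. The callables and the simple
# agreement averaging are assumptions, not the paper's method.
from collections import Counter


def consistency_confidence(ask, describe, question, image, n_samples=8, alpha=0.5):
    """ask(prompt, image) -> sampled answer string; describe(image) -> caption."""
    # Self-consistency: agreement among stochastic samples that see the image.
    grounded = [ask(question, image) for _ in range(n_samples)]
    answer, count = Counter(grounded).most_common(1)[0]
    self_consistency = count / n_samples

    # Cross-modal consistency: how often answers conditioned only on a textual
    # rendering (caption) of the image agree with the image-grounded majority.
    caption = describe(image)
    text_only = [ask(f"Context: {caption}\n{question}", None) for _ in range(n_samples)]
    cross_modal = sum(a == answer for a in text_only) / n_samples

    # Confidence is a convex combination of the two consistency signals.
    confidence = alpha * self_consistency + (1 - alpha) * cross_modal
    return answer, confidence
```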
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.