MemeArena: Automating Context-Aware Unbiased Evaluation of Harmfulness Understanding for Multimodal Large Language Models
- URL: http://arxiv.org/abs/2510.27196v1
- Date: Fri, 31 Oct 2025 05:39:03 GMT
- Title: MemeArena: Automating Context-Aware Unbiased Evaluation of Harmfulness Understanding for Multimodal Large Language Models
- Authors: Zixin Chen, Hongzhan Lin, Kaixin Li, Ziyang Luo, Yayue Deng, Jing Ma
- Abstract summary: The proliferation of memes on social media necessitates the capabilities of multimodal Large Language Models (mLLMs) to understand multimodal harmfulness. Existing evaluation approaches predominantly focus on mLLMs' detection accuracy for binary classification tasks. We propose MemeArena, an agent-based arena-style evaluation framework that provides a context-aware and unbiased assessment for mLLMs' understanding of multimodal harmfulness.
- Score: 25.461441362074257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The proliferation of memes on social media necessitates the capabilities of multimodal Large Language Models (mLLMs) to effectively understand multimodal harmfulness. Existing evaluation approaches predominantly focus on mLLMs' detection accuracy for binary classification tasks, which often fail to reflect the in-depth interpretive nuance of harmfulness across diverse contexts. In this paper, we propose MemeArena, an agent-based arena-style evaluation framework that provides a context-aware and unbiased assessment for mLLMs' understanding of multimodal harmfulness. Specifically, MemeArena simulates diverse interpretive contexts to formulate evaluation tasks that elicit perspective-specific analyses from mLLMs. By integrating varied viewpoints and reaching consensus among evaluators, it enables fair and unbiased comparisons of mLLMs' abilities to interpret multimodal harmfulness. Extensive experiments demonstrate that our framework effectively reduces the evaluation biases of judge agents, with judgment results closely aligning with human preferences, offering valuable insights into reliable and comprehensive mLLM evaluations in multimodal harmfulness understanding. Our code and data are publicly available at https://github.com/Lbotirx/MemeArena.
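For illustration only, the sketch below shows one way an arena-style, consensus-based comparison of two mLLMs' harmfulness analyses could be wired up, assuming hypothetical callables for the candidate models and judge agents. The prompt template, consensus rule, and all names are assumptions for illustration, not the released MemeArena implementation; see the repository linked above for the actual code.

```python
# Hypothetical sketch of an arena-style pairwise evaluation with judge consensus.
# All names and the majority-vote rule are illustrative assumptions, not the
# released MemeArena code (see the repository linked in the abstract).
from collections import Counter
from typing import Callable, Dict, List

def arena_compare(
    meme: Dict[str, str],            # e.g. {"image_path": ..., "text": ...}; image handling omitted here
    contexts: List[str],             # simulated interpretive contexts (perspectives)
    model_a: Callable[[str], str],   # candidate mLLM A: evaluation task -> analysis
    model_b: Callable[[str], str],   # candidate mLLM B: evaluation task -> analysis
    judges: List[Callable[[str, str, str], str]],  # judge agents: (task, a, b) -> "A" / "B" / "tie"
) -> str:
    """Return which candidate model's harmfulness analyses the judge panel prefers overall."""
    votes: Counter = Counter()
    for ctx in contexts:
        # Formulate a context-aware evaluation task for this interpretive perspective.
        task = (
            f"Interpret the harmfulness of this meme from the perspective of: {ctx}\n"
            f"Meme text: {meme['text']}"
        )
        analysis_a, analysis_b = model_a(task), model_b(task)
        # Each judge compares the two analyses; votes are aggregated across judges
        # and contexts so that no single judge's bias dominates the outcome.
        for judge in judges:
            votes[judge(task, analysis_a, analysis_b)] += 1
    if votes["A"] == votes["B"]:
        return "tie"
    return "A" if votes["A"] > votes["B"] else "B"
```

In this sketch, aggregating votes across several simulated interpretive contexts and multiple judges stands in for the paper's bias-reduction step; the framework's actual task formulation and consensus mechanism are described in the paper and its released code.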
Related papers
- Multimodal Fact-Level Attribution for Verifiable Reasoning [80.60864342985748]
Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation. Existing multimodal grounding benchmarks and evaluation methods fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt, a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation.
arXiv Detail & Related papers (2026-02-12T03:10:02Z)
- CMR-SPB: Cross-Modal Multi-Hop Reasoning over Text, Image, and Speech with Path Balance [10.843417240658992]
Cross-modal multi-hop reasoning (CMR) is a valuable yet underexplored capability of multimodal large language models (MLLMs). We argue that existing benchmarks for evaluating this ability have critical shortcomings. We introduce a novel benchmark -- Cross-Modal Multi-Hop Reasoning over Text, Image and Speech with Path Balance (CMR-SPB).
arXiv Detail & Related papers (2025-08-22T08:17:31Z)
- AdamMeme: Adaptively Probe the Reasoning Capacity of Multimodal Large Language Models on Harmfulness [16.4111250168657]
Multimodal Large Language Models (mLLMs) must effectively understand meme harmfulness. Existing benchmarks for assessing mLLMs on harmful meme understanding rely on accuracy-based, model-agnostic evaluations using static datasets. We propose AdamMeme, a flexible, agent-based evaluation framework that adaptively probes the reasoning capabilities of mLLMs in deciphering meme harmfulness.
arXiv Detail & Related papers (2025-07-02T13:32:30Z)
- HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation [39.7293877954587]
HiMATE is a Hierarchical Multi-Agent Framework for Machine Translation Evaluation. We develop a hierarchical multi-agent system grounded in the MQM error typology, enabling granular evaluation of subtype errors. Empirically, HiMATE outperforms competitive baselines across different datasets in conducting human-aligned evaluations.
arXiv Detail & Related papers (2025-05-22T06:24:08Z)
- Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation [20.639904433330162]
Generation capabilities and language coverage of multilingual large language models (mLLMs) are advancing rapidly. However, evaluation practices for mLLMs still lack comprehensiveness, scientific rigor, and consistent adoption across research labs. We draw parallels with machine translation (MT) evaluation, a field that faced similar challenges and has, over decades, developed transparent reporting standards. We distill these insights into a checklist of actionable recommendations for mLLM research and development.
arXiv Detail & Related papers (2025-04-16T07:38:19Z)
- Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM [53.05486269607166]
Multimodal large language models (MLLMs) have significantly enhanced performance across benchmarks. Existing detection methods for unimodal large language models (LLMs) are inadequate for MLLMs due to multimodal data complexity and multi-phase training. We analyze multimodal data contamination using our analytical framework, MM-Detect, which defines two contamination categories: unimodal and cross-modal.
arXiv Detail & Related papers (2024-11-06T10:44:15Z)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models [51.19622266249408]
MultiTrust is the first comprehensive and unified benchmark on the trustworthiness of MLLMs. Our benchmark employs a rigorous evaluation strategy that addresses both multimodal risks and cross-modal impacts. Extensive experiments with 21 modern MLLMs reveal some previously unexplored trustworthiness issues and risks.
arXiv Detail & Related papers (2024-06-11T08:38:13Z)
- MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks [56.60050181186531]
We introduce MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions.
Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights.
arXiv Detail & Related papers (2023-10-13T11:57:04Z)
- MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models.
MMBench is meticulously curated with well-designed quality control schemes.
MMBench incorporates multiple-choice questions in both English and Chinese versions.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)