MPCI-Bench: A Benchmark for Multimodal Pairwise Contextual Integrity Evaluation of Language Model Agents
- URL: http://arxiv.org/abs/2601.08235v2
- Date: Wed, 14 Jan 2026 05:26:19 GMT
- Title: MPCI-Bench: A Benchmark for Multimodal Pairwise Contextual Integrity Evaluation of Language Model Agents
- Authors: Shouju Wang, Haopeng Zhang
- Abstract summary: We introduce MPCI-Bench, the first Multimodal Pairwise Contextual Integrity benchmark for evaluating privacy behavior in agentic settings. MPCI-Bench consists of paired positive and negative instances derived from the same visual source. We will open-source MPCI-Bench to facilitate future research on agentic CI.
- Score: 1.919885803437747
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As language-model agents evolve from passive chatbots into proactive assistants that handle personal data, evaluating their adherence to social norms becomes increasingly critical, often through the lens of Contextual Integrity (CI). However, existing CI benchmarks are largely text-centric and primarily emphasize negative refusal scenarios, overlooking multimodal privacy risks and the fundamental trade-off between privacy and utility. In this paper, we introduce MPCI-Bench, the first Multimodal Pairwise Contextual Integrity benchmark for evaluating privacy behavior in agentic settings. MPCI-Bench consists of paired positive and negative instances derived from the same visual source and instantiated across three tiers: normative Seed judgments, context-rich Story reasoning, and executable agent action Traces. Data quality is ensured through a Tri-Principle Iterative Refinement pipeline. Evaluations of state-of-the-art multimodal models reveal systematic failures to balance privacy and utility and a pronounced modality leakage gap, where sensitive visual information is leaked more frequently than textual information. We will open-source MPCI-Bench to facilitate future research on agentic CI.
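The abstract describes the benchmark's core structure (positive/negative pairs built from a shared visual source and instantiated at the Seed, Story, and Trace tiers) and a "modality leakage gap" between visual and textual information. The sketch below shows one plausible way to represent such a pair and compute that gap; the `MPCIPair` class, its field names, and the example counts are illustrative assumptions, not the released MPCI-Bench schema or reported results.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema for one MPCI-Bench pair; field names are illustrative
# assumptions, not the released dataset format.
@dataclass
class MPCIPair:
    image_id: str          # shared visual source for both instances
    tier: str              # "seed", "story", or "trace"
    positive_request: str  # sharing the information is contextually appropriate (utility case)
    negative_request: str  # sharing would violate contextual integrity (privacy case)
    sensitive_visual: List[str] = field(default_factory=list)   # attributes visible only in the image
    sensitive_textual: List[str] = field(default_factory=list)  # attributes stated in accompanying text


def modality_leakage_gap(visual_leaks: int, visual_total: int,
                         textual_leaks: int, textual_total: int) -> float:
    """Difference between the visual and textual leak rates.

    A positive gap corresponds to the pattern reported in the paper:
    sensitive visual information is leaked more often than textual information.
    """
    return visual_leaks / visual_total - textual_leaks / textual_total


# Illustrative counts only (not results from the paper):
# 42 of 100 visual attributes leaked vs. 18 of 100 textual attributes.
print(f"modality leakage gap: {modality_leakage_gap(42, 100, 18, 100):.2f}")  # 0.24
```

Because each pair shares the same visual source, a model can be scored jointly on over-refusal (failing the positive, utility-preserving case) and on leakage (failing the negative, privacy-violating case), which is the privacy-utility trade-off the paper emphasizes.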
Related papers
- Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval [60.25608870901428]
Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source robustness.
arXiv Detail & Related papers (2026-03-05T18:42:51Z) - Multimodal Fact-Level Attribution for Verifiable Reasoning [80.60864342985748]
Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation. Existing multimodal grounding benchmarks and evaluation methods fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt, a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation.
arXiv Detail & Related papers (2026-02-12T03:10:02Z) - MultiPriv: Benchmarking Individual-Level Privacy Reasoning in Vision-Language Models [14.942122955210436]
Modern Vision-Language Models (VLMs) demonstrate sophisticated reasoning, escalating privacy risks. Current privacy benchmarks are structurally insufficient for this new threat. We propose MultiPriv, the first benchmark designed to systematically evaluate individual-level privacy reasoning.
arXiv Detail & Related papers (2025-11-21T04:33:11Z) - Auditing M-LLMs for Privacy Risks: A Synthetic Benchmark and Evaluation Framework [7.493288948235459]
PRISM is a large-scale synthetic benchmark designed to evaluate cross-modal privacy risks. PRISM is the first multi-modal, multi-dimensional, and fine-grained synthesized dataset. We evaluate the inference capabilities of six leading M-LLMs on PRISM.
arXiv Detail & Related papers (2025-11-05T07:23:21Z) - RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering [50.42577862494645]
We present RAG-IGBench, a benchmark designed to evaluate the task of Interleaved Generation based on Retrieval-Augmented Generation (RAG-IG) in open-domain question answering. RAG-IG integrates multimodal large language models (MLLMs) with retrieval mechanisms, enabling the models to access external image-text information for generating coherent multimodal content.
arXiv Detail & Related papers (2025-10-11T03:06:39Z) - Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional [40.11148315577635]
We present a large-scale empirical study to quantify dependencies across 23 visual question-answering benchmarks using multi-modal large language models (MLLMs). Our findings show that the reliance on vision, question (text), and their interaction varies significantly, both across and within benchmarks. We discover that numerous benchmarks intended to mitigate text-only biases have inadvertently amplified image-only dependencies. This characterization persists across model sizes, as larger models often use these intra-modality dependencies to achieve high performance that masks an underlying lack of multi-modal reasoning.
arXiv Detail & Related papers (2025-09-27T21:13:29Z) - MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents [78.3863007028688]
MM-BrowseComp is a novel benchmark comprising 224 challenging, hand-crafted questions. These questions often incorporate images in prompts, and crucial information encountered during the search and reasoning process may also be embedded within images or videos on webpages. Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp reveals that even top models like OpenAI o3 with tools achieve only 29.02% accuracy.
arXiv Detail & Related papers (2025-08-14T13:46:47Z) - METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark [48.78602579128459]
We introduce METER, a unified benchmark for interpretable forgery detection spanning images, videos, audio, and audio-visual content. Our dataset comprises four tracks, each requiring not only real-vs-fake classification but also evidence-chain-based explanations.
arXiv Detail & Related papers (2025-07-22T03:42:51Z) - Understanding and Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding [59.50808215134678]
This study introduces Trust-videoLLMs, the first comprehensive benchmark evaluating 23 state-of-the-art videoLLMs. Results reveal significant limitations in dynamic scene comprehension, cross-modal resilience, and real-world risk mitigation.
arXiv Detail & Related papers (2025-06-14T04:04:54Z) - EVADE: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications [24.832537917472894]
EVADE is the first expert-curated, Chinese, multimodal benchmark designed to evaluate foundation models on evasive content detection in e-commerce. The dataset contains 2,833 annotated text samples and 13,961 images spanning six demanding product categories.
arXiv Detail & Related papers (2025-05-23T09:18:01Z) - MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models [51.19622266249408]
MultiTrust is the first comprehensive and unified benchmark on the trustworthiness of MLLMs. Our benchmark employs a rigorous evaluation strategy that addresses both multimodal risks and cross-modal impacts. Extensive experiments with 21 modern MLLMs reveal some previously unexplored trustworthiness issues and risks.
arXiv Detail & Related papers (2024-06-11T08:38:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.