Lost in Modality: Evaluating the Effectiveness of Text-Based Membership Inference Attacks on Large Multimodal Models
- URL: http://arxiv.org/abs/2512.03121v1
- Date: Tue, 02 Dec 2025 14:11:51 GMT
- Title: Lost in Modality: Evaluating the Effectiveness of Text-Based Membership Inference Attacks on Large Multimodal Models
- Authors: Ziyi Tong, Feifei Sun, Le Minh Nguyen
- Abstract summary: Logit-based membership inference attacks (MIAs) have become a widely adopted approach for assessing data exposure in large language models (LLMs). We present the first comprehensive evaluation of extending these text-based MIA methods to multimodal settings.
- Score: 3.9448289587779404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Multimodal Language Models (MLLMs) are emerging as one of the foundational tools in an expanding range of applications. Consequently, understanding training-data leakage in these systems is increasingly critical. Log-probability-based membership inference attacks (MIAs) have become a widely adopted approach for assessing data exposure in large language models (LLMs), yet their effectiveness in MLLMs remains unclear. We present the first comprehensive evaluation of extending these text-based MIA methods to multimodal settings. Our experiments under vision-and-text (V+T) and text-only (T-only) conditions across the DeepSeek-VL and InternVL model families show that in in-distribution settings, logit-based MIAs perform comparably across configurations, with a slight V+T advantage. Conversely, in out-of-distribution settings, visual inputs act as regularizers, effectively masking membership signals.
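
To make the attack family concrete, below is a minimal sketch of two standard log-probability MIA scores (the LOSS attack and Min-K% prob), assuming a HuggingFace causal LM. The model name, the k value, and the helper names are illustrative placeholders; the paper's exact attack configurations and its multimodal (V+T) scoring path are not specified in this listing.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute any causal (M)LLM backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def token_log_probs(text: str) -> torch.Tensor:
    """Per-token log-probabilities of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits[:, :-1]              # position t predicts token t+1
    targets = ids[:, 1:]
    log_probs = torch.log_softmax(logits, dim=-1)
    return log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]

def loss_attack_score(text: str) -> float:
    """LOSS attack: mean token log-likelihood (higher = more member-like)."""
    return token_log_probs(text).mean().item()

def min_k_prob_score(text: str, k: float = 0.2) -> float:
    """Min-K% prob: mean log-likelihood of the k least likely tokens."""
    lp = token_log_probs(text)
    n = max(1, int(k * lp.numel()))
    return lp.topk(n, largest=False).values.mean().item()
```

A threshold or ROC analysis over such scores then separates suspected members from non-members; in a V+T condition, the same text-token scoring would presumably be applied after the image is passed through the MLLM's vision encoder.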
Related papers
- Membership Inference on LLMs in the Wild [7.333405847597631]
Membership Inference Attacks (MIAs) act as a crucial auditing tool for the opaque training data of Large Language Models (LLMs). We propose SimMIA, a robust MIA framework tailored for this text-only regime by leveraging an advanced sampling strategy and scoring mechanism. We present WikiMIA-25, a new benchmark curated to evaluate MIA performance on modern proprietary LLMs.
arXiv Detail & Related papers (2026-01-16T14:10:46Z)
- PlaM: Training-Free Plateau-Guided Model Merging for Better Visual Grounding in MLLMs [59.78917775399492]
Multimodal instruction fine-tuning paradoxically degrades visual grounding capability. We propose a training-free framework to mitigate this degradation.
arXiv Detail & Related papers (2026-01-12T15:27:51Z)
- Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models [53.06230963851451]
We introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs.
arXiv Detail & Related papers (2025-12-17T19:01:34Z)
- OpenLVLM-MIA: A Controlled Benchmark Revealing the Limits of Membership Inference Attacks on Large Vision-Language Models [8.88331104584743]
OpenLVLM-MIA is a new benchmark that highlights fundamental challenges in evaluating membership inference attacks (MIA) against large vision-language models (LVLMs). We introduce a controlled benchmark of 6,000 images where the distributions of member and non-member samples are carefully balanced, and ground-truth membership labels are provided across three distinct training stages. Experiments using OpenLVLM-MIA demonstrated that the performance of state-of-the-art MIA methods converged to random chance under unbiased conditions (ROC AUC near 0.5; see the evaluation sketch after this list).
arXiv Detail & Related papers (2025-10-18T01:39:28Z)
- Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs [78.09559830840595]
We present the first systematic study on quantizing diffusion-based language models. We identify the presence of activation outliers, characterized by abnormally large activation values. We implement state-of-the-art PTQ methods and conduct a comprehensive evaluation.
arXiv Detail & Related papers (2025-08-20T17:59:51Z)
- Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models [19.361686225381447]
Vision-language models (VLMs) are widely assumed to exhibit in-context learning (ICL). We propose a new MM-ICL with Reasoning pipeline that augments each demonstration with a generated rationale alongside the answer.
arXiv Detail & Related papers (2025-06-09T16:55:32Z)
- MLLM-CL: Continual Learning for Multimodal Large Language Models [39.19456474036905]
We introduce MLLM-CL, a novel benchmark encompassing domain and ability continual learning. We propose preventing catastrophic interference through parameter isolation and an MLLM-based routing mechanism. Our approach can integrate domain-specific knowledge and functional abilities with minimal forgetting, significantly outperforming existing methods.
arXiv Detail & Related papers (2025-06-05T17:58:13Z)
- Distilling Transitional Pattern to Large Language Models for Multimodal Session-based Recommendation [67.84581846180458]
Session-based recommendation (SBR) predicts the next item based on anonymous sessions. Recent multimodal SBR methods utilize simplistic pre-trained models for modality learning but have limitations in semantic richness. We propose a multimodal LLM-enhanced framework, TPAD, which extends a distillation paradigm to decouple and align transitional patterns for promoting MSBR.
arXiv Detail & Related papers (2025-04-13T07:49:08Z)
- FedMLLM: Federated Fine-tuning MLLM on Multimodal Heterogeneity Data [56.08867996209236]
Fine-tuning Multimodal Large Language Models (MLLMs) with Federated Learning (FL) allows for expanding the training data scope by including private data sources. We introduce a benchmark to evaluate the performance of federated fine-tuning of MLLMs across various multimodal heterogeneous scenarios. We develop a general FedMLLM framework that integrates classic FL methods alongside two modality-agnostic strategies.
arXiv Detail & Related papers (2024-11-22T04:09:23Z)
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, showing their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- Towards Robust Multimodal Sentiment Analysis with Incomplete Data [20.75292807497547]
We present an innovative Language-dominated Noise-resistant Learning Network (LNLN) to achieve robust Multimodal Sentiment Analysis (MSA). LNLN features a dominant modality correction (DMC) module and a dominant-modality-based multimodal learning (DMML) module, which enhance the model's robustness across various noise scenarios.
arXiv Detail & Related papers (2024-09-30T07:14:31Z)
- Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning [15.919493497867567]
This study aims to evaluate the performance of Multimodal Large Language Models (MLLMs) on the VALSE benchmark.
We conducted a comprehensive assessment of state-of-the-art MLLMs, varying in model size and pretraining datasets.
arXiv Detail & Related papers (2024-07-17T11:26:47Z)
- Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks. MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z)
- Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning [104.58874584354787]
In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning.
This study aims to examine the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as latent variable models.
arXiv Detail & Related papers (2023-01-27T18:59:01Z)
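
As a complement to the scoring sketch above, the following hypothetical snippet shows how MIA performance is typically evaluated: ROC AUC over attack scores for known members versus non-members, where an AUC near 0.5 is the random-chance level that OpenLVLM-MIA reports under unbiased conditions. The score values below are fabricated placeholders for illustration only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Placeholder attack scores (e.g., from loss_attack_score above);
# real evaluations use thousands of member/non-member samples.
member_scores = np.array([-2.1, -1.8, -2.4, -1.9])
nonmember_scores = np.array([-2.0, -2.3, -1.7, -2.2])

labels = np.concatenate([np.ones_like(member_scores), np.zeros_like(nonmember_scores)])
scores = np.concatenate([member_scores, nonmember_scores])
print(f"MIA ROC AUC: {roc_auc_score(labels, scores):.3f}")  # ~0.5 => no usable membership signal
```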