Reasoning Multimodal Large Language Model: Data Contamination and Dynamic Evaluation
- URL: http://arxiv.org/abs/2506.07202v1
- Date: Sun, 08 Jun 2025 15:52:38 GMT
- Title: Reasoning Multimodal Large Language Model: Data Contamination and Dynamic Evaluation
- Authors: Ming Liu, Wensheng Zhang
- Abstract summary: Multimodal Large Language Models (MLLMs) show impressive vision-language benchmark performance, yet growing concerns about data contamination risk masking true generalization. We propose a novel dynamic evaluation framework to rigorously assess MLLM generalization, moving beyond static benchmarks. We demonstrate that fine-tuning on simulated test data (extreme contamination) drastically sharpens task-specific performance but harms overall generalization.
- Score: 9.434966074326056
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) show impressive vision-language benchmark performance, yet growing concerns about data contamination (test set exposure during training) risk masking true generalization. This concern extends to reasoning MLLMs, often fine-tuned via reinforcement learning from potentially contaminated base models. We propose a novel dynamic evaluation framework to rigorously assess MLLM generalization, moving beyond static benchmarks. Instead of perturbing inputs, we perturb the task itself. Using the same visual input, models are evaluated across a family of tasks (e.g., QA, captioning, question posing, verification) to probe diverse capabilities. This task perturbation reveals whether model performance is robust or reliant on superficial task-specific cues. Our approach is analogous to loss landscape sharpness: models overfit or contaminated for a single task (sharp minima) falter under task shifts, unlike models with generalizable solutions (flatter minima). We developed an automated pipeline with a calibrated judge scoring open-ended generations (captions, questions) using paraphrase and corruption sampling. Applying this framework to leading image/video MLLMs on benchmarks including MME, RealWorldQA, and CVRR-ES, we analyze each model's cross-task "ability vector." We demonstrate that fine-tuning on simulated test data (extreme contamination) drastically sharpens task-specific performance but harms overall generalization. Our dynamic task perturbation offers deeper insights into MLLM generalization, distinguishing genuine understanding from spurious leakage or overfitting.
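The cross-task "ability vector" and the sharpness analogy in the abstract can be sketched numerically. The following is an illustrative sketch only, not the authors' pipeline: the task names come from the abstract, but the scores and the standard-deviation spread statistic are stand-ins assumed for demonstration.

```python
# Illustrative sketch (not the paper's actual code): given per-task scores for
# one model on the same visual inputs, form a cross-task "ability vector" and
# a simple spread statistic. A model that is sharp (overfit or contaminated on
# one task) shows a high spread under task perturbation; a flatter,
# generalizing model shows a low one.
from statistics import pstdev

# Task family named in the abstract: QA, captioning, question posing, verification.
TASKS = ["qa", "captioning", "question_posing", "verification"]

def ability_vector(scores_by_task: dict[str, float]) -> list[float]:
    """Order per-task scores into a fixed-length ability vector."""
    return [scores_by_task[t] for t in TASKS]

def sharpness(vector: list[float]) -> float:
    """Spread of cross-task scores; a crude proxy for 'sharp minima' behavior."""
    return pstdev(vector)

# Hypothetical contaminated model: strong on QA, weak once the task shifts.
contaminated = ability_vector({"qa": 0.95, "captioning": 0.40,
                               "question_posing": 0.35, "verification": 0.45})
# Hypothetical generalizing model: moderate but stable across the task family.
general = ability_vector({"qa": 0.72, "captioning": 0.70,
                          "question_posing": 0.68, "verification": 0.71})

assert sharpness(contaminated) > sharpness(general)
```

The single spread number is only one possible read-out; the paper analyzes the full ability vector per model rather than collapsing it to a scalar.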
Related papers
- When Prompts Go Wrong: Evaluating Code Model Robustness to Ambiguous, Contradictory, and Incomplete Task Descriptions [23.5858385520752]
Large Language Models (LLMs) have demonstrated impressive performance in code generation tasks under idealized conditions. In practice, task descriptions frequently exhibit ambiguity, incompleteness, or internal contradictions. We present the first empirical study examining the robustness of state-of-the-art code generation models when faced with such unclear task descriptions.
arXiv Detail & Related papers (2025-07-27T23:16:14Z) - LLM Performance for Code Generation on Noisy Tasks [0.41942958779358674]
We show that large language models (LLMs) can solve tasks obfuscated to a level where the text would be unintelligible to human readers. We report empirical evidence of distinct performance decay patterns between contaminated and unseen datasets. We propose measuring the decay of performance under obfuscation as a possible strategy for detecting dataset contamination.
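The contamination signal this abstract describes can be illustrated with a toy decay comparison. The curves, numbers, and the max-step-drop statistic below are assumptions for illustration, not taken from the paper:

```python
# Hedged sketch of the idea above (names and numbers are mine, not the
# paper's): if a benchmark was memorized, accuracy stays high while task
# descriptions are obfuscated, then collapses once the memorized surface cues
# vanish; truly unseen data tends to decay more smoothly. Comparing decay
# curves is the proposed contamination signal.

def decay(scores: list[float]) -> list[float]:
    """Per-step performance drop along increasing obfuscation levels."""
    return [a - b for a, b in zip(scores, scores[1:])]

def max_step_drop(scores: list[float]) -> float:
    """Largest single-step drop: a crude 'cliff' detector."""
    return max(decay(scores))

# Accuracy at obfuscation levels 0..4 (illustrative numbers only).
contaminated_curve = [0.90, 0.88, 0.85, 0.50, 0.20]  # cliff once cues vanish
unseen_curve       = [0.80, 0.70, 0.60, 0.50, 0.40]  # gradual decay

assert max_step_drop(contaminated_curve) > max_step_drop(unseen_curve)
```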
arXiv Detail & Related papers (2025-05-29T16:11:18Z) - Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models [28.20124264650572]
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across tasks. They often exhibit difficulty in distinguishing task-relevant from irrelevant signals, particularly in tasks like Visual Question Answering (VQA). This vulnerability becomes more evident in modality-specific tasks such as image classification or pure text question answering. We propose a novel framework to fine-tune MLLMs, including perturbation-based data augmentation with both heuristic and adversarial perturbations.
arXiv Detail & Related papers (2025-05-26T07:31:32Z) - Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the large language model (LLM) era is the generalization issue. We propose the Model Utilization Index (MUI), a mechanism-interpretability-enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z) - SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Analysis of these prompt scores reveals VLM biases and "AND"/"OR" signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
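The max-versus-second-highest observation can be made concrete with a toy read-out. This is a minimal illustration of the comparison only, with assumed numbers; it is not SPARC's actual scoring or fusion algorithm:

```python
# Toy illustration (assumed scores, not SPARC's method): when several prompts
# yield scores for one label, the top score may be an outlier driven by VLM
# bias, which is why the abstract reports the second-highest score as a
# surprisingly better read-out.

def kth_highest(scores: list[float], k: int) -> float:
    """Return the k-th highest score (k=1 is the maximum)."""
    return sorted(scores, reverse=True)[k - 1]

prompt_scores = [0.91, 0.78, 0.30, 0.10]  # scores from several prompts, one label
max_score = kth_highest(prompt_scores, 1)   # possibly an outlier
second    = kth_highest(prompt_scores, 2)   # more robust per the abstract

assert second <= max_score
```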
arXiv Detail & Related papers (2025-02-24T07:15:05Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs [38.93090238335506]
Spurious bias, a tendency to use spurious correlations between non-essential input attributes and target variables for predictions, has revealed a severe pitfall in deep learning models trained on single modality data.
We introduce MM-SpuBench, a comprehensive visual question-answering (VQA) benchmark designed to evaluate MLLMs' reliance on nine distinct categories of spurious correlations.
Our findings illuminate the persistence of the reliance on spurious correlations from these models and underscore the urge for new methodologies to mitigate spurious biases.
arXiv Detail & Related papers (2024-06-24T20:29:16Z) - Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models [13.532180752491954]
Large Language Models (LLMs) are often described as instances of foundation models that possess strong generalization obeying scaling laws. We demonstrate here a dramatic breakdown of generalization and basic reasoning in all SOTA models claiming strong function. We also observe strong overconfidence in wrong solutions, expressed in the form of plausible-sounding, explanation-like confabulations.
arXiv Detail & Related papers (2024-06-04T07:43:33Z) - SHIELD : An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models [61.8876114116716]
Multimodal large language models (MLLMs) have demonstrated strong capabilities in vision-related tasks. However, their ability to detect subtle visual spoofing and forgery clues in face attack detection tasks remains underexplored. We introduce a benchmark, SHIELD, to evaluate MLLMs for face spoofing and forgery detection.
arXiv Detail & Related papers (2024-02-06T17:31:36Z) - Task-Distributionally Robust Data-Free Meta-Learning [99.56612787882334]
Data-Free Meta-Learning (DFML) aims to efficiently learn new tasks by leveraging multiple pre-trained models without requiring their original training data.
For the first time, we reveal two major challenges hindering their practical deployment: Task-Distribution Shift (TDS) and Task-Distribution Corruption (TDC).
arXiv Detail & Related papers (2023-11-23T15:46:54Z) - Revisit Input Perturbation Problems for LLMs: A Unified Robustness Evaluation Framework for Noisy Slot Filling Task [18.623619585980688]
We propose a unified robustness evaluation framework based on the slot-filling task to evaluate the dialogue understanding capability of large language models.
Specifically, we construct an input perturbation evaluation dataset, Noise-LLM, which contains five types of single perturbations and four types of mixed perturbations.
Our aim is to assess how well various robustness methods of LLMs perform in real-world noisy scenarios.
arXiv Detail & Related papers (2023-10-10T10:22:05Z) - Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models [80.23791222509644]
Inconsistent AI models are considered brittle and untrustworthy by human users.
We find that state-of-the-art vision-language models suffer from a surprisingly high degree of inconsistent behavior across tasks.
We propose a rank correlation-based auxiliary training objective, computed over large automatically created cross-task contrast sets.
arXiv Detail & Related papers (2023-03-28T16:57:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.