Alignment Drift in Multimodal LLMs: A Two-Phase, Longitudinal Evaluation of Harm Across Eight Model Releases
- URL: http://arxiv.org/abs/2602.04739v1
- Date: Wed, 04 Feb 2026 16:42:02 GMT
- Title: Alignment Drift in Multimodal LLMs: A Two-Phase, Longitudinal Evaluation of Harm Across Eight Model Releases
- Authors: Casey Ford, Madison Van Doren, Emily Dix
- Abstract summary: Multimodal large language models (MLLMs) are increasingly deployed in real-world systems, yet their safety under adversarial prompting remains underexplored. We present a two-phase evaluation of MLLM harmlessness using a fixed benchmark of 726 adversarial prompts authored by 26 professional red teamers.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal large language models (MLLMs) are increasingly deployed in real-world systems, yet their safety under adversarial prompting remains underexplored. We present a two-phase evaluation of MLLM harmlessness using a fixed benchmark of 726 adversarial prompts authored by 26 professional red teamers. Phase 1 assessed GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus; Phase 2 evaluated their successors (GPT-5, Claude Sonnet 4.5, Pixtral Large, and Qwen Omni), yielding 82,256 human harm ratings. Large, persistent differences emerged across model families: Pixtral models were consistently the most vulnerable, whereas Claude models appeared safest due to high refusal rates. Attack success rates (ASR) showed clear alignment drift: GPT and Claude models exhibited increased ASR across generations, while Pixtral and Qwen showed modest decreases. Modality effects also shifted over time: text-only prompts were more effective in Phase 1, whereas Phase 2 produced model-specific patterns, with GPT-5 and Claude 4.5 showing near-equivalent vulnerability across modalities. These findings demonstrate that MLLM harmlessness is neither uniform nor stable across updates, underscoring the need for longitudinal, multimodal benchmarks to track evolving safety behaviour.
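The abstract does not specify how the 82,256 human harm ratings are aggregated into attack success rates, so the following is a minimal illustrative sketch, assuming binary harm judgments per prompt and a strict majority vote across annotators; the function name and data layout are hypothetical, not from the paper:

```python
from collections import defaultdict

def attack_success_rate(ratings):
    """Aggregate per-prompt majority votes into per-(model, phase) ASR.

    `ratings` is an iterable of (model, phase, prompt_id, harmful) tuples,
    where `harmful` is a 0/1 human judgment. Illustrative reconstruction
    only, not the paper's published scoring code.
    """
    votes = defaultdict(list)  # (model, phase, prompt_id) -> list of 0/1 votes
    for model, phase, prompt_id, harmful in ratings:
        votes[(model, phase, prompt_id)].append(harmful)

    successes = defaultdict(int)  # (model, phase) -> prompts judged harmful
    totals = defaultdict(int)     # (model, phase) -> prompts rated
    for (model, phase, _prompt_id), harm_votes in votes.items():
        totals[(model, phase)] += 1
        # A prompt counts as a successful attack only on a strict majority;
        # ties count as safe.
        successes[(model, phase)] += int(2 * sum(harm_votes) > len(harm_votes))

    return {key: successes[key] / totals[key] for key in totals}

demo = [
    ("GPT-5", 2, "p001", 1), ("GPT-5", 2, "p001", 1),  # both raters: harmful
    ("GPT-5", 2, "p002", 0), ("GPT-5", 2, "p002", 1),  # split vote: safe
]
print(attack_success_rate(demo))  # -> {('GPT-5', 2): 0.5}
```

Under these assumptions, "alignment drift" is simply the change in this per-model ratio between Phase 1 and Phase 2 on the fixed 726-prompt benchmark.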
Related papers
- In-Context Environments Induce Evaluation-Awareness in Language Models
Humans often become more self-aware under threat, yet can lose self-awareness when absorbed in a task. We introduce a black-box adversarial optimization framework treating the in-context prompt as an optimizable environment. We show that adversarially optimized prompts pose a substantially greater threat to evaluation reliability than previously understood.
arXiv Detail & Related papers (2026-03-04T08:22:02Z)
- Evaluating Performance Drift from Model Switching in Multi-Turn LLM Systems
Deployed multi-turn LLM systems routinely switch models mid-interaction due to upgrades, cross-provider routing, and fallbacks. We introduce a switch-matrix benchmark that measures this effect by running a prefix model for early turns and a suffix model for the final turn. Even a single-turn handoff yields prevalent, statistically significant, directional effects.
arXiv Detail & Related papers (2026-03-03T15:44:57Z)
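As a concrete illustration of the switch-matrix protocol summarized above, here is a minimal sketch of a single prefix/suffix cell; `call_model` is a hypothetical stand-in for whichever provider client is in use, not an API from the paper:

```python
def call_model(model: str, messages: list[dict]) -> str:
    # Hypothetical chat-completion client; wire up a real provider here.
    raise NotImplementedError("replace with an actual provider call")

def switch_cell(prefix_model: str, suffix_model: str,
                user_turns: list[str]) -> str:
    """Run all but the last user turn on `prefix_model`, then hand the
    final turn to `suffix_model`, keeping the shared transcript."""
    messages: list[dict] = []
    for turn in user_turns[:-1]:
        messages.append({"role": "user", "content": turn})
        reply = call_model(prefix_model, messages)
        messages.append({"role": "assistant", "content": reply})
    messages.append({"role": "user", "content": user_turns[-1]})
    return call_model(suffix_model, messages)
```

Comparing `switch_cell(a, b, turns)` against the no-switch baseline `switch_cell(a, a, turns)` over many conversations isolates the single-turn handoff effect the abstract reports.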
- A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5
Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have driven major gains in reasoning, perception, and generation across language and vision. We present an integrated safety evaluation of six frontier models (GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5), assessing each across language, vision-language, and image generation.
arXiv Detail & Related papers (2026-01-15T15:52:52Z)
- Parrot: Persuasion and Agreement Robustness Rating of Output Truth -- A Sycophancy Robustness Benchmark for LLMs
PARROT (Persuasion and Agreement Robustness Rating of Output Truth) is a robustness-focused framework designed to measure the degradation in accuracy under social pressure from users. We evaluate 22 models using 1,302 MMLU-style multiple-choice questions across 13 domains and domain-specific authority templates.
arXiv Detail & Related papers (2025-11-21T13:01:28Z)
- DUAL-Bench: Measuring Over-Refusal and Robustness in Vision-Language Models
Safety mechanisms can backfire, causing over-refusal, where models decline benign requests out of excessive caution. No existing benchmark has systematically addressed over-refusal in the visual modality. This setting introduces unique challenges, such as dual-use cases where an instruction is harmless but the accompanying image contains harmful content.
arXiv Detail & Related papers (2025-10-12T23:21:34Z)
- Scheming Ability in LLM-to-LLM Strategic Interactions
Large language model (LLM) agents are deployed autonomously in diverse contexts. We investigate the ability and propensity of frontier LLM agents to scheme through two game-theoretic frameworks, testing four models (GPT-4o, Gemini-2.5-pro, Claude-3.7-Sonnet, and Llama-3.3-70b).
arXiv Detail & Related papers (2025-10-11T04:42:29Z)
- SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models
Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial prompts. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. We propose SaFeR-VLM, which integrates four components and supports dynamic and interpretable safety decisions beyond surface-level filtering.
arXiv Detail & Related papers (2025-10-08T10:39:12Z)
- Red Teaming Multimodal Language Models: Evaluating Harm Across Prompt Modalities and Models
Multimodal large language models (MLLMs) are increasingly used in real-world applications, yet their safety under adversarial conditions remains underexplored. This study evaluates the harmlessness of four leading MLLMs when exposed to adversarial prompts across text-only and multimodal formats.
arXiv Detail & Related papers (2025-09-18T22:51:06Z)
- Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency
Large language models (LLMs) have demonstrated impressive capabilities across diverse tasks, but they remain susceptible to hallucinations: generating content that appears plausible but contains factual inaccuracies. We present Finch-Zk, a black-box framework that leverages fine-grained cross-model consistency to detect and mitigate hallucinations.
arXiv Detail & Related papers (2025-08-19T23:45:34Z)
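To make the cross-model-consistency idea behind Finch-Zk concrete, here is a toy sketch (not the authors' method) that flags sentence pairs on which two models' answers to the same prompt diverge:

```python
import difflib

def flag_inconsistent_segments(answer_a: str, answer_b: str,
                               threshold: float = 0.6):
    """Pair up sentences from two models' answers and flag pairs with low
    textual similarity as candidate hallucination sites. Finch-Zk's
    actual method is finer-grained; this only illustrates the
    cross-model-consistency idea from the summary above."""
    sents_a = [s.strip() for s in answer_a.split(".") if s.strip()]
    sents_b = [s.strip() for s in answer_b.split(".") if s.strip()]
    flagged = []
    for sent_a, sent_b in zip(sents_a, sents_b):
        similarity = difflib.SequenceMatcher(None, sent_a, sent_b).ratio()
        if similarity < threshold:
            flagged.append((sent_a, sent_b, round(similarity, 2)))
    return flagged
```

Segments on which independent models disagree become candidates for mitigation, for example regeneration or abstention, without any access to model internals (hence "zero-knowledge" in the black-box sense).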
- When Punctuation Matters: A Large-Scale Comparison of Prompt Robustness Methods for LLMs
We present the first systematic evaluation of 5 methods for improving prompt robustness within a unified experimental framework. We benchmark these techniques on 8 models from the Llama, Qwen, and Gemma families across 52 tasks from the Natural Instructions dataset.
arXiv Detail & Related papers (2025-08-15T10:32:50Z)
- MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks
MIRAGE is a novel framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in Multimodal Large Language Models. It achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines. We demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards.
arXiv Detail & Related papers (2025-03-24T20:38:42Z)
- Benchmarking Gaslighting Negation Attacks Against Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) have exhibited remarkable advancements in integrating different modalities. Despite their success, MLLMs remain vulnerable to conversational adversarial inputs. We study gaslighting negation attacks: a phenomenon where models, despite initially providing correct answers, are persuaded by user-provided negations to reverse their outputs.
arXiv Detail & Related papers (2025-01-31T10:37:48Z)
- How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts
We present MAD-Bench, a benchmark that contains 1,000 test samples divided into 5 categories, such as non-existent objects, object counts, and spatial relationships.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, and Gemini-Pro to open-sourced models such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of every other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z)
- Flames: Benchmarking Value Alignment of LLMs in Chinese
This paper proposes a value alignment benchmark named Flames.
It encompasses both common harmlessness principles and a unique morality dimension that integrates specific Chinese values.
Our findings indicate that all the evaluated LLMs demonstrate relatively poor performance on Flames.
arXiv Detail & Related papers (2023-11-12T17:18:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.