Related papers: Black-Box Behavioral Distillation Breaks Safety Alignment in Medical LLMs

Black-Box Behavioral Distillation Breaks Safety Alignment in Medical LLMs

URL: http://arxiv.org/abs/2512.09403v1
Date: Wed, 10 Dec 2025 07:57:08 GMT
Title: Black-Box Behavioral Distillation Breaks Safety Alignment in Medical LLMs
Authors: Sohely Jahan, Ruimin Sun,
Abstract summary: We present a black-box distillation attack that replicates the domain-specific reasoning of safety-aligned medical LLMs.<n>We fine-tune a LLaMA3 8B surrogate via a zero-alignment supervision setting, requiring no access to model weights, safety filters, or training data.<n>With a cost of $12, the surrogate achieves strong fidelity on benign inputs while producing unsafe completions for 86% of adversarial prompts.
Score: 4.357945243163354
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As medical large language models (LLMs) become increasingly integrated into clinical workflows, concerns around alignment robustness, and safety are escalating. Prior work on model extraction has focused on classification models or memorization leakage, leaving the vulnerability of safety-aligned generative medical LLMs underexplored. We present a black-box distillation attack that replicates the domain-specific reasoning of safety-aligned medical LLMs using only output-level access. By issuing 48,000 instruction queries to Meditron-7B and collecting 25,000 benign instruction response pairs, we fine-tune a LLaMA3 8B surrogate via parameter efficient LoRA under a zero-alignment supervision setting, requiring no access to model weights, safety filters, or training data. With a cost of $12, the surrogate achieves strong fidelity on benign inputs while producing unsafe completions for 86% of adversarial prompts, far exceeding both Meditron-7B (66%) and the untuned base model (46%). This reveals a pronounced functional-ethical gap, task utility transfers, while alignment collapses. To analyze this collapse, we develop a dynamic adversarial evaluation framework combining Generative Query (GQ)-based harmful prompt generation, verifier filtering, category-wise failure analysis, and adaptive Random Search (RS) jailbreak attacks. We also propose a layered defense system, as a prototype detector for real-time alignment drift in black-box deployments. Our findings show that benign-only black-box distillation exposes a practical and under-recognized threat: adversaries can cheaply replicate medical LLM capabilities while stripping safety mechanisms, underscoring the need for extraction-aware safety monitoring.

Related papers

Health-ORSC-Bench: A Benchmark for Measuring Over-Refusal and Safety Completion in Health Context [82.32380418146656]
Health-ORSC-Bench is the first large-scale benchmark designed to measure textbfOver-Refusal and textbfSafe Completion quality in healthcare.<n>Our framework uses an automated pipeline with human validation to test models at varying levels of intent ambiguity.<n>Health-ORSC-Bench provides a rigorous standard for calibrating the next generation of medical AI assistants.
arXiv Detail & Related papers (2026-01-25T01:28:52Z)
What Matters For Safety Alignment? [38.86339753409445]
This paper presents a comprehensive empirical study on the safety alignment capabilities of AI systems.<n>We systematically investigate and compare the influence of six critical intrinsic model characteristics and three external attack techniques.<n>We identify the LRMs GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the top-three safest models.
arXiv Detail & Related papers (2026-01-07T12:31:52Z)
DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics.<n>We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z)
MedOmni-45°: A Safety-Performance Benchmark for Reasoning-Oriented LLMs in Medicine [69.08855631283829]
We introduce Med Omni-45 Degrees, a benchmark designed to quantify safety-performance trade-offs under manipulative hint conditions.<n>It contains 1,804 reasoning-focused medical questions across six specialties and three task types, including 500 from MedMCQA.<n>Results show a consistent safety-performance trade-off, with no model surpassing the diagonal.
arXiv Detail & Related papers (2025-08-22T08:38:16Z)
Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models [87.66870367661342]
Large language models (LLMs) are used in AI applications in healthcare.<n>Red-teaming framework that continuously stress-test LLMs can reveal significant weaknesses in four safety-critical domains.<n>A suite of adversarial agents is applied to autonomously mutate test cases, identify/evolve unsafe-triggering strategies, and evaluate responses.<n>Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
arXiv Detail & Related papers (2025-07-30T08:44:22Z)
From Threat to Tool: Leveraging Refusal-Aware Injection Attacks for Safety Alignment [4.379304291229695]
We introduce Refusal-Aware Adaptive Injection (RAAI), a training-free, and model-agnostic framework that repurposes LLM attack techniques.<n> RAAI works by detecting internal refusal signals and adaptively injecting predefined phrases to elicit harmful, yet fluent, completions.<n>Our experiments show RAAI effectively jailbreaks LLMs, increasing the harmful response rate from a baseline of 2.15% to up to 61.04% on average.
arXiv Detail & Related papers (2025-06-07T08:19:01Z)
Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs [54.90315421117162]
We propose a novel poisoning method via completely harmless data.<n>Inspired by the causal reasoning in auto-regressive LLMs, we aim to establish robust associations between triggers and an affirmative response prefix.<n>We observe an interesting resistance phenomenon where the LLM initially appears to agree but subsequently refuses to answer.
arXiv Detail & Related papers (2025-05-23T08:13:59Z)
CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs [7.597770587484936]
We introduce CARES (Clinical Adversarial Robustness and Evaluation of Safety), a benchmark for evaluating medical large language models (LLMs) safety in healthcare.<n> CARES includes over 18,000 prompts spanning eight medical safety principles, four harm levels, and four prompting styles to simulate both malicious and benign use cases.<n>Our analysis reveals that many state-of-the-art LLMs remain vulnerable to jailbreaks that subtly rephrase harmful prompts, while also over-refusing safe but atypically phrased queries.
arXiv Detail & Related papers (2025-05-16T16:25:51Z)
Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking [14.541887120849687]
jailbreak attacks based on prompt engineering have become a major safety threat.<n>This study introduces the concept of Defense Threshold Decay (DTD), revealing the potential safety impact caused by LLMs' benign generation.<n>We propose the Sugar-Coated Poison attack paradigm, which uses a "semantic reversal" strategy to craft benign inputs that are opposite in meaning to malicious intent.
arXiv Detail & Related papers (2025-04-08T03:57:09Z)
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.<n>DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence.
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
OR-Bench: An Over-Refusal Benchmark for Large Language Models [65.34666117785179]
Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs.<n>This study proposes a novel method for automatically generating large-scale over-refusal datasets.<n>We introduce OR-Bench, the first large-scale over-refusal benchmark.
arXiv Detail & Related papers (2024-05-31T15:44:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.