Ethical Risks in Deploying Large Language Models: An Evaluation of Medical Ethics Jailbreaking
- URL: http://arxiv.org/abs/2601.12652v1
- Date: Mon, 19 Jan 2026 01:52:34 GMT
- Title: Ethical Risks in Deploying Large Language Models: An Evaluation of Medical Ethics Jailbreaking
- Authors: Chutian Huang, Dake Cao, Jiacheng Ji, Yunlou Fan, Chengze Yan, Hanhui Xu,
- Abstract summary: Malicious prompt engineering specifically "jailbreak attacks" poses severe security risks by inducing models to bypass internal safety mechanisms.<n>Current benchmarks predominantly focus on public safety and Western cultural norms, leaving a critical gap in evaluating the niche, high-risk domain of medical ethics within the Chinese context.<n>We evaluated seven prominent models (e.g., GPT-5, Claude-Sonnet-4-Reasoning, DeepSeek-R1) using a "role-playing + scenario simulation + multi-turn dialogue" vector within the DeepInception framework.
- Score: 0.49259062564301753
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background: While Large Language Models (LLMs) have achieved widespread adoption, malicious prompt engineering specifically "jailbreak attacks" poses severe security risks by inducing models to bypass internal safety mechanisms. Current benchmarks predominantly focus on public safety and Western cultural norms, leaving a critical gap in evaluating the niche, high-risk domain of medical ethics within the Chinese context. Objective: To establish a specialized jailbreak evaluation framework for Chinese medical ethics and to systematically assess the defensive resilience and ethical alignment of seven prominent LLMs when subjected to sophisticated adversarial simulations. Methodology: We evaluated seven prominent models (e.g., GPT-5, Claude-Sonnet-4-Reasoning, DeepSeek-R1) using a "role-playing + scenario simulation + multi-turn dialogue" vector within the DeepInception framework. The testing focused on eight high-risk themes, including commercial surrogacy and organ trading, utilizing a hierarchical scoring matrix to quantify the Attack Success Rate (ASR) and ASR Gain. Results: A systemic collapse of defenses was observed, whereas models demonstrated high baseline compliance, the jailbreak ASR reached 82.1%, representing an ASR Gain of over 80 percentage points. Claude-Sonnet-4-Reasoning emerged as the most robust model, while five models including Gemini-2.5-Pro and GPT-4.1 exhibited near-total failure with ASRs between 96% and 100%. Conclusions: Current LLMs are highly vulnerable to contextual manipulation in medical ethics, often prioritizing "helpfulness" over safety constraints. To enhance security, we recommend a transition from outcome to process supervision, the implementation of multi-factor identity verification, and the establishment of cross-model "joint defense" mechanisms.
Related papers
- Improving the Safety and Trustworthiness of Medical AI via Multi-Agent Evaluation Loops [1.412167203558403]
Large Language Models (LLMs) are increasingly applied in healthcare, yet ensuring their ethical integrity and safety compliance remains a major barrier to clinical deployment.<n>This work introduces a multi-agent refinement framework designed to enhance the safety and reliability of medical LLMs through structured, iterative alignment.
arXiv Detail & Related papers (2026-01-19T18:10:34Z) - Replicating TEMPEST at Scale: Multi-Turn Adversarial Attacks Against Trillion-Parameter Frontier Models [0.0]
This study employed the TEMPEST multi-turn attack framework to evaluate ten frontier models from eight vendors across 1,000 harmful behaviors.<n>Six models achieved 96% to 100% attack success rate (ASR), while four showed meaningful resistance, with ASR ranging from 42% to 78%.
arXiv Detail & Related papers (2025-12-08T00:30:40Z) - SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models [60.8821834954637]
We present SafeRBench, the first benchmark that assesses LRM safety end-to-end.<n>We pioneer the incorporation of risk categories and levels into input design.<n>We introduce a micro-thought chunking mechanism to segment long reasoning traces into semantically coherent units.
arXiv Detail & Related papers (2025-11-19T06:46:33Z) - Diagnosing Hallucination Risk in AI Surgical Decision-Support: A Sequential Framework for Sequential Validation [5.469454486414467]
Large language models (LLMs) offer transformative potential for clinical decision support in spine surgery.<n>LLMs pose significant risks through hallucinations, which are factually inconsistent or contextually misaligned outputs.<n>This study introduces a clinician-centered framework to quantify hallucination risks by evaluating diagnostic precision, recommendation quality, reasoning robustness, output coherence, and knowledge alignment.
arXiv Detail & Related papers (2025-11-01T15:25:55Z) - SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models [66.71948519280669]
Multimodal Large Reasoning Models (MLRMs) demonstrate impressive crossmodal reasoning but often amplify safety risks under adversarial prompts.<n> Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models to implicit risks.<n>We propose SaFeR-VLM, which integrates four components and supports dynamic and interpretable safety decisions beyond surface-level filtering.
arXiv Detail & Related papers (2025-10-08T10:39:12Z) - RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration [81.38705556267917]
Existing safety evaluation methods for large language models (LLMs) suffer from inherent limitations.<n>We introduce a theoretical framework that reconstructs the underlying risk concept space.<n>We propose RADAR, a multi-agent collaborative evaluation framework.
arXiv Detail & Related papers (2025-09-28T09:35:32Z) - Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models [87.66870367661342]
Large language models (LLMs) are used in AI applications in healthcare.<n>Red-teaming framework that continuously stress-test LLMs can reveal significant weaknesses in four safety-critical domains.<n>A suite of adversarial agents is applied to autonomously mutate test cases, identify/evolve unsafe-triggering strategies, and evaluate responses.<n>Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
arXiv Detail & Related papers (2025-07-30T08:44:22Z) - Exploring the Secondary Risks of Large Language Models [26.00748215572094]
We introduce secondary risks marked by harmful or misleading behaviors during benign prompts.<n>Unlike adversarial attacks, these risks stem from imperfect generalization and often evade standard safety mechanisms.<n>We propose SecLens, a black-box, multi-objective search framework that efficiently elicits secondary risk behaviors.
arXiv Detail & Related papers (2025-06-14T07:31:52Z) - SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models [50.34706204154244]
Acquiring reasoning capabilities catastrophically degrades inherited safety alignment.<n>Certain scenarios suffer 25 times higher attack rates.<n>Despite tight reasoning-answer safety coupling, MLRMs demonstrate nascent self-correction.
arXiv Detail & Related papers (2025-04-09T06:53:23Z) - Benchmarking Chinese Medical LLMs: A Medbench-based Analysis of Performance Gaps and Hierarchical Optimization Strategies [11.0505830548286]
This study introduces a granular error taxonomy through systematic analysis of top 10 models on MedBench.<n> Evaluation of 10 leading models reveals vulnerabilities, despite achieving 0.86 accuracy in medical knowledge recall.<n>Our analysis uncovers systemic weaknesses in knowledge boundary enforcement and multi-step reasoning.
arXiv Detail & Related papers (2025-03-10T13:28:25Z) - AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons [62.374792825813394]
This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability.<n>The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories.
arXiv Detail & Related papers (2025-02-19T05:58:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.