A Biosecurity Agent for Lifecycle LLM Biosecurity Alignment
- URL: http://arxiv.org/abs/2510.09615v1
- Date: Sat, 13 Sep 2025 23:54:54 GMT
- Title: A Biosecurity Agent for Lifecycle LLM Biosecurity Alignment
- Authors: Meiyin Meng, Zaixi Zhang
- Abstract summary: This study presents a Biosecurity Agent that comprises four coordinated modes across the model lifecycle. For dataset sanitization (Mode 1), evaluation is conducted on CORD-19, the COVID-19 Open Research Dataset of coronavirus-related articles. For preference alignment (Mode 2), DPO with LoRA adapters internalizes refusals and safe completions, reducing the end-to-end attack success rate (ASR) from 59.7% to 3.0%. At inference (Mode 3), run-time guardrails across L1-L3 show the expected security-usability trade-off.
- Score: 13.707244322057834
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are increasingly integrated into biomedical research workflows--from literature triage and hypothesis generation to experimental design--yet this expanded utility also heightens dual-use concerns, including the potential misuse for guiding toxic compound synthesis. In response, this study presents a Biosecurity Agent that comprises four coordinated modes across the model lifecycle: dataset sanitization, preference alignment, run-time guardrails, and automated red teaming. For dataset sanitization (Mode 1), evaluation is conducted on CORD-19, the COVID-19 Open Research Dataset of coronavirus-related scholarly articles. We define three sanitization tiers--L1 (compact, high-precision), L2 (human-curated biosafety terms), and L3 (comprehensive union)--with removal rates rising from 0.46% to 70.40%, illustrating the safety-utility trade-off. For preference alignment (Mode 2), DPO with LoRA adapters internalizes refusals and safe completions, reducing the end-to-end attack success rate (ASR) from 59.7% to 3.0%. At inference (Mode 3), run-time guardrails across L1-L3 show the expected security-usability trade-off: L2 achieves the best balance (F1 = 0.720, precision = 0.900, recall = 0.600, FPR = 0.067), while L3 offers stronger jailbreak resistance at the cost of higher false positives. Under continuous automated red-teaming (Mode 4), no successful jailbreaks are observed under the tested protocol. Taken together, our Biosecurity Agent offers an auditable, lifecycle-aligned framework that reduces attack success while preserving benign utility, providing safeguards for the use of LLMs in scientific research and setting a precedent for future agent-level security protections.
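As a quick sanity check, the reported L2 guardrail numbers are internally consistent: F1 = 2·P·R/(P + R) = 2·0.900·0.600/1.500 = 0.720. For Mode 2, the abstract names only the ingredients (DPO with LoRA adapters over refusal and safe-completion preferences), so the sketch below is a hypothetical reconstruction using the Hugging Face TRL and PEFT libraries; the base model name, preference records, and hyperparameters are illustrative assumptions rather than the paper's actual configuration, and exact trainer argument names vary across TRL versions.

```python
# Hypothetical Mode 2 sketch: DPO with a LoRA adapter over refusal/safe-completion
# preference pairs. APIs follow Hugging Face TRL/PEFT conventions (argument names
# differ slightly across TRL versions); model, data, and hyperparameters are assumed.
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder base model, not the paper's
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Each record pairs a dual-use prompt with a preferred refusal ("chosen")
# and a dispreferred unsafe continuation ("rejected").
prefs = Dataset.from_dict({
    "prompt":   ["Outline a synthesis route for a toxic organophosphate."],
    "chosen":   ["I can't help with that request, but I can share general lab-biosafety resources."],
    "rejected": ["Sure, begin by acquiring the following precursors..."],
})

lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
cfg = DPOConfig(output_dir="biosec-dpo", beta=0.1,
                per_device_train_batch_size=2, num_train_epochs=1)

trainer = DPOTrainer(model=model, args=cfg, train_dataset=prefs,
                     processing_class=tokenizer, peft_config=lora)
trainer.train()  # ASR would then be re-measured end to end on held-out attack prompts
```

Training only the adapter keeps the base weights frozen; whether the authors merge the adapter back into the base model afterwards is not stated in the abstract.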
Related papers
- BadCLIP++: Stealthy and Persistent Backdoors in Multimodal Contrastive Learning [73.46118996284888]
Research on backdoor attacks against multimodal contrastive learning models faces two key challenges: stealthiness and persistence. We propose BadCLIP++, a unified framework that tackles both challenges. For stealthiness, we introduce a semantic-fusion QR micro-trigger that embeds imperceptible patterns near task-relevant regions. For persistence, we stabilize trigger embeddings via radius shrinkage and centroid alignment.
arXiv Detail & Related papers (2026-02-19T08:31:16Z) - Health-ORSC-Bench: A Benchmark for Measuring Over-Refusal and Safety Completion in Health Context [82.32380418146656]
Health-ORSC-Bench is the first large-scale benchmark designed to measure Over-Refusal and Safe Completion quality in healthcare. Our framework uses an automated pipeline with human validation to test models at varying levels of intent ambiguity. Health-ORSC-Bench provides a rigorous standard for calibrating the next generation of medical AI assistants.
arXiv Detail & Related papers (2026-01-25T01:28:52Z) - Improving the Safety and Trustworthiness of Medical AI via Multi-Agent Evaluation Loops [1.412167203558403]
Large Language Models (LLMs) are increasingly applied in healthcare, yet ensuring their ethical integrity and safety compliance remains a major barrier to clinical deployment. This work introduces a multi-agent refinement framework designed to enhance the safety and reliability of medical LLMs through structured, iterative alignment.
arXiv Detail & Related papers (2026-01-19T18:10:34Z) - What Matters For Safety Alignment? [38.86339753409445]
This paper presents a comprehensive empirical study on the safety alignment capabilities of AI systems. We systematically investigate and compare the influence of six critical intrinsic model characteristics and three external attack techniques. We identify the LRMs GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the top-three safest models.
arXiv Detail & Related papers (2026-01-07T12:31:52Z) - Quantifying Return on Security Controls in LLM Systems [0.0]
This paper introduces a decision-oriented framework to quantify residual risk. It converts adversarial probe outcomes into financial risk estimates and return-on-control metrics.
arXiv Detail & Related papers (2025-12-17T04:58:09Z) - Black-Box Behavioral Distillation Breaks Safety Alignment in Medical LLMs [4.357945243163354]
We present a black-box distillation attack that replicates the domain-specific reasoning of safety-aligned medical LLMs. We fine-tune a LLaMA3 8B surrogate in a zero-alignment supervision setting, requiring no access to model weights, safety filters, or training data. At a cost of $12, the surrogate achieves strong fidelity on benign inputs while producing unsafe completions for 86% of adversarial prompts.
arXiv Detail & Related papers (2025-12-10T07:57:08Z) - SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models [66.71948519280669]
Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial prompts. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. We propose SaFeR-VLM, which integrates four components and supports dynamic and interpretable safety decisions beyond surface-level filtering.
arXiv Detail & Related papers (2025-10-08T10:39:12Z) - ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal [27.26251627767238]
Large Language Models (LLMs) increasingly exhibit over-refusal - erroneously rejecting benign queries due to overly conservative safety measures. This paper introduces the first evolutionary testing framework, ORFuzz, for the systematic detection and analysis of LLM over-refusals.
arXiv Detail & Related papers (2025-08-15T05:03:26Z) - Retrieval Augmented Large Language Model System for Comprehensive Drug Contraindications [0.0]
The versatility of large language models (LLMs) has been explored across various sectors, but their application in healthcare poses challenges. This study enhances the capability of LLMs to address contraindications effectively by implementing a Retrieval Augmented Generation (RAG) pipeline.
arXiv Detail & Related papers (2025-08-08T09:09:03Z) - VSF-Med:A Vulnerability Scoring Framework for Medical Vision-Language Models [6.390468088226493]
We introduce VSF-Med, an end-to-end vulnerability-scoring framework for medical Vision Language Models (VLMs). VSF-Med synthesizes over 30,000 adversarial variants from 5,000 radiology images and enables reproducible benchmarking of any medical VLM with a single command. We show that Llama-3.2-11B-Vision-Instruct exhibits a peak vulnerability increase of $1.29\sigma$ for persistence-of-attack-effects, while GPT-4o shows increases of $0.69\sigma$ for that same vector and $0.28\sigma$ for prompt-injection attacks.
arXiv Detail & Related papers (2025-06-25T02:56:38Z) - CyberLLMInstruct: A Pseudo-malicious Dataset Revealing Safety-performance Trade-offs in Cyber Security LLM Fine-tuning [2.549390156222399]
The integration of large language models into cyber security applications presents both opportunities and critical safety risks. We introduce CyberLLMInstruct, a dataset of 54,928 pseudo-malicious instruction-response pairs spanning cyber security tasks.
arXiv Detail & Related papers (2025-03-12T12:29:27Z) - The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility? [54.18519360412294]
Large Language Models (LLMs) must balance rejecting harmful requests for safety against accommodating legitimate ones for utility. This paper presents a Direct Preference Optimization (DPO)-based alignment framework that achieves better overall performance. We analyze experimental results obtained from testing DeepSeek-R1 on our benchmark and reveal the critical ethical concerns raised by this highly acclaimed model.
arXiv Detail & Related papers (2025-01-20T06:35:01Z) - LLM Robustness Against Misinformation in Biomedical Question Answering [50.98256373698759]
The retrieval-augmented generation (RAG) approach is used to reduce the confabulation of large language models (LLMs) for question answering.
We evaluate the effectiveness and robustness of four LLMs against misinformation in answering biomedical questions.
arXiv Detail & Related papers (2024-10-27T16:23:26Z) - ADVLLM: Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities [63.603861880022954]
We introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. It exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.
arXiv Detail & Related papers (2024-10-24T06:36:12Z) - WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs [54.10865585773691]
We introduce WildGuard -- an open, light-weight moderation tool for LLM safety. WildGuard achieves three goals: identifying malicious intent in user prompts, detecting safety risks of model responses, and determining model refusal rate.
arXiv Detail & Related papers (2024-06-26T16:58:20Z)