Quantifying CBRN Risk in Frontier Models
- URL: http://arxiv.org/abs/2510.21133v1
- Date: Fri, 24 Oct 2025 03:55:24 GMT
- Title: Quantifying CBRN Risk in Frontier Models
- Authors: Divyanshu Kumar, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, Prashanth Harshangi
- Abstract summary: Frontier Large Language Models (LLMs) pose unprecedented dual-use risks through the potential proliferation of chemical, biological, radiological, and nuclear (CBRN) weapons knowledge. We present the first comprehensive evaluation of 10 leading commercial LLMs against a novel CBRN dataset and a 180-prompt subset of the FORTRESS benchmark.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Frontier Large Language Models (LLMs) pose unprecedented dual-use risks through the potential proliferation of chemical, biological, radiological, and nuclear (CBRN) weapons knowledge. We present the first comprehensive evaluation of 10 leading commercial LLMs against both a novel 200-prompt CBRN dataset and a 180-prompt subset of the FORTRESS benchmark, using a rigorous three-tier attack methodology. Our findings expose critical safety vulnerabilities: Deep Inception attacks achieve 86.0% success versus 33.8% for direct requests, showing that current filtering mechanisms are superficial; model safety performance varies dramatically, with attack success rates ranging from 2% (claude-opus-4) to 96% (mistral-small-latest); and eight models exceed 70% vulnerability when asked to enhance dangerous material properties. We identify fundamental brittleness in current safety alignment, where simple prompt engineering techniques bypass safeguards for dangerous CBRN information. These results challenge industry safety claims and highlight urgent needs for standardized evaluation frameworks, transparent safety metrics, and more robust alignment techniques to mitigate catastrophic misuse risks while preserving beneficial capabilities.
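The attack success rates quoted above are simple proportions over the prompt set, computed separately for each attack strategy. As a minimal illustrative sketch (not the authors' evaluation code), the Python snippet below computes per-strategy attack success rate; `query_model` (the model API client) and `is_unsafe` (the safety judge) are hypothetical stand-ins.

```python
# Minimal sketch of per-strategy attack success rate (ASR) computation.
# `query_model` and `is_unsafe` are hypothetical stand-ins for the model
# API client and the safety judge; they are not from the paper's code.
from collections import defaultdict
from typing import Callable, Dict, Iterable


def attack_success_rates(
    prompts: Iterable[str],
    strategies: Dict[str, Callable[[str], str]],  # name -> prompt transform
    query_model: Callable[[str], str],            # hypothetical model client
    is_unsafe: Callable[[str], bool],             # hypothetical safety judge
) -> Dict[str, float]:
    """Return the fraction of prompts judged to elicit unsafe output, per strategy."""
    hits, totals = defaultdict(int), defaultdict(int)
    for prompt in prompts:
        for name, transform in strategies.items():
            response = query_model(transform(prompt))
            totals[name] += 1
            hits[name] += int(is_unsafe(response))
    return {name: hits[name] / totals[name] for name in totals}


# A "direct" tier passes prompts through unchanged; stronger tiers (e.g. a
# Deep Inception-style nested-scenario wrapper) would rewrite the prompt
# into an adversarial template before querying the model.
strategies = {"direct": lambda p: p}
```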
Related papers
- SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond [134.43113804188195]
We introduce SafeSci, a comprehensive framework for safety evaluation and enhancement in scientific contexts. SafeSci comprises SafeSciBench, a multi-disciplinary benchmark with 0.25M samples, and SafeSciTrain, a large-scale dataset containing 1.5M samples for safety enhancement.
arXiv Detail & Related papers (2026-03-02T08:16:04Z)
- Self-Guard: Defending Large Reasoning Models via enhanced self-reflection [54.775612141528164]
Self-Guard is a lightweight safety defense framework for Large Reasoning Models. It bridges the awareness-compliance gap, achieving robust safety performance without compromising model utility. Self-Guard exhibits strong generalization across diverse unseen risks and varying model scales.
arXiv Detail & Related papers (2026-01-31T13:06:11Z)
- What Matters For Safety Alignment? [38.86339753409445]
This paper presents a comprehensive empirical study on the safety alignment capabilities of AI systems. We systematically investigate and compare the influence of six critical intrinsic model characteristics and three external attack techniques. We identify the LRMs GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the top-three safest models.
arXiv Detail & Related papers (2026-01-07T12:31:52Z)
- ProGuard: Towards Proactive Multimodal Safeguard [48.89789547707647]
ProGuard is a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks. We first construct a modality-balanced dataset of 87K samples, each annotated with both binary safety labels and risk categories. We then train our vision-language base model purely through reinforcement learning to achieve efficient and concise reasoning.
arXiv Detail & Related papers (2025-12-29T16:13:23Z)
- Circumventing Safety Alignment in Large Language Models Through Embedding Space Toxicity Attenuation [13.971909819796762]
Large Language Models (LLMs) have achieved remarkable success across domains such as healthcare, education, and cybersecurity. Embedding space poisoning is a subtle attack vector where adversaries manipulate the internal semantic representations of input data to bypass safety alignment mechanisms. We propose ETTA, a novel framework that identifies and attenuates toxicity-sensitive dimensions in embedding space via linear transformations.
arXiv Detail & Related papers (2025-07-08T03:01:00Z)
- Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation [69.63626052852153]
We propose a proof-of-concept framework that projects how model-generated advice could propagate through societal systems. We also introduce a dataset of 100 indirect harm scenarios, testing models' ability to foresee adverse, non-obvious outcomes from seemingly harmless user prompts.
arXiv Detail & Related papers (2025-06-26T02:28:58Z)
- FORTRESS: Frontier Risk Evaluation for National Security and Public Safety [5.544163262906087]
Current benchmarks often fail to test safeguard robustness to potential national security and public safety risks. We introduce FORTRESS: 500 expert-crafted adversarial prompts with instance-based rubrics of 4-7 binary questions. Each prompt-rubric pair has a corresponding benign version to test for model over-refusals.
arXiv Detail & Related papers (2025-06-17T19:08:02Z)
- SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge [11.63268709958876]
SOSBench is a regulation-grounded, hazard-focused benchmark for large language models. It covers six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology. We evaluate frontier models within a unified evaluation framework using our SOSBench.
arXiv Detail & Related papers (2025-05-27T17:47:08Z)
- Safety Pretraining: Toward the Next Generation of Safe AI [68.99129474671282]
We present a data-centric pretraining framework that builds safety into the model from the start. Our framework consists of four key steps: Safety Filtering, Safety Rephrasing, Native Refusal, and Harmfulness-Tag annotated pretraining. Our safety-pretrained models reduce attack success rates from 38.8% to 8.4% on standard LLM safety benchmarks with no degradation on general performance tasks.
arXiv Detail & Related papers (2025-04-23T17:58:08Z)
- An Approach to Technical AGI Safety and Security [72.83728459135101]
We develop an approach to address the risk of harms consequential enough to significantly harm humanity. We focus on technical approaches to misuse and misalignment. We briefly outline how these ingredients could be combined to produce safety cases for AGI systems.
arXiv Detail & Related papers (2025-04-02T15:59:31Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse to comply with harmful prompts at any response position. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of a harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence.
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
- OR-Bench: An Over-Refusal Benchmark for Large Language Models [65.34666117785179]
Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. This study proposes a novel method for automatically generating large-scale over-refusal datasets. We introduce OR-Bench, the first large-scale over-refusal benchmark.
arXiv Detail & Related papers (2024-05-31T15:44:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.