Biothreat Benchmark Generation Framework for Evaluating Frontier AI Models I: The Task-Query Architecture
- URL: http://arxiv.org/abs/2512.08130v1
- Date: Tue, 09 Dec 2025 00:16:44 GMT
- Title: Biothreat Benchmark Generation Framework for Evaluating Frontier AI Models I: The Task-Query Architecture
- Authors: Gary Ackerman, Brandon Behlendorf, Zachary Kallenborn, Sheriff Almakki, Doug Clifford, Jenna LaTourette, Hayley Peterson, Noah Sheinbaum, Olivia Shoemaker, Anna Wetzel
- Abstract summary: This paper describes the first component of a novel Biothreat Benchmark Generation (BBG) Framework. The BBG approach is designed to help model developers and evaluators reliably measure and assess the biosecurity risk uplift and general harm potential of existing and future AI models. As a pilot, the BBG is first being developed to address bacterial biological threats only.
- Score: 0.38186458149494623
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Both model developers and policymakers seek to quantify and mitigate the risk of rapidly-evolving frontier artificial intelligence (AI) models, especially large language models (LLMs), to facilitate bioterrorism or access to biological weapons. An important element of such efforts is the development of model benchmarks that can assess the biosecurity risk posed by a particular model. This paper describes the first component of a novel Biothreat Benchmark Generation (BBG) Framework. The BBG approach is designed to help model developers and evaluators reliably measure and assess the biosecurity risk uplift and general harm potential of existing and future AI models, while accounting for key aspects of the threat itself that are often overlooked in other benchmarking efforts, including different actor capability levels, and operational (in addition to purely technical) risk factors. As a pilot, the BBG is first being developed to address bacterial biological threats only. The BBG is built upon a hierarchical structure of biothreat categories, elements and tasks, which then serves as the basis for the development of task-aligned queries. This paper outlines the development of this biothreat task-query architecture, which we have named the Bacterial Biothreat Schema, while future papers will describe follow-on efforts to turn queries into model prompts, as well as how the resulting benchmarks can be implemented for model evaluation. Overall, the BBG Framework, including the Bacterial Biothreat Schema, seeks to offer a robust, re-usable structure for evaluating bacterial biological risks arising from LLMs across multiple levels of aggregation, which captures the full scope of technical and operational requirements for biological adversaries, and which accounts for a wide spectrum of biological adversary capabilities.
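The abstract describes a hierarchical structure of biothreat categories, elements, and tasks that grounds task-aligned queries. A minimal sketch of such a hierarchy as a data structure is shown below; all class names, field names, and capability labels are hypothetical illustrations, not taken from the paper.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the hierarchical task-query architecture described
# in the abstract: categories -> elements -> tasks -> task-aligned queries.
# Names and fields here are hypothetical, not the paper's actual schema.

@dataclass
class Task:
    description: str
    actor_capability: str          # e.g. "novice", "intermediate", "expert"
    queries: list[str] = field(default_factory=list)

@dataclass
class Element:
    name: str
    tasks: list[Task] = field(default_factory=list)

@dataclass
class Category:
    name: str
    elements: list[Element] = field(default_factory=list)

def count_queries(schema: list[Category]) -> int:
    """Aggregate query counts across every level of the hierarchy,
    supporting evaluation at multiple levels of aggregation."""
    return sum(
        len(task.queries)
        for category in schema
        for element in category.elements
        for task in element.tasks
    )
```

Tagging each task with an actor-capability level reflects the abstract's point that benchmarks should account for a spectrum of adversary capabilities rather than a single skill profile.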
Related papers
- Benchmarking Knowledge-Extraction Attack and Defense on Retrieval-Augmented Generation [50.87199039334856]
Retrieval-Augmented Generation (RAG) has become a cornerstone of knowledge-intensive applications. Recent studies show that knowledge-extraction attacks can recover sensitive knowledge-base content through maliciously crafted queries. We introduce the first systematic benchmark for knowledge-extraction attacks on RAG systems.
arXiv Detail & Related papers (2026-02-10T01:27:46Z) - Biothreat Benchmark Generation Framework for Evaluating Frontier AI Models III: Implementing the Bacterial Biothreat Benchmark (B3) Dataset [0.38186458149494623]
This paper discusses the pilot implementation of the Bacterial Biothreat Benchmark (B3) dataset. It is the third in a series of three papers describing an overall Biothreat Benchmark Generation (BBG) framework. Overall, the pilot demonstrated that the B3 dataset offers a viable, nuanced method for rapidly assessing the biosecurity risk posed by an LLM.
arXiv Detail & Related papers (2025-12-09T10:31:02Z) - Biothreat Benchmark Generation Framework for Evaluating Frontier AI Models II: Benchmark Generation Process [0.38186458149494623]
This paper describes the second component of a novel Biothreat Benchmark Generation framework: the generation of the Bacterial Biothreat Benchmark dataset. The development process involved three complementary approaches: 1) web-based prompt generation, 2) red teaming, and 3) mining existing benchmark corpora. A process of de-duplication, followed by an assessment of uplift diagnosticity and general quality control measures, reduced the candidates to a set of 1,010 final benchmarks.
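The summary above outlines a candidate-reduction pipeline: de-duplication, then a diagnosticity assessment, then quality control. A minimal sketch of that kind of pipeline follows; the normalization rule, scoring callables, and threshold are hypothetical placeholders, not the paper's actual procedure.

```python
# Illustrative sketch of a candidate-reduction pipeline like the one the
# summary describes. The scoring functions and the 0.5 threshold are
# hypothetical assumptions for illustration only.

def deduplicate(candidates: list[str]) -> list[str]:
    """Drop duplicates after light normalization, preserving first occurrence."""
    seen, unique = set(), []
    for text in candidates:
        key = " ".join(text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

def filter_benchmarks(
    candidates: list[str],
    diagnosticity,                 # callable: str -> float; higher = more diagnostic of uplift
    passes_qc,                     # callable: str -> bool; general quality control
    min_diagnosticity: float = 0.5,
) -> list[str]:
    """Apply the three stages in order: de-duplicate, keep sufficiently
    diagnostic candidates, then keep only those passing quality control."""
    unique = deduplicate(candidates)
    diagnostic = [c for c in unique if diagnosticity(c) >= min_diagnosticity]
    return [c for c in diagnostic if passes_qc(c)]
```

In practice the diagnosticity and QC stages would be human or model-assisted judgments; the point of the sketch is only the staged-filter structure that shrinks a large candidate pool to a final benchmark set.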
arXiv Detail & Related papers (2025-12-09T10:24:25Z) - Benchmarking and Evaluation of AI Models in Biology: Outcomes and Recommendations from the CZI Virtual Cells Workshop [18.00029758641004]
We aim to accelerate the development of robust benchmarks for AI-driven Virtual Cells. These benchmarks are crucial for ensuring rigor and biological relevance. They will advance the field toward integrated models that drive new discoveries, therapeutic insights, and a deeper understanding of cellular systems.
arXiv Detail & Related papers (2025-07-14T17:25:28Z) - GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z) - Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety [296.5392512998251]
We present a comprehensive taxonomy of safety threats to large models, including adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt injection attacks, energy-latency attacks, data and model extraction attacks, and emerging agent-specific threats. We identify and discuss the open challenges in large model safety, emphasizing the need for comprehensive safety evaluations, scalable and effective defense mechanisms, and sustainable data practices.
arXiv Detail & Related papers (2025-02-02T05:14:22Z) - The Reality of AI and Biorisk [24.945718952309157]
It is necessary to have both a sound theoretical threat model for how AI models or systems could increase biorisk and a robust method for testing that threat model. This paper provides an analysis of existing available research surrounding two AI and biorisk threat models.
arXiv Detail & Related papers (2024-12-02T20:14:46Z) - EARBench: Towards Evaluating Physical Risk Awareness for Task Planning of Foundation Model-based Embodied AI Agents [53.717918131568936]
Embodied artificial intelligence (EAI) integrates advanced AI models into physical entities for real-world interaction. Foundation models serving as the "brain" of EAI agents for high-level task planning have shown promising results. However, the deployment of these agents in physical environments presents significant safety challenges. This study introduces EARBench, a novel framework for automated physical risk assessment in EAI scenarios.
arXiv Detail & Related papers (2024-08-08T13:19:37Z) - GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z) - An Evaluation of Large Language Models in Bioinformatics Research [52.100233156012756]
We study the performance of large language models (LLMs) on a wide spectrum of crucial bioinformatics tasks.
These tasks include the identification of potential coding regions, extraction of named entities for genes and proteins, detection of antimicrobial and anti-cancer peptides, molecular optimization, and resolution of educational bioinformatics problems.
Our findings indicate that, given appropriate prompts, LLMs like GPT variants can successfully handle most of these tasks.
arXiv Detail & Related papers (2024-02-21T11:27:31Z) - Bio-SIEVE: Exploring Instruction Tuning Large Language Models for Systematic Review Automation [6.452837513222072]
Large Language Models (LLMs) can support and be trained to perform literature screening for medical systematic reviews.
Our best model, Bio-SIEVE, outperforms both ChatGPT and trained traditional approaches.
We see Bio-SIEVE as an important step towards specialising LLMs for the biomedical systematic review process.
arXiv Detail & Related papers (2023-08-12T16:56:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.