Biothreat Benchmark Generation Framework for Evaluating Frontier AI Models II: Benchmark Generation Process
- URL: http://arxiv.org/abs/2512.08451v1
- Date: Tue, 09 Dec 2025 10:24:25 GMT
- Title: Biothreat Benchmark Generation Framework for Evaluating Frontier AI Models II: Benchmark Generation Process
- Authors: Gary Ackerman, Zachary Kallenborn, Anna Wetzel, Hayley Peterson, Jenna LaTourette, Olivia Shoemaker, Brandon Behlendorf, Sheriff Almakki, Doug Clifford, Noah Sheinbaum
- Abstract summary: This paper describes the second component of a novel Biothreat Benchmark Generation framework: the generation of the Bacterial Biothreat Benchmark dataset. The development process involved three complementary approaches: 1) web-based prompt generation, 2) red teaming, and 3) mining existing benchmark corpora. A process of de-duplication, followed by an assessment of uplift diagnosticity and general quality control measures, reduced the candidates to a set of 1,010 final benchmarks.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The potential for rapidly-evolving frontier artificial intelligence (AI) models, especially large language models (LLMs), to facilitate bioterrorism or access to biological weapons has generated significant policy, academic, and public concern. Both model developers and policymakers seek to quantify and mitigate any risk, with an important element of such efforts being the development of model benchmarks that can assess the biosecurity risk posed by a particular model. This paper, the second in a series of three, describes the second component of a novel Biothreat Benchmark Generation (BBG) framework: the generation of the Bacterial Biothreat Benchmark (B3) dataset. The development process involved three complementary approaches: 1) web-based prompt generation, 2) red teaming, and 3) mining existing benchmark corpora, which together generated over 7,000 potential benchmarks linked to the Task-Query Architecture developed during the first component of the project. A process of de-duplication, followed by an assessment of uplift diagnosticity and general quality control measures, reduced the candidates to a set of 1,010 final benchmarks. This procedure ensured that the final benchmarks are a) diagnostic in terms of providing uplift; b) directly relevant to biosecurity threats; and c) aligned with a larger biosecurity architecture that permits nuanced analysis at multiple levels.
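The abstract describes the winnowing pipeline (generate, de-duplicate, assess uplift diagnosticity, apply quality control) only in prose. The sketch below shows one way such a pipeline could be structured; the class and function names, the thresholds, and the similarity-based duplicate test are all illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: the paper does not publish code, so every name,
# threshold, and the similarity-based duplicate test here is an assumption.
from dataclasses import dataclass


@dataclass
class CandidateBenchmark:
    prompt: str          # the benchmark question posed to the model
    source: str          # "web", "red_team", or "mined" (the three approaches)
    task_query_id: str   # link into the Task-Query Architecture (paper I)
    uplift_score: float  # assessed diagnosticity of uplift, 0..1


def deduplicate(candidates, similarity, threshold=0.9):
    """Greedily drop near-duplicates; `similarity` is any pairwise text metric."""
    kept = []
    for cand in candidates:
        if all(similarity(cand.prompt, k.prompt) < threshold for k in kept):
            kept.append(cand)
    return kept


def filter_for_uplift(candidates, min_uplift=0.5):
    """Keep only benchmarks judged diagnostic of real-world uplift."""
    return [c for c in candidates if c.uplift_score >= min_uplift]


def quality_control(candidates, checks):
    """Apply general QC predicates (e.g. clarity, relevance, answerability)."""
    return [c for c in candidates if all(check(c) for check in checks)]


# Applied in sequence, this mirrors the winnowing the abstract describes:
# roughly 7,000 candidates in, 1,010 benchmarks out.
```

At the scale the paper reports (~7,000 candidates), even the quadratic greedy de-duplication shown here is tractable; the paper's quality-control step suggests the real pipeline also involved human review rather than purely automated checks.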
Related papers
- TokaMark: A Comprehensive Benchmark for MAST Tokamak Plasma Models
TokaMark is a structured benchmark to evaluate AI models on real experimental data collected from the Mega Ampere Spherical Tokamak (MAST). TokaMark aims to accelerate progress in data-driven AI-based plasma modeling, contributing to the broader goal of achieving sustainable and stable fusion energy.
arXiv Detail & Related papers (2026-02-05T16:49:44Z)
- Biothreat Benchmark Generation Framework for Evaluating Frontier AI Models III: Implementing the Bacterial Biothreat Benchmark (B3) Dataset
This paper discusses the pilot implementation of the Bacterial Biothreat Benchmark (B3) dataset. It is the third in a series of three papers describing an overall Biothreat Benchmark Generation (BBG) framework. Overall, the pilot demonstrated that the B3 dataset offers a viable, nuanced method for rapidly assessing the biosecurity risk posed by an LLM.
arXiv Detail & Related papers (2025-12-09T10:31:02Z)
- Biothreat Benchmark Generation Framework for Evaluating Frontier AI Models I: The Task-Query Architecture
This paper describes the first component of a novel Biothreat Benchmark Generation (BBG) Framework. The BBG approach is designed to help model developers and evaluators reliably measure and assess the biosecurity risk uplift and general harm potential of existing and future AI models. As a pilot, the BBG is first being developed to address bacterial biological threats only.
arXiv Detail & Related papers (2025-12-09T00:16:44Z)
- Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance?
Current AI evaluation practices depend heavily on established benchmarks. This research addresses the urgent need to quantify this "benchmark-regulation gap". Our findings reveal a profound misalignment: the evaluation ecosystem dedicates the vast majority of its focus to a narrow set of behavioral propensities.
arXiv Detail & Related papers (2025-08-07T15:03:39Z)
- Benchmarking and Evaluation of AI Models in Biology: Outcomes and Recommendations from the CZI Virtual Cells Workshop
We aim to accelerate the development of robust benchmarks for AI-driven Virtual Cells. These benchmarks are crucial for ensuring rigor and biological relevance. They will advance the field toward integrated models that drive new discoveries, therapeutic insights, and a deeper understanding of cellular systems.
arXiv Detail & Related papers (2025-07-14T17:25:28Z)
- DisProtBench: A Disorder-Aware, Task-Rich Benchmark for Evaluating Protein Structure Prediction in Realistic Biological Contexts
DisProtBench is a benchmark for evaluating protein structure prediction models (PSPMs) under structural disorder and complex biological conditions. DisProtBench spans three key axes: data complexity, task diversity, and interpretability. Results reveal significant variability in model robustness under disorder, with low-confidence regions linked to functional prediction failures.
arXiv Detail & Related papers (2025-06-18T23:58:22Z)
- LLMs Outperform Experts on Challenging Biology Benchmarks
This study systematically evaluates 27 frontier Large Language Models on eight biology benchmarks. Top model performance increased more than 4-fold on the challenging text-only subset of the Virology Capabilities Test. Several models now match or exceed expert-level performance on other challenging benchmarks.
arXiv Detail & Related papers (2025-05-09T15:05:57Z)
- AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons
This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories.
arXiv Detail & Related papers (2025-02-19T05:58:52Z)
- GENERator: A Long-Context Generative Genomic Foundation Model
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
- SeCodePLT: A Unified Platform for Evaluating the Security of Code GenAI
Existing benchmarks for evaluating the security risks and capabilities of code-generating large language models (LLMs) face several key limitations. We introduce a general and scalable benchmark construction framework that begins with manually validated, high-quality seed examples and expands them via targeted mutations; a sketch of this seed-and-mutate pattern follows this entry. Applying this framework to Python, C/C++, and Java, we build SeCodePLT, a dataset of more than 5.9k samples spanning 44 CWE-based risk categories and three security capabilities.
arXiv Detail & Related papers (2024-10-14T21:17:22Z)
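SeCodePLT's construction pattern (manually validated seeds expanded via targeted mutations) can be sketched compactly. The mutation operator and all names below are hypothetical illustrations under that stated assumption, not SeCodePLT's actual code.

```python
# Hypothetical sketch of seed-and-mutate benchmark expansion; the mutation
# operator and all names are illustrative, not SeCodePLT's implementation.
import random


def rename_identifiers(sample: str) -> str:
    """A toy 'targeted mutation': rewrite an identifier to vary the sample."""
    return sample.replace("buf", "buffer")


def expand_seeds(seeds, mutations, per_seed=3, rng_seed=0):
    """Grow a validated seed set by applying targeted mutations to each seed."""
    rng = random.Random(rng_seed)
    dataset = list(seeds)  # keep the manually validated originals
    for sample in seeds:
        for _ in range(per_seed):
            mutate = rng.choice(mutations)
            dataset.append(mutate(sample))
    return dataset


# Example: expand_seeds(["strcpy(buf, input);"], [rename_identifiers])
```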
- GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of how the interplay between model architecture and dataset characteristics affects task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z)