LLMs Outperform Experts on Challenging Biology Benchmarks
- URL: http://arxiv.org/abs/2505.06108v3
- Date: Wed, 21 May 2025 20:34:27 GMT
- Title: LLMs Outperform Experts on Challenging Biology Benchmarks
- Authors: Lennart Justen
- Abstract summary: This study systematically evaluates 27 frontier Large Language Models on eight biology benchmarks. Top model performance increased more than 4-fold on the challenging text-only subset of the Virology Capabilities Test. Several models now match or exceed expert-level performance on other challenging benchmarks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study systematically evaluates 27 frontier Large Language Models on eight biology benchmarks spanning molecular biology, genetics, cloning, virology, and biosecurity. Models from major AI developers released between November 2022 and April 2025 were assessed through ten independent runs per benchmark. The findings reveal dramatic improvements in biological capabilities. Top model performance increased more than 4-fold on the challenging text-only subset of the Virology Capabilities Test over the study period, with OpenAI's o3 now performing twice as well as expert virologists. Several models now match or exceed expert-level performance on other challenging benchmarks, including the biology subsets of GPQA and WMDP and LAB-Bench CloningScenarios. Contrary to expectations, chain-of-thought did not substantially improve performance over zero-shot evaluation, while extended reasoning features in o3-mini and Claude 3.7 Sonnet typically improved performance as predicted by inference scaling. Benchmarks such as PubMedQA and the MMLU and WMDP biology subsets exhibited performance plateaus well below 100%, suggesting benchmark saturation and errors in the underlying benchmark data. The analysis highlights the need for more sophisticated evaluation methodologies as AI systems continue to advance.
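The protocol described above (ten independent runs per model-benchmark pair, with sub-100% plateaus read as possible saturation) reduces to a small aggregation loop. The following Python sketch is a minimal illustration under stated assumptions: `run_model`, the `0.95` ceiling, and the plateau window are hypothetical stand-ins, not the study's actual harness or criteria.

```python
"""Minimal sketch of the evaluation setup described in the abstract
(assumptions, not the authors' actual harness): each model is scored
over several independent runs per benchmark, and a benchmark is flagged
as possibly saturated when top scores plateau below 100%."""

import statistics

N_RUNS = 10  # the study reports ten independent runs per benchmark


def evaluate(model: str, benchmark: str, run_model) -> tuple[float, float]:
    """Mean accuracy and spread across independent runs.

    `run_model(model, benchmark, seed)` is a hypothetical callable that
    executes one evaluation run and returns accuracy in [0, 1].
    """
    scores = [run_model(model, benchmark, seed=i) for i in range(N_RUNS)]
    return statistics.mean(scores), statistics.stdev(scores)


def looks_saturated(top_scores_by_release: list[float],
                    ceiling: float = 0.95, window: int = 3) -> bool:
    """Heuristic plateau check (illustrative thresholds): the best score
    over the most recent `window` model releases stops improving while
    staying below `ceiling`, the pattern the abstract attributes to
    saturation or errors in the underlying benchmark data."""
    if len(top_scores_by_release) < 2 * window:
        return False
    earlier = max(top_scores_by_release[:-window])
    recent = max(top_scores_by_release[-window:])
    return recent < ceiling and (recent - earlier) < 0.01
```

For example, passing `looks_saturated` the best per-release scores on a PubMedQA-style benchmark would flag exactly the case the abstract highlights: top scores that stop improving well short of the ceiling.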
Related papers
- When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation [80.66788281323414]
We analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers. Our analysis reveals that nearly half of the benchmarks exhibit saturation, with rates increasing as benchmarks age. Expert-curated benchmarks resist saturation better than crowdsourced ones.
arXiv Detail & Related papers (2026-02-18T16:51:37Z) - BABE: Biology Arena BEnchmark [51.53220868983288]
BABE is a benchmark designed to evaluate the experimental reasoning capabilities of biological AI systems. Our benchmark provides a robust framework for assessing how well AI systems can reason like practicing scientists.
arXiv Detail & Related papers (2026-02-05T16:39:20Z) - EQ-5D Classification Using Biomedical Entity-Enriched Pre-trained Language Models and Multiple Instance Learning [0.42970700836450487]
In health economics, systematic literature reviews depend on the correct identification of publications that use the EQ-5D. Manual screening of large volumes of scientific literature is time-consuming, error-prone, and inconsistent. In this study, we investigate fine-tuning of general-purpose (BERT) and domain-specific (SciBERT, BioBERT) pre-trained language models.
arXiv Detail & Related papers (2026-01-30T20:10:34Z) - Investigating the Impact of Histopathological Foundation Models on Regressive Prediction of Homologous Recombination Deficiency [52.50039435394964]
We systematically evaluate foundation models for regression-based tasks. We extract patch-level features from whole slide images (WSI) using five state-of-the-art foundation models. Models are trained to predict continuous HRD scores based on these extracted features across breast, endometrial, and lung cancer cohorts.
arXiv Detail & Related papers (2026-01-29T14:06:50Z) - BioPulse-QA: A Dynamic Biomedical Question-Answering Benchmark for Evaluating Factuality, Robustness, and Bias in Large Language Models [7.8780007697387235]
We introduce BioPulse-QA, a benchmark that evaluates large language models (LLMs) on answering questions from newly published biomedical documents. We evaluate four LLMs, including GPT-o1, Gemini-2.0-Flash, and LLaMA-3.1 8B Instruct, all released prior to the publication dates of the benchmark documents.
arXiv Detail & Related papers (2026-01-19T00:38:33Z) - Biothreat Benchmark Generation Framework for Evaluating Frontier AI Models II: Benchmark Generation Process [0.38186458149494623]
This paper describes the second component of a novel Biothreat Benchmark Generation framework: the generation of the Bacterial Biothreat Benchmark dataset. The development process involved three complementary approaches: 1) web-based prompt generation, 2) red teaming, and 3) mining existing benchmark corpora. A process of de-duplication, followed by an assessment of uplift diagnosticity and general quality control measures, reduced the candidates to a set of 1,010 final benchmarks.
arXiv Detail & Related papers (2025-12-09T10:24:25Z) - Automated Identification of Incidentalomas Requiring Follow-Up: A Multi-Anatomy Evaluation of LLM-Based and Supervised Approaches [5.958100741754613]
We evaluated large language models (LLMs) against supervised baselines for fine-grained, lesion-level detection of incidentalomas. We introduced a novel inference strategy using lesion-tagged inputs and anatomy-aware prompting to ground model reasoning. The anatomy-informed GPT-OSS-20b model achieved the highest performance, yielding an incidentaloma-positive macro-F1 of 0.79 (the macro-F1 metric is sketched after this list).
arXiv Detail & Related papers (2025-12-05T08:49:57Z) - SynthPert: Enhancing LLM Biological Reasoning via Synthetic Reasoning Traces for Cellular Perturbation Prediction [3.8698178563798113]
Synthetic reasoning traces effectively distill biological knowledge even when partially inaccurate. This approach enables cross-cell-type generalization with 87% accuracy on unseen RPE1 cells.
arXiv Detail & Related papers (2025-09-29T18:02:41Z) - scE$^2$TM: Toward Interpretable Single-Cell Embedding via Topic Modeling [21.79077173300944]
We present scE$^2$TM, an external knowledge-guided single-cell embedded topic model that provides high-quality cell embeddings and strong interpretability. Our comprehensive evaluation across 20 scRNA-seq datasets demonstrates that scE$^2$TM achieves significant clustering performance gains.
arXiv Detail & Related papers (2025-07-11T07:15:13Z) - DISPROTBENCH: A Disorder-Aware, Task-Rich Benchmark for Evaluating Protein Structure Prediction in Realistic Biological Contexts [76.59606029593085]
DisProtBench is a benchmark for evaluating protein structure prediction models (PSPMs) under structural disorder and complex biological conditions. DisProtBench spans three key axes: data complexity, task diversity, and interpretability. Results reveal significant variability in model robustness under disorder, with low-confidence regions linked to functional prediction failures.
arXiv Detail & Related papers (2025-06-18T23:58:22Z) - Benchmarking AI scientists in omics data-driven biological research [3.3605177939410713]
We introduce the Biological AI Scientist Benchmark (BaisBench) to assess AI scientists' ability to generate biological discoveries. BaisBench comprises two tasks: cell type annotation on 31 expert-labeled single-cell datasets, and scientific discovery through answering 198 multiple-choice questions.
arXiv Detail & Related papers (2025-05-13T08:33:54Z) - CellVerse: Do Large Language Models Really Understand Cell Biology? [74.34984441715517]
We introduce CellVerse, a unified language-centric question-answering benchmark that integrates four types of single-cell multi-omics data. We systematically evaluate the performance of 14 open-source and closed-source LLMs, ranging from 160M to 671B parameters, on CellVerse.
arXiv Detail & Related papers (2025-05-09T06:47:23Z) - MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research [57.61445960384384]
MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities. Benchmarking state-of-the-art MLLMs reveals a peak performance of 53%. Expert analysis of chain-of-thought responses shows perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors.
arXiv Detail & Related papers (2025-03-17T17:33:10Z) - BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology [0.8061245870721293]
Large Language Models (LLMs) and LLM-based agents show great promise in accelerating scientific research. We present the Bioinformatics Benchmark (BixBench), a dataset comprising over 50 real-world scenarios of practical biological data analysis. We evaluate the performance of two frontier LLMs using a custom agent framework that we open-source.
arXiv Detail & Related papers (2025-02-28T18:47:57Z) - The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature. We conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks.
arXiv Detail & Related papers (2024-12-06T23:43:59Z) - SemiHVision: Enhancing Medical Multimodal Models with a Semi-Human Annotated Dataset and Fine-Tuned Instruction Generation [13.672776832197918]
Multimodal large language models (MLLMs) have made significant strides, yet they face challenges in the medical domain due to limited specialized knowledge.
We seek to address this gap at various stages of the end-to-end learning pipeline, including data collection, model fine-tuning, and evaluation.
arXiv Detail & Related papers (2024-10-19T02:35:35Z) - Benchmarking Transcriptomics Foundation Models for Perturbation Analysis: one PCA still rules them all [1.507700065820919]
Recent advancements in transcriptomics sequencing provide new opportunities to uncover valuable insights.
No benchmark yet exists to robustly evaluate the effectiveness of these emerging models for perturbation analysis.
This article presents a novel biologically motivated evaluation framework and a hierarchy of perturbation analysis tasks.
arXiv Detail & Related papers (2024-10-17T18:27:51Z) - Phikon-v2, A large and public feature extractor for biomarker prediction [42.52549987351643]
We train a vision transformer using DINOv2 and publicly release one iteration of this model for further experimentation, coined Phikon-v2.
While trained on publicly available histology slides, Phikon-v2 surpasses our previously released model (Phikon) and performs on par with other histopathology foundation models (FM) trained on proprietary data.
arXiv Detail & Related papers (2024-09-13T20:12:29Z) - Leveraging Vision Language Models for Specialized Agricultural Tasks [19.7240633020344]
We present AgEval, a benchmark for assessing Vision Language Models' capabilities in plant stress phenotyping. Our study explores how general-purpose VLMs can be leveraged for domain-specific tasks with only a few annotated examples. Our results demonstrate VLMs' rapid adaptability to specialized tasks, with the best-performing model showing an increase in F1 scores from 46.24% to 73.37% in 8-shot identification.
arXiv Detail & Related papers (2024-07-29T00:39:51Z) - GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z) - ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab [67.24684071577211]
The challenge of replicating research results has posed a significant impediment to the field of molecular biology.
We first curate a comprehensive multimodal dataset, named ProBio, as an initial step towards this objective.
Next, we devise two challenging benchmarks, transparent solution tracking and multimodal action recognition, to emphasize the unique characteristics and difficulties associated with activity understanding in BioLab settings.
arXiv Detail & Related papers (2023-11-01T14:44:01Z) - BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks [68.39821375903591]
Generalist AI holds the potential to address such limitations through its versatility in interpreting diverse data types.
Here, we propose BiomedGPT, the first open-source and lightweight vision-language foundation model.
arXiv Detail & Related papers (2023-05-26T17:14:43Z)
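Several entries above report F1-based results; the incidentaloma entry in particular cites a macro-averaged F1 of 0.79. As a minimal sketch of that metric, assuming scikit-learn and invented toy labels (none of this comes from the papers themselves):

```python
# Toy illustration of the macro-F1 metric cited above (e.g., the
# incidentaloma macro-F1 of 0.79). Labels and classes are invented
# for the example, not taken from any of the listed papers.
from sklearn.metrics import f1_score

y_true = ["healthy", "stressed", "healthy", "stressed", "healthy"]
y_pred = ["healthy", "healthy", "healthy", "stressed", "stressed"]

# average="macro" computes F1 per class, then takes the unweighted mean.
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"macro-F1 = {macro_f1:.2f}")
```

Because macro-averaging weighs every class equally, performance on rare classes (such as incidentaloma-positive findings) counts as much as performance on common ones.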