Related papers: BABE: Biology Arena BEnchmark

URL: http://arxiv.org/abs/2602.05857v1
Date: Thu, 05 Feb 2026 16:39:20 GMT
Title: BABE: Biology Arena BEnchmark
Authors: Junting Zhou, Jin Chen, Linfeng Hao, Denghui Cao, Zheyu Wang, Qiguang Chen, Chaoyou Fu, Jiaze Chen, Yuchen Wu, Ge Zhang, Mingxuan Wang, Wenhao Huang, Tong Yang
Abstract summary: BABE is a benchmark designed to evaluate the experimental reasoning capabilities of biological AI systems. Our benchmark provides a robust framework for assessing how well AI systems can reason like practicing scientists.
Score: 51.53220868983288
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid evolution of large language models (LLMs) has expanded their capabilities from basic dialogue to advanced scientific reasoning. However, existing benchmarks in biology often fail to assess a critical skill required of researchers: the ability to integrate experimental results with contextual knowledge to derive meaningful conclusions. To address this gap, we introduce BABE (Biology Arena BEnchmark), a comprehensive benchmark designed to evaluate the experimental reasoning capabilities of biological AI systems. BABE is uniquely constructed from peer-reviewed research papers and real-world biological studies, ensuring that tasks reflect the complexity and interdisciplinary nature of actual scientific inquiry. BABE challenges models to perform causal reasoning and cross-scale inference. Our benchmark provides a robust framework for assessing how well AI systems can reason like practicing scientists, offering a more authentic measure of their potential to contribute to biological research.

Related papers

SC-Arena: A Natural Language Benchmark for Single-Cell Reasoning with Knowledge-Augmented Evaluation [24.956743572453153]
We present SC-ARENA, a natural language evaluation framework tailored to single-cell foundation models. SC-ARENA formalizes a virtual cell abstraction that unifies evaluation targets by representing both intrinsic attributes and gene-level interactions.
arXiv Detail & Related papers (2026-02-26T16:50:28Z)
Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows [203.3527268311731]
We present an operational SGI definition grounded in the Practical Inquiry Model (PIM). We operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. Our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
arXiv Detail & Related papers (2025-12-18T12:44:36Z)
Cross-Disciplinary Knowledge Retrieval and Synthesis: A Compound AI Architecture for Scientific Discovery [1.5143261755366868]
BioSage is a novel compound AI architecture that integrates LLMs with RAG, orchestrated specialized agents and tools to enable discoveries across AI, data science, biomedical, and biosecurity domains. Our system features several specialized agents including the retrieval agent with query planning and response synthesis that enable knowledge retrieval across domains with citation-backed responses. Our ongoing work focuses on multimodal retrieval and reasoning over charts, tables, and structured scientific data, along with developing comprehensive multimodal benchmarks for cross-disciplinary discovery.
arXiv Detail & Related papers (2025-11-23T05:33:11Z)
Hypothesis Hunting with Evolving Networks of Autonomous Scientific Agents [52.50038914857797]
We term this process hypothesis hunting: the cumulative search for insight through sustained exploration across vast and complex hypothesis spaces. We introduce AScience, a framework modeling discovery as the interaction of agents, networks, and evaluation norms, and implement it as ASCollab. Experiments show that such social dynamics enable the accumulation of expert-rated results along the diversity-quality-novelty frontier.
arXiv Detail & Related papers (2025-10-08T08:47:07Z)
Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning [53.82037883518254]
We introduce SciReas, a diverse suite of existing benchmarks for scientific reasoning tasks. We then propose KRUX, a probing framework for studying the distinct roles of reasoning and knowledge in scientific tasks.
arXiv Detail & Related papers (2025-08-26T17:04:23Z)
BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model [12.528834366422466]
BioReason learns to produce logical, biologically coherent deductions. It boosts KEGG-based disease pathway prediction accuracy from 86% to 98%. It also improves variant effect prediction by an average of 15% over strong baselines.
arXiv Detail & Related papers (2025-05-29T15:49:27Z)
Benchmarking AI scientists in omics data-driven biological research [3.3605177939410713]
We introduce the Biological AI Scientist Benchmark (BaisBench) to assess AI scientists' ability to generate biological discoveries. BaisBench comprises two tasks: cell type annotation on 31 expert-labeled single-cell datasets, and scientific discovery through answering 198 multiple-choice questions.
arXiv Detail & Related papers (2025-05-13T08:33:54Z)
BioMaze: Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning [49.487327661584686]
We introduce BioMaze, a dataset with 5.1K complex pathway problems from real research. Our evaluation of methods such as CoT and graph-augmented reasoning shows that LLMs struggle with pathway reasoning. To address this, we propose PathSeeker, an LLM agent that enhances reasoning through interactive subgraph-based navigation.
arXiv Detail & Related papers (2025-02-23T17:38:10Z)
LAB-Bench: Measuring Capabilities of Language Models for Biology Research [1.6312096924271486]
We introduce the Language Agent Biology Benchmark (LAB-Bench). It is a dataset of over 2,400 multiple-choice questions for evaluating AI systems on a range of practical biology research capabilities. We measure performance of several frontier language models against our benchmark and report results compared to human expert biology researchers.
arXiv Detail & Related papers (2024-07-14T23:52:25Z)
ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab [67.24684071577211]
The challenge of replicating research results has posed a significant impediment to the field of molecular biology. We first curate a comprehensive multimodal dataset, named ProBio, as an initial step towards this objective. Next, we devise two challenging benchmarks, transparent solution tracking and multimodal action recognition, to emphasize the unique characteristics and difficulties associated with activity understanding in BioLab settings.
arXiv Detail & Related papers (2023-11-01T14:44:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.