BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning
- URL: http://arxiv.org/abs/2505.07889v2
- Date: Thu, 29 May 2025 07:31:28 GMT
- Title: BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning
- Authors: Yuyang Liu, Liuzhenghao Lv, Xiancheng Zhang, Li Yuan, Yonghong Tian,
- Abstract summary: We present BioProBench, the first large-scale, multi-task benchmark for biological protocol understanding and reasoning. Built upon 27K original protocols, it yields nearly 556K high-quality structured instances.
- Score: 31.739027752007928
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Biological protocols are fundamental to reproducibility and safety in life science research. While large language models (LLMs) perform well on general tasks, their systematic evaluation on these highly specialized, accuracy-critical, and inherently procedural texts remains limited. In this work, we present BioProBench, the first large-scale, multi-task benchmark for biological protocol understanding and reasoning. While several existing benchmarks include protocol question answering, BioProBench provides a comprehensive suite of five core tasks: Protocol Question Answering, Step Ordering, Error Correction, Protocol Generation, and Protocol Reasoning, enabling a holistic evaluation of LLMs on procedural biological texts. Built upon 27K original protocols, it yields nearly 556K high-quality structured instances. We evaluate 12 mainstream open/closed-source LLMs. Experimental results reveal that some models perform well on basic understanding tasks (e.g., ~70% PQA-Acc., >64% ERR F1), but struggle significantly with deep reasoning and structured generation tasks like ordering and generation. Furthermore, model comparisons show diverse performance: certain open-source models approach closed-source levels on some tasks, yet bio-specific small models lag behind general LLMs, indicating limitations on complex procedural content. Overall, BioProBench, through its task design and experimental findings, systematically reveals the fundamental challenges for current LLMs in procedural knowledge understanding, deep adaptability to specific domains, reliability of structured reasoning, and handling of sophisticated precision and safety constraints, providing key directions for future AI in the field of scientific experiment automation. The code and data are available at: https://github.com/YuyangSunshine/bioprotocolbench and https://huggingface.co/datasets/BioProBench/BioProBench.
Related papers
- MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research [57.61445960384384]
MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities. Benchmarking state-of-the-art MLLMs reveals a peak performance of 53%. Expert analysis of chain-of-thought responses shows that perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors.
arXiv Detail & Related papers (2025-03-17T17:33:10Z) - Benchmarking Large Language Models on Multiple Tasks in Bioinformatics NLP with Prompting [17.973195066083797]
Large language models (LLMs) have become important tools in solving biological problems. We introduce a comprehensive prompting-based benchmarking framework, termed Bio-benchmark. We evaluate six mainstream LLMs, including GPT-4o and Llama-3.1-70b, using 0-shot and few-shot Chain-of-Thought settings.
arXiv Detail & Related papers (2025-03-06T02:01:59Z) - BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology [0.8061245870721293]
Large Language Models (LLMs) and LLM-based agents show great promise in accelerating scientific research. We present the Bioinformatics Benchmark (BixBench), a dataset comprising over 50 real-world scenarios of practical biological data analysis. We evaluate the performance of two frontier LLMs using a custom agent framework we open-source.
arXiv Detail & Related papers (2025-02-28T18:47:57Z) - BioMaze: Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning [49.487327661584686]
We introduce BioMaze, a dataset with 5.1K complex pathway problems from real research. Our evaluation of methods such as CoT and graph-augmented reasoning shows that LLMs struggle with pathway reasoning. To address this, we propose PathSeeker, an LLM agent that enhances reasoning through interactive subgraph-based navigation.
arXiv Detail & Related papers (2025-02-23T17:38:10Z) - Biology Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models [51.316001071698224]
We introduce Biology-Instructions, the first large-scale multi-omics instruction-tuning dataset for biological sequences. This dataset can bridge the gap between large language models (LLMs) and complex biological sequence-related tasks. We also develop a strong baseline called ChatMultiOmics with a novel three-stage training pipeline.
arXiv Detail & Related papers (2024-12-26T12:12:23Z) - COMET: Benchmark for Comprehensive Biological Multi-omics Evaluation Tasks and Language Models [56.81513758682858]
COMET aims to evaluate models across single-omics, cross-omics, and multi-omics tasks. First, we curate and develop a diverse collection of downstream tasks and datasets covering key structural and functional aspects of DNA, RNA, and proteins. Then, we evaluate existing foundational language models for DNA, RNA, and proteins, as well as the newly proposed multi-omics method.
arXiv Detail & Related papers (2024-12-13T18:42:00Z) - ProtoMed-LLM: An Automatic Evaluation Framework for Large Language Models in Medical Protocol Formulation [0.5266869303483376]
Large Language Models (LLMs) excel at Scientific Protocol Formulation Tasks (SPFT). We propose ProtoMed-LLM, a flexible, automatic framework to evaluate LLMs' capability on SPFT. We evaluate GPT variants, Llama, Mixtral, Gemma, Cohere, and Gemini.
arXiv Detail & Related papers (2024-10-06T19:28:55Z) - LAB-Bench: Measuring Capabilities of Language Models for Biology Research [1.6312096924271486]
We introduce the Language Agent Biology Benchmark (LAB-Bench), a dataset of over 2,400 multiple-choice questions for evaluating AI systems on a range of practical biology research capabilities.
We measure performance of several frontier language models against our benchmark and report results compared to human expert biology researchers.
arXiv Detail & Related papers (2024-07-14T23:52:25Z) - An Evaluation of Large Language Models in Bioinformatics Research [52.100233156012756]
We study the performance of large language models (LLMs) on a wide spectrum of crucial bioinformatics tasks.
These tasks include the identification of potential coding regions, extraction of named entities for genes and proteins, detection of antimicrobial and anti-cancer peptides, molecular optimization, and resolution of educational bioinformatics problems.
Our findings indicate that, given appropriate prompts, LLMs like GPT variants can successfully handle most of these tasks.
arXiv Detail & Related papers (2024-02-21T11:27:31Z) - ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab [67.24684071577211]
The challenge of replicating research results has posed a significant impediment to the field of molecular biology.
We first curate a comprehensive multimodal dataset, named ProBio, as an initial step towards this objective.
Next, we devise two challenging benchmarks, transparent solution tracking and multimodal action recognition, to emphasize the unique characteristics and difficulties associated with activity understanding in BioLab settings.
arXiv Detail & Related papers (2023-11-01T14:44:01Z) - BioPlanner: Automatic Evaluation of LLMs on Protocol Planning in Biology [41.952424120054914]
Large Language Models (LLMs) have impressive capabilities on a wide range of tasks.
We present an automatic evaluation framework for the task of planning experimental protocols.
We evaluate GPT-3 and GPT-4 on this task and explore their robustness.
arXiv Detail & Related papers (2023-10-16T17:54:20Z) - BELB: a Biomedical Entity Linking Benchmark [3.9648178546218817]
We review recent work in the field and find that the task is absent from existing benchmarks for biomedical text mining.
We develop BELB, a Biomedical Entity Linking Benchmark, providing access in a unified format to 11 corpora linked to 7 knowledge bases.
Using BELB we perform an extensive evaluation of six rule-based entity-specific systems and three recent neural approaches leveraging pre-trained language models.
arXiv Detail & Related papers (2023-08-22T16:05:18Z) - Benchmarking large language models for biomedical natural language processing applications and recommendations [22.668383945059762]
Large Language Models (LLMs) have shown promise in general domains. We compare their zero-shot, few-shot, and fine-tuning performance with traditional fine-tuning of BERT or BART models. We find issues like missing information and hallucinations in LLM outputs.
arXiv Detail & Related papers (2023-05-10T13:40:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.