LAB-Bench: Measuring Capabilities of Language Models for Biology Research
- URL: http://arxiv.org/abs/2407.10362v3
- Date: Wed, 17 Jul 2024 17:28:36 GMT
- Title: LAB-Bench: Measuring Capabilities of Language Models for Biology Research
- Authors: Jon M. Laurent, Joseph D. Janizek, Michael Ruzo, Michaela M. Hinks, Michael J. Hammerling, Siddharth Narayanan, Manvitha Ponnapati, Andrew D. White, Samuel G. Rodriques,
- Abstract summary: We introduce the Language Agent Biology Benchmark (LAB-Bench)
It is a dataset of over 2,400 multiple choice questions for evaluating AI systems on a range of practical biology research capabilities.
We measure performance of several frontier language models against our benchmark and report results compared to human expert biology researchers.
- Score: 1.6312096924271486
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: There is widespread optimism that frontier Large Language Models (LLMs) and LLM-augmented systems have the potential to rapidly accelerate scientific discovery across disciplines. Today, many benchmarks exist to measure LLM knowledge and reasoning on textbook-style science questions, but few if any benchmarks are designed to evaluate language model performance on practical tasks required for scientific research, such as literature search, protocol planning, and data analysis. As a step toward building such benchmarks, we introduce the Language Agent Biology Benchmark (LAB-Bench), a broad dataset of over 2,400 multiple choice questions for evaluating AI systems on a range of practical biology research capabilities, including recall and reasoning over literature, interpretation of figures, access and navigation of databases, and comprehension and manipulation of DNA and protein sequences. Importantly, in contrast to previous scientific benchmarks, we expect that an AI system that can achieve consistently high scores on the more difficult LAB-Bench tasks would serve as a useful assistant for researchers in areas such as literature search and molecular cloning. As an initial assessment of the emergent scientific task capabilities of frontier language models, we measure performance of several against our benchmark and report results compared to human expert biology researchers. We will continue to update and expand LAB-Bench over time, and expect it to serve as a useful tool in the development of automated research systems going forward. A public subset of LAB-Bench is available for use at the following URL: https://huggingface.co/datasets/futurehouse/lab-bench
Related papers
- IdeaBench: Benchmarking Large Language Models for Research Idea Generation [19.66218274796796]
Large Language Models (LLMs) have transformed how people interact with artificial intelligence (AI) systems.
We propose IdeaBench, a benchmark system that includes a comprehensive dataset and an evaluation framework.
Our dataset comprises titles and abstracts from a diverse range of influential papers, along with their referenced works.
Our evaluation framework is a two-stage process: first, using GPT-4o to rank ideas based on user-specified quality indicators such as novelty and feasibility, enabling scalable personalization.
arXiv Detail & Related papers (2024-10-31T17:04:59Z) - CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark [11.794931453828974]
CORE-Bench is a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine)
We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way.
The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks.
arXiv Detail & Related papers (2024-09-17T17:13:19Z) - DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? [58.330879414174476]
We introduce DSBench, a benchmark designed to evaluate data science agents with realistic tasks.
This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions.
Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG)
arXiv Detail & Related papers (2024-09-12T02:08:00Z) - Diagnostic Reasoning in Natural Language: Computational Model and Application [68.47402386668846]
We investigate diagnostic abductive reasoning (DAR) in the context of language-grounded tasks (NL-DAR)
We propose a novel modeling framework for NL-DAR based on Pearl's structural causal models.
We use the resulting dataset to investigate the human decision-making process in NL-DAR.
arXiv Detail & Related papers (2024-09-09T06:55:37Z) - MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
This dataset includes figures such as schematic diagrams, simulated images, macroscopic/microscopic photos, and experimental visualizations.
We developed benchmarks for scientific figure captioning and multiple-choice questions, evaluating six proprietary and over ten open-source models.
The dataset and benchmarks will be released to support further research.
arXiv Detail & Related papers (2024-07-06T00:40:53Z) - BioKGBench: A Knowledge Graph Checking Benchmark of AI Agent for Biomedical Science [43.624608816218505]
BioKGBench is an evaluation benchmark for AI-driven biomedical agents.
We first disentangle "Understanding Literature" into two atomic abilities.
We then formulate a novel agent task, dubbed KGCheck, using KGQA and domain-based Retrieval-Augmented Generation.
We collect over two thousand data for two atomic tasks and 225 high-quality annotated data for the agent task.
arXiv Detail & Related papers (2024-06-29T15:23:28Z) - An Evaluation of Large Language Models in Bioinformatics Research [52.100233156012756]
We study the performance of large language models (LLMs) on a wide spectrum of crucial bioinformatics tasks.
These tasks include the identification of potential coding regions, extraction of named entities for genes and proteins, detection of antimicrobial and anti-cancer peptides, molecular optimization, and resolution of educational bioinformatics problems.
Our findings indicate that, given appropriate prompts, LLMs like GPT variants can successfully handle most of these tasks.
arXiv Detail & Related papers (2024-02-21T11:27:31Z) - Diversifying Knowledge Enhancement of Biomedical Language Models using
Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping requirements in computing power low.
arXiv Detail & Related papers (2023-12-21T14:26:57Z) - Bio-SIEVE: Exploring Instruction Tuning Large Language Models for
Systematic Review Automation [6.452837513222072]
Large Language Models (LLMs) can support and be trained to perform literature screening for medical systematic reviews.
Our best model, Bio-SIEVE, outperforms both ChatGPT and trained traditional approaches.
We see Bio-SIEVE as an important step towards specialising LLMs for the biomedical systematic review process.
arXiv Detail & Related papers (2023-08-12T16:56:55Z) - CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, single-sentence/sentence-pair classification.
We report empirical results with the current 11 pre-trained Chinese models, and experimental results show that state-of-the-art neural models perform by far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z) - Automatic coding of students' writing via Contrastive Representation
Learning in the Wasserstein space [6.884245063902909]
This work is a step towards building a statistical machine learning (ML) method for supporting qualitative analyses of students' writing.
We show that the ML algorithm approached the inter-rater reliability of human analysis.
arXiv Detail & Related papers (2020-11-26T16:52:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.