BioPlanner: Automatic Evaluation of LLMs on Protocol Planning in Biology
- URL: http://arxiv.org/abs/2310.10632v1
- Date: Mon, 16 Oct 2023 17:54:20 GMT
- Title: BioPlanner: Automatic Evaluation of LLMs on Protocol Planning in Biology
- Authors: Odhran O'Donoghue, Aleksandar Shtedritski, John Ginger, Ralph Abboud,
Ali Essa Ghareeb, Justin Booth, Samuel G Rodriques
- Abstract summary: Large Language Models (LLMs) have impressive capabilities on a wide range of tasks.
We present an automatic evaluation framework for the task of planning experimental protocols.
We evaluate GPT-3 and GPT-4 on this task and explore their robustness.
- Score: 41.952424120054914
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to automatically generate accurate protocols for scientific
experiments would represent a major step towards the automation of science.
Large Language Models (LLMs) have impressive capabilities on a wide range of
tasks, such as question answering and the generation of coherent text and code.
However, LLMs can struggle with multi-step problems and long-term planning,
which are crucial for designing scientific experiments. Moreover, evaluation of
the accuracy of scientific protocols is challenging, because experiments can be
described correctly in many different ways, require expert knowledge to
evaluate, and cannot usually be executed automatically. Here we present an
automatic evaluation framework for the task of planning experimental protocols,
and we introduce BioProt: a dataset of biology protocols with corresponding
pseudocode representations. To measure performance on generating scientific
protocols, we use an LLM to convert a natural language protocol into
pseudocode, and then evaluate an LLM's ability to reconstruct the pseudocode
from a high-level description and a list of admissible pseudocode functions. We
evaluate GPT-3 and GPT-4 on this task and explore their robustness. We
externally validate the utility of pseudocode representations of text by
generating accurate novel protocols using retrieved pseudocode, and we run a
generated protocol successfully in our biological laboratory. Our framework is
extensible to the evaluation and improvement of language model planning
abilities in other areas of science or other areas that lack automatic
evaluation.
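The evaluation loop the abstract describes can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the `call_llm` helper, the function names, and the step-level exact-match scoring are all hypothetical placeholders (the real framework may use a richer automatic comparison).

```python
def call_llm(prompt: str) -> list[str]:
    """Placeholder for a real LLM call (e.g. to GPT-3 or GPT-4).

    Returns a fixed reconstruction here so the sketch runs end to end.
    """
    return ["transfer(sample, tube)", "incubate(tube, 37, 30)"]


def evaluate_protocol(description: str,
                      admissible_functions: list[str],
                      reference: list[str]) -> float:
    """Score an LLM's reconstruction of a protocol's pseudocode.

    The model sees only a high-level description and the list of
    admissible pseudocode functions, then must reproduce the steps.
    """
    prompt = (
        f"Task: {description}\n"
        f"Allowed functions: {', '.join(admissible_functions)}\n"
        "Write the protocol as an ordered list of function calls."
    )
    predicted = call_llm(prompt)
    # Simple step-level accuracy against the reference pseudocode.
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


reference = ["transfer(sample, tube)", "incubate(tube, 37, 60)"]
score = evaluate_protocol("Incubate a diluted sample",
                          ["transfer", "incubate"], reference)
# One of the two steps matches the reference, so score == 0.5
```

Comparing against LLM-generated pseudocode rather than free-form text is what makes the evaluation automatic: exact or near-exact matching over a fixed function vocabulary avoids the need for expert graders.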
Related papers
- ChemActor: Enhancing Automated Extraction of Chemical Synthesis Actions with LLM-Generated Data [53.78763789036172]
We present ChemActor, a fully fine-tuned large language model (LLM) as a chemical executor to convert between unstructured experimental procedures and structured action sequences.
This framework integrates a data selection module that selects data based on distribution divergence, with a general-purpose LLM, to generate machine-executable actions from a single molecule input.
Experiments on reaction-to-description (R2D) and description-to-action (D2A) tasks demonstrate that ChemActor achieves state-of-the-art performance, outperforming the baseline model by 10%.

arXiv Detail & Related papers (2025-06-30T05:11:19Z) - BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning [31.739027752007928]
We present BioProBench, the first large-scale, multi-task benchmark for biological protocol understanding and reasoning.
Built upon 27K original protocols, it yields nearly 556K high-quality structured instances.
arXiv Detail & Related papers (2025-05-11T09:42:24Z) - SnipGen: A Mining Repository Framework for Evaluating LLMs for Code [51.07471575337676]
Large Language Models (LLMs) are trained on extensive datasets that include code repositories.
However, evaluating their effectiveness poses significant challenges due to the potential overlap between the datasets used for training and those employed for evaluation.
We introduce SnipGen, a comprehensive repository mining framework designed to leverage prompt engineering across various downstream tasks for code generation.
arXiv Detail & Related papers (2025-02-10T21:28:15Z) - Exploring RL-based LLM Training for Formal Language Tasks with Programmed Rewards [49.7719149179179]
This paper investigates the feasibility of using PPO for reinforcement learning (RL) from explicitly programmed reward signals.
We focus on tasks expressed through formal languages, such as programming, where explicit reward functions can be programmed to automatically assess quality of generated outputs.
Our results show that pure RL-based training for the two formal language tasks is challenging, with success being limited even for the simple arithmetic task.
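The summary above notes that, for formal language tasks, explicit reward functions can be programmed to assess output quality automatically. A minimal sketch of such a programmed reward for the simple arithmetic task is shown below; this is an assumed illustration of the general idea, not the paper's actual reward implementation.

```python
def arithmetic_reward(expression: str, model_answer: str) -> float:
    """Return 1.0 if the model's answer to an arithmetic expression is
    correct, else 0.0. Correctness is checked automatically by
    evaluating the expression, so no human grading is needed.
    """
    try:
        # Evaluate the arithmetic expression with builtins disabled,
        # e.g. "3 + 4" -> 7.
        expected = eval(expression, {"__builtins__": {}})
        return 1.0 if float(model_answer) == expected else 0.0
    except (ValueError, SyntaxError):
        # Malformed expressions or non-numeric answers earn no reward.
        return 0.0
```

A reward signal like this can be fed directly to an RL algorithm such as PPO, since every generated output can be scored without reference answers written by humans.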
arXiv Detail & Related papers (2024-10-22T15:59:58Z) - ProtocoLLM: Automatic Evaluation Framework of LLMs on Domain-Specific Scientific Protocol Formulation Tasks [0.5266869303483376]
Large Language Models (LLMs) excel at Scientific Protocol Formulation Tasks (SPFT).
We propose a flexible, automatic framework to evaluate LLMs' capability on SPFT: ProtocoLLM.
We evaluate GPT variations, Llama, Mixtral, Gemma, Cohere, and Gemini.
arXiv Detail & Related papers (2024-10-06T19:28:55Z) - Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models [54.51932175059004]
We introduce a scalable method for generating synthetic instructions to enhance the code generation capability of Large Language Models.
The proposed algorithm, Genetic-Instruct, mimics evolutionary processes, utilizing self-instruction to create numerous synthetic samples from a limited number of seeds.
arXiv Detail & Related papers (2024-07-29T20:42:59Z) - LAB-Bench: Measuring Capabilities of Language Models for Biology Research [1.6312096924271486]
We introduce the Language Agent Biology Benchmark (LAB-Bench), a dataset of over 2,400 multiple-choice questions for evaluating AI systems on a range of practical biology research capabilities.
We measure performance of several frontier language models against our benchmark and report results compared to human expert biology researchers.
arXiv Detail & Related papers (2024-07-14T23:52:25Z) - Boolean matrix logic programming for active learning of gene functions in genome-scale metabolic network models [4.762323642506732]
We seek to apply logic-based machine learning techniques to facilitate cellular engineering and drive biological discovery.
We introduce a new system, $BMLP_active$, which efficiently explores the genomic hypothesis space by guiding informative experimentation.
$BMLP_active$ can successfully learn the interaction between a gene pair with fewer training examples than random experimentation.
arXiv Detail & Related papers (2024-05-10T09:51:06Z) - CRISPR-GPT: An LLM Agent for Automated Design of Gene-Editing Experiments [51.41735920759667]
Large Language Models (LLMs) have shown promise in various tasks, but they often lack specific knowledge and struggle to accurately solve biological design problems.
In this work, we introduce CRISPR-GPT, an LLM agent augmented with domain knowledge and external tools to automate and enhance the design process of CRISPR-based gene-editing experiments.
arXiv Detail & Related papers (2024-04-27T22:59:17Z) - CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z) - Natural Language as Policies: Reasoning for Coordinate-Level Embodied Control with LLMs [7.746160514029531]
We demonstrate experimental results with LLMs that address robotics task planning problems.
Our approach acquires text descriptions of the task and scene objects, then formulates task planning through natural language reasoning.
Our approach is evaluated on a multi-modal prompt simulation benchmark.
arXiv Detail & Related papers (2024-03-20T17:58:12Z) - Chemist-X: Large Language Model-empowered Agent for Reaction Condition Recommendation in Chemical Synthesis [55.30328162764292]
Chemist-X is a comprehensive AI agent that automates the reaction condition optimization (RCO) task in chemical synthesis.
The agent uses retrieval-augmented generation (RAG) technology and AI-controlled wet-lab experiment executions.
Results of our automatic wet-lab experiments, achieved through fully LLM-supervised end-to-end operation with no human in the loop, demonstrate Chemist-X's ability to power self-driving laboratories.
arXiv Detail & Related papers (2023-11-16T01:21:33Z) - ProgPrompt: Generating Situated Robot Task Plans using Large Language Models [68.57918965060787]
Large language models (LLMs) can be used to score potential next actions during task planning.
We present a programmatic LLM prompt structure that enables plan generation functional across situated environments.
arXiv Detail & Related papers (2022-09-22T20:29:49Z) - Automatic coding of students' writing via Contrastive Representation Learning in the Wasserstein space [6.884245063902909]
This work is a step towards building a statistical machine learning (ML) method for supporting qualitative analyses of students' writing.
We show that the ML algorithm approached the inter-rater reliability of human analysis.
arXiv Detail & Related papers (2020-11-26T16:52:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.