ProtocoLLM: Automatic Evaluation Framework of LLMs on Domain-Specific Scientific Protocol Formulation Tasks
- URL: http://arxiv.org/abs/2410.04601v1
- Date: Sun, 6 Oct 2024 19:28:55 GMT
- Title: ProtocoLLM: Automatic Evaluation Framework of LLMs on Domain-Specific Scientific Protocol Formulation Tasks
- Authors: Seungjun Yi, Jaeyoung Lim, Juyong Yoon
- Abstract summary: Large Language Models (LLMs) excel at Scientific Protocol Formulation Tasks (SPFT).
We propose a flexible, automatic framework to evaluate LLM's capability on SPFT: ProtocoLLM.
We evaluate GPT variations, Llama, Mixtral, Gemma, Cohere, and Gemini.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated generation of scientific protocols executable by robots can significantly accelerate scientific research processes. Large Language Models (LLMs) excel at Scientific Protocol Formulation Tasks (SPFT), but evaluation of their capabilities relies on human evaluation. Here, we propose a flexible, automatic framework to evaluate LLMs' capabilities on SPFT: ProtocoLLM. This framework prompts the target model and GPT-4 to extract pseudocode from biology protocols using only predefined lab actions, then evaluates the output of the target model using LLAM-EVAL, with the pseudocode generated by GPT-4 serving as a baseline and Llama-3 acting as the evaluator. Our adaptable prompt-based evaluation method, LLAM-EVAL, offers significant flexibility in terms of evaluation model, material, and criteria, and is free of cost. We evaluate GPT variations, Llama, Mixtral, Gemma, Cohere, and Gemini. Overall, we find that GPT and Cohere are powerful scientific protocol formulators. We also introduce BIOPROT 2.0, a dataset of biology protocols with corresponding pseudocodes, which can aid LLMs in both formulation and evaluation of SPFT. Our work is extensible to assess LLMs on SPFT across various domains and other fields that require protocol generation for specific goals.
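The evaluation loop the abstract describes (extract pseudocode from both the target model and a GPT-4 baseline, then have an evaluator model score the candidate against the baseline) can be sketched as below. This is a minimal, hypothetical illustration: the function names, prompt wording, lab-action list, and toy stand-in models are assumptions for demonstration, not the paper's actual implementation; real use would replace the stand-ins with API calls to the target model, GPT-4, and Llama-3.

```python
# Hypothetical sketch of a ProtocoLLM-style evaluation loop.
# Model calls are plain callables so the sketch runs without API access;
# all names and prompts here are illustrative, not from the paper.

LAB_ACTIONS = ["add", "mix", "incubate", "centrifuge", "transfer"]

def extract_pseudocode(model, protocol: str) -> list[str]:
    """Ask `model` to rewrite a protocol using only predefined lab actions."""
    prompt = (
        "Rewrite the protocol as pseudocode using only these actions: "
        + ", ".join(LAB_ACTIONS) + "\n\n" + protocol
    )
    return model(prompt)

def llam_eval(evaluator, candidate: list[str], baseline: list[str]) -> float:
    """LLAM-EVAL-style step: an evaluator model scores the candidate
    against the baseline pseudocode. Here the evaluator is any callable
    returning a score in [0, 1]."""
    return evaluator(candidate, baseline)

# Toy stand-ins for the target model, the GPT-4 baseline, and the
# Llama-3 evaluator, so the pipeline is runnable end to end.
def toy_target(prompt: str) -> list[str]:
    return ["add(buffer)", "mix(sample)", "incubate(37C, 30min)"]

def toy_baseline_model(prompt: str) -> list[str]:
    return ["add(buffer)", "mix(sample)", "centrifuge(5min)"]

def toy_evaluator(candidate, baseline) -> float:
    # Crude overlap score standing in for an LLM judgment.
    overlap = len(set(candidate) & set(baseline))
    return overlap / max(len(baseline), 1)

protocol = "Add buffer to the sample, mix, then incubate at 37 C."
candidate = extract_pseudocode(toy_target, protocol)
baseline = extract_pseudocode(toy_baseline_model, protocol)
score = llam_eval(toy_evaluator, candidate, baseline)
print(round(score, 2))
```

Because the evaluator is just a callable, the same skeleton accommodates the flexibility the abstract claims: swapping the evaluation model, material, or criteria only changes which callable and prompt are plugged in.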
Related papers
- BMFM-RNA: An Open Framework for Building and Evaluating Transcriptomic Foundation Models [1.1906923449182683]
We present BMFM-RNA, an open-source, modular software package that unifies diverse TFM pretraining and fine-tuning objectives. We introduce a novel training objective, whole cell expression decoder (WCED), which captures global expression patterns using an autoencoder-like CLS bottleneck representation. We show that WCED-based models achieve performance that matches or exceeds state-of-the-art approaches like scGPT.
arXiv Detail & Related papers (2025-06-17T15:40:08Z) - BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning [31.739027752007928]
We present BioProBench, the first large-scale, multi-task benchmark for biological protocol understanding and reasoning. Built upon 27K original protocols, it yields nearly 556K high-quality structured instances.
arXiv Detail & Related papers (2025-05-11T09:42:24Z) - LLM Agent Swarm for Hypothesis-Driven Drug Discovery [2.7036595757881323]
PharmaSwarm is a unified multi-agent framework that orchestrates specialized "agents" to propose, validate, and refine hypotheses for novel drug targets and lead compounds.
By acting as an AI copilot, PharmaSwarm can accelerate translational research and deliver high-confidence hypotheses more efficiently than traditional pipelines.
arXiv Detail & Related papers (2025-04-24T22:27:50Z) - DataSciBench: An LLM Agent Benchmark for Data Science [33.3811507234528]
DataSciBench is a benchmark for evaluating Large Language Model (LLM) capabilities in data science.
We develop a semi-automated pipeline for generating ground truth (GT) and validating evaluation metrics.
We propose an innovative Task - Function - Code framework to assess each code execution outcome.
arXiv Detail & Related papers (2025-02-19T17:31:51Z) - ACE-$M^3$: Automatic Capability Evaluator for Multimodal Medical Models [34.81544597731073]
We introduce ACE-$M^3$, an open-sourced Automatic Capability Evaluator for Multimodal Medical Models.
It first utilizes a branch-merge architecture to provide both detailed analysis and a concise final score based on standard medical evaluation criteria.
arXiv Detail & Related papers (2024-12-16T05:15:43Z) - Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark [62.58869921806019]
We propose a task decomposition evaluation framework based on GPT-4o to automatically construct a new training dataset.
We design innovative training strategies to effectively distill GPT-4o's evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6.
Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-base baseline.
arXiv Detail & Related papers (2024-11-23T08:06:06Z) - ReIFE: Re-evaluating Instruction-Following Evaluation [105.75525154888655]
We present a thorough meta-evaluation of instruction following, including 25 base LLMs and 15 proposed evaluation protocols.
Our evaluation allows us to identify the best-performing base LLMs and evaluation protocols with a high degree of robustness.
arXiv Detail & Related papers (2024-10-09T17:14:50Z) - AIME: AI System Optimization via Multiple LLM Evaluators [79.03422337674664]
AIME is an evaluation protocol that utilizes multiple LLMs that each independently generate an evaluation on separate criteria and then combine them via concatenation.
We show AIME outperforming baseline methods in code generation tasks, with up to 62% higher error detection rate and up to 16% higher success rate than a single LLM evaluation protocol on the LeetCodeHard and HumanEval datasets.
arXiv Detail & Related papers (2024-10-04T04:03:24Z) - The Fellowship of the LLMs: Multi-Agent Workflows for Synthetic Preference Optimization Dataset Generation [4.524402497958597]
This paper presents a novel methodology for generating synthetic Preference Optimization (PO) datasets using multi-agents.
We evaluate the effectiveness and potential of these in automating and enhancing the dataset generation process.
arXiv Detail & Related papers (2024-08-16T12:01:55Z) - Towards Automatic Evaluation for LLMs' Clinical Capabilities: Metric, Data, and Algorithm [15.627870862369784]
Large language models (LLMs) are attracting increasing interest for improving clinical efficiency in medical diagnosis.
We propose an automatic evaluation paradigm tailored to assess the LLMs' capabilities in delivering clinical services.
arXiv Detail & Related papers (2024-03-25T06:17:54Z) - SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents [50.82665351100067]
FlowGen is a code generation framework that emulates software process models based on multiple Large Language Model (LLM) agents.
We evaluate FlowGenScrum on four benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET.
arXiv Detail & Related papers (2024-03-23T14:04:48Z) - MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization [86.61052121715689]
MatPlotAgent is a model-agnostic framework designed to automate scientific data visualization tasks.
MatPlotBench is a high-quality benchmark consisting of 100 human-verified test cases.
arXiv Detail & Related papers (2024-02-18T04:28:28Z) - BioPlanner: Automatic Evaluation of LLMs on Protocol Planning in Biology [41.952424120054914]
Large Language Models (LLMs) have impressive capabilities on a wide range of tasks.
We present an automatic evaluation framework for the task of planning experimental protocols.
We evaluate GPT-3 and GPT-4 on this task and explore their robustness.
arXiv Detail & Related papers (2023-10-16T17:54:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.