Can Large Language Models Replace Data Scientists in Biomedical Research?
- URL: http://arxiv.org/abs/2410.21591v2
- Date: Tue, 08 Apr 2025 21:48:54 GMT
- Title: Can Large Language Models Replace Data Scientists in Biomedical Research?
- Authors: Zifeng Wang, Benjamin Danek, Ziwei Yang, Zheng Chen, Jimeng Sun
- Abstract summary: Large language models (LLMs) have shown great potential in supporting medical tasks and performing well in general coding tests. This benchmark comprises 293 coding tasks (128 in Python and 165 in R) performed on real-world TCGA-type genomics and clinical data. We developed a platform that integrates LLMs into the data science workflow for medical professionals.
- Score: 28.211990967264818
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data science plays a critical role in biomedical research, but it requires professionals with expertise in coding and medical data analysis. Large language models (LLMs) have shown great potential in supporting medical tasks and performing well in general coding tests. However, existing evaluations fail to assess their capability in biomedical data science, particularly in handling diverse data types such as genomics and clinical datasets. To address this gap, we developed a benchmark of data science coding tasks derived from the analyses of 39 published studies. This benchmark comprises 293 coding tasks (128 in Python and 165 in R) performed on real-world TCGA-type genomics and clinical data. Our findings reveal that vanilla prompting of LLMs yields suboptimal performance due to failures in following input instructions, understanding target data, and adhering to standard analysis practices. Next, we benchmarked six cutting-edge LLMs and advanced adaptation methods, finding two methods to be particularly effective: chain-of-thought prompting, which provides a step-by-step plan for data analysis and led to a 21% code accuracy improvement (56.6% versus 35.3%); and self-reflection, which enables LLMs to refine buggy code iteratively and yielded an 11% code accuracy improvement (45.5% versus 34.3%). Building on these insights, we developed a platform that integrates LLMs into the data science workflow for medical professionals. In a user study with five medical professionals, we found that while LLMs cannot fully automate programming tasks, they significantly streamline the programming process. Overall, 80% of the submitted code solutions incorporated LLM-generated code, with up to 96% reuse in some cases. Our analysis highlights the potential of LLMs to enhance data science efficiency in biomedical research when integrated into expert workflows.
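The two adaptation methods highlighted in the abstract follow a simple pattern: prompt for an explicit analysis plan before code (chain-of-thought), and feed execution errors back to the model for repair (self-reflection). The sketch below illustrates that pattern only; it is not the authors' released platform, and the `call_llm` and `run_and_capture_error` helpers are placeholders introduced for this example.

```python
from typing import Optional

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API call; returns generated text."""
    raise NotImplementedError

def run_and_capture_error(code: str) -> Optional[str]:
    """Execute generated code in a sandbox; return the traceback, or None on success."""
    raise NotImplementedError

def generate_with_cot(task: str, data_description: str) -> str:
    # Chain-of-thought prompting: request a step-by-step analysis plan
    # (loading, preprocessing, statistics, output format) before the code.
    prompt = (
        f"Task: {task}\nData: {data_description}\n"
        "First write a step-by-step plan for this analysis, "
        "then produce complete code implementing that plan."
    )
    return call_llm(prompt)

def refine_with_self_reflection(task: str, code: str, max_rounds: int = 3) -> str:
    # Self-reflection: run the generated code and, if it fails, return the
    # error to the model and ask for a corrected version, up to max_rounds.
    for _ in range(max_rounds):
        error = run_and_capture_error(code)
        if error is None:
            break
        code = call_llm(
            f"Task: {task}\nYour previous code failed with:\n{error}\n"
            "Reflect on the cause and return a corrected version of the code."
        )
    return code
```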
Related papers
- BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology [0.8061245870721293]
Large Language Models (LLMs) and LLM-based agents show great promise in accelerating scientific research.
We present the Bioinformatics Benchmark (BixBench), a dataset comprising over 50 real-world scenarios of practical biological data analysis.
We evaluate the performance of two frontier LLMs using a custom agent framework we open source.
arXiv Detail & Related papers (2025-02-28T18:47:57Z) - DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? [58.330879414174476]
We introduce DSBench, a benchmark designed to evaluate data science agents with realistic tasks.
This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions.
Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG).
arXiv Detail & Related papers (2024-09-12T02:08:00Z) - Biomedical Large Languages Models Seem not to be Superior to Generalist Models on Unseen Medical Data [3.469567586411153]
Large language models (LLMs) have shown potential in biomedical applications, leading to efforts to fine-tune them on domain-specific data.
This study evaluates the performance of biomedically fine-tuned LLMs against their general-purpose counterparts on a variety of clinical tasks.
arXiv Detail & Related papers (2024-08-25T13:36:22Z) - When Raw Data Prevails: Are Large Language Model Embeddings Effective in Numerical Data Representation for Medical Machine Learning Applications? [8.89829757177796]
We examine the effectiveness of vector representations from last hidden states of Large Language Models for medical diagnostics and prognostics.
We focus on instruction-tuned LLMs in a zero-shot setting to represent abnormal physiological data and evaluate their utilities as feature extractors.
Although the findings suggest that raw data features still prevail in medical ML tasks, zero-shot LLM embeddings demonstrate competitive results.
arXiv Detail & Related papers (2024-08-15T03:56:40Z) - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z) - SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z) - MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization [86.61052121715689]
MatPlotAgent is a model-agnostic framework designed to automate scientific data visualization tasks.
MatPlotBench is a high-quality benchmark consisting of 100 human-verified test cases.
arXiv Detail & Related papers (2024-02-18T04:28:28Z) - LongHealth: A Question Answering Benchmark with Long Clinical Documents [36.05587855811346]
We present the LongHealth benchmark, comprising 20 detailed fictional patient cases across various diseases.
The benchmark challenges LLMs with 400 multiple-choice questions in three categories: information extraction, negation, and sorting.
We evaluated nine open-source LLMs with a minimum context length of 16,000 tokens and also included OpenAI's proprietary and cost-efficient GPT-3.5 Turbo for comparison.
arXiv Detail & Related papers (2024-01-25T19:57:00Z) - LLMs for Science: Usage for Code Generation and Data Analysis [0.07499722271664144]
Large language models (LLMs) have been touted to enable increased productivity in many areas of today's work life.
It is still unclear how the potential of LLMs will materialise in research practice.
arXiv Detail & Related papers (2023-11-28T12:29:33Z) - Large Language Models Illuminate a Progressive Pathway to Artificial
Healthcare Assistant: A Review [16.008511195589925]
Large language models (LLMs) have shown promising capabilities in mimicking human-level language comprehension and reasoning.
This paper provides a comprehensive review on the applications and implications of LLMs in medicine.
arXiv Detail & Related papers (2023-11-03T13:51:36Z) - Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data
Generation with Large Language Models [48.07083163501746]
Clinical natural language processing requires methods that can address domain-specific challenges.
We propose an innovative, resource-efficient approach, ClinGen, which infuses knowledge into the process.
Our empirical study across 7 clinical NLP tasks and 16 datasets reveals that ClinGen consistently enhances performance across various tasks.
arXiv Detail & Related papers (2023-11-01T04:37:28Z) - Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization [8.456700096020601]
Large language models (LLMs) have shown promise in natural language processing (NLP), but their effectiveness on a diverse range of clinical summarization tasks remains unproven.
In this study, we apply adaptation methods to eight LLMs, spanning four distinct clinical summarization tasks.
A clinical reader study with ten physicians evaluates summary completeness, correctness, and conciseness; in a majority of cases, summaries from our best adapted LLMs are either equivalent (45%) or superior (36%) compared to summaries from medical experts.
arXiv Detail & Related papers (2023-09-14T05:15:01Z) - MedAlign: A Clinician-Generated Dataset for Instruction Following with
Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z) - PyTrial: Machine Learning Software and Benchmark for Clinical Trial
Applications [49.69824178329405]
PyTrial provides benchmarks and open-source implementations of a series of machine learning algorithms for clinical trial design and operations.
We thoroughly investigate 34 ML algorithms for clinical trials across 6 different tasks, including patient outcome prediction, trial site selection, trial outcome prediction, patient-trial matching, trial similarity search, and synthetic data generation.
PyTrial defines each task through a simple four-step process: data loading, model specification, model training, and model evaluation, all achievable with just a few lines of code.
arXiv Detail & Related papers (2023-06-06T21:19:03Z) - Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
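The self-verification framework described above can be approximated as a two-pass prompt: an extraction pass followed by a verification pass that asks the model to quote the supporting evidence for each extracted item and to drop anything it cannot ground. The sketch below is a generic illustration of that pattern, not the paper's released code; `call_llm` is a hypothetical chat-completion helper.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API call; returns generated text."""
    raise NotImplementedError

def extract_with_self_verification(note: str, target: str) -> list:
    # Pass 1: extract candidate items of the requested type from the note.
    extraction = call_llm(
        f"Clinical note:\n{note}\n\n"
        f"List every {target} mentioned in the note as a JSON array of strings."
    )
    candidates = json.loads(extraction)

    # Pass 2: ask the model to verify each candidate by quoting the exact
    # supporting span; candidates without grounding evidence are discarded.
    verified = []
    for item in candidates:
        check = call_llm(
            f"Clinical note:\n{note}\n\n"
            f"Is '{item}' a {target} explicitly stated in the note? "
            "If yes, reply with the exact supporting quote; if not, reply NO."
        )
        if check.strip().upper() != "NO":
            verified.append({"entity": item, "evidence": check.strip()})
    return verified
```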