Related papers: From Prompt to Pipeline: Large Language Models for Scientific Workflow Development in Bioinformatics

URL: http://arxiv.org/abs/2507.20122v1
Date: Sun, 27 Jul 2025 04:08:11 GMT
Title: From Prompt to Pipeline: Large Language Models for Scientific Workflow Development in Bioinformatics
Authors: Khairul Alam, Banani Roy
Abstract summary: This study investigates whether modern Large Language Models (LLMs) can support the generation of accurate, complete, and usable bioinformatics workflows. We evaluate these models using diverse tasks such as SNP analysis, RNA-seq, DNA methylation, and data retrieval, spanning both Galaxy and Nextflow platforms. The results show that Gemini 2.5 Flash excels in generating Galaxy workflows, while DeepSeek-V3 performs strongly in Nextflow.
Score: 2.2160604288512324
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The increasing complexity of bioinformatics data analysis has made Scientific Workflow Systems (SWSs) like Galaxy and Nextflow essential for enabling scalable, reproducible, and automated workflows. However, creating and understanding these workflows remains challenging, particularly for domain experts without programming expertise. This study investigates whether modern Large Language Models (LLMs), GPT-4o, Gemini 2.5 Flash, and DeepSeek-V3, can support the generation of accurate, complete, and usable bioinformatics workflows, and examines which prompting strategies most effectively guide this process. We evaluate these models using diverse tasks such as SNP analysis, RNA-seq, DNA methylation, and data retrieval, spanning both graphical (Galaxy) and script-based (Nextflow) platforms. Expert reviewers assess the generated workflows against community-curated baselines from the Galaxy Training Network and nf-core repositories. The results show that Gemini 2.5 Flash excels in generating Galaxy workflows, while DeepSeek-V3 performs strongly in Nextflow. Prompting strategies significantly impact quality, with role-based and chain-of-thought prompts improving completeness and correctness. While GPT-4o benefits from structured inputs, DeepSeek-V3 offers rich technical detail, albeit with some verbosity. Overall, the findings highlight the potential of LLMs to lower the barrier for workflow development, improve reproducibility, and democratize access to computational tools in bioinformatics, especially when combined with thoughtful prompt engineering.

Related papers

Making a Pipeline Production-Ready: Challenges and Lessons Learned in the Healthcare Domain [2.0905671861214894]
SPIRA is a project whose goal is to create an ML-Enabled System (MLES) to pre-diagnose respiratory insufficiency via speech analysis. This paper presents an overview of the architecture of the MLES, then compares three versions of its Continuous Training subsystem. The paper shares challenges and lessons learned, offering insights for researchers and practitioners seeking to productionize their pipelines.
arXiv Detail & Related papers (2025-06-07T23:00:13Z)
ComfyGPT: A Self-Optimizing Multi-Agent System for Comprehensive ComfyUI Workflow Generation [71.31634636156384]
We introduce ComfyGPT, the first self-optimizing multi-agent system designed to automatically generate ComfyUI workflows from task descriptions. ComfyGPT comprises four specialized agents: ReformatAgent, FlowAgent, RefineAgent, and ExecuteAgent. FlowDataset is a large-scale dataset containing 13,571 workflow-description pairs, and FlowBench is a benchmark for evaluating workflow generation systems.
arXiv Detail & Related papers (2025-03-22T06:48:50Z)
WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models [105.46456444315693]
We present WorkflowLLM, a data-centric framework to enhance the capability of large language models in workflow orchestration. It first constructs a large-scale fine-tuning dataset, WorkflowBench, with 106,763 samples, covering 1,503 APIs from 83 applications across 28 categories. WorkflowLlama demonstrates a strong capacity to orchestrate complex workflows, while also achieving notable generalization performance.
arXiv Detail & Related papers (2024-11-08T09:58:02Z)
Benchmarking Agentic Workflow Generation [80.74757493266057]
We introduce WorfBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. We also present WorfEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms. We observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference.
arXiv Detail & Related papers (2024-10-10T12:41:19Z)
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering workflows. Spider2-V features real-world tasks in authentic computer environments and incorporates 20 enterprise-level professional applications. These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z)
MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows [58.56005277371235]
We introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of Scientific Workflows. MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years. We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset.
arXiv Detail & Related papers (2024-06-10T15:19:09Z)
Large Language Models to the Rescue: Reducing the Complexity in Scientific Workflow Development Using ChatGPT [11.410608233274942]
Scientific workflow systems are increasingly popular for expressing and executing complex data analysis pipelines over large datasets. However, implementing workflows is difficult due to the involvement of many black-box tools and the deep infrastructure stack necessary for their execution. We investigate the efficiency of Large Language Models, specifically ChatGPT, to support users when dealing with scientific workflows.
arXiv Detail & Related papers (2023-11-03T10:28:53Z)
Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming [77.38174112525168]
We present Nemo, an end-to-end interactive weak supervision (WS) system that improves the overall productivity of the WS learning pipeline by an average of 20% (and up to 47% in one task) compared to the prevailing WS approach.
arXiv Detail & Related papers (2022-03-02T19:57:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.