From Prompt to Pipeline: Large Language Models for Scientific Workflow Development in Bioinformatics
- URL: http://arxiv.org/abs/2507.20122v2
- Date: Thu, 14 Aug 2025 22:22:19 GMT
- Title: From Prompt to Pipeline: Large Language Models for Scientific Workflow Development in Bioinformatics
- Authors: Khairul Alam, Banani Roy
- Abstract summary: This study explores whether state-of-the-art Large Language Models can assist in generating accurate, complete, and usable bioinformatics workflows. The generated workflows are evaluated against community-curated baselines from the Galaxy Training Network and nf-core. Results show that Gemini 2.5 Flash produced the most accurate and user-friendly workflows in Galaxy, while DeepSeek-V3 excelled in Nextflow pipeline generation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scientific Workflow Systems such as Galaxy and Nextflow are essential for scalable, reproducible, and automated bioinformatics analyses. However, developing and understanding scientific workflows remains challenging for many domain scientists due to the complexity of tool/module selection, infrastructure requirements, and limited programming expertise. This study explores whether state-of-the-art Large Language Models such as GPT-4o, Gemini 2.5 Flash, and DeepSeek-V3 can assist in generating accurate, complete, and usable bioinformatics workflows. We evaluate a set of representative workflows covering tasks such as RNA-seq, SNP analysis, and DNA methylation across both Galaxy (graphical) and Nextflow (script-based) platforms. To simulate realistic usage, we adopt a tiered prompting strategy: each workflow is first generated using an instruction-only prompt; if the output is incomplete or incorrect, we escalate to a role-based prompt, and finally to chain-of-thought prompting if needed. The generated workflows are evaluated against community-curated baselines from the Galaxy Training Network and nf-core, using criteria including correctness, completeness, tool appropriateness, and executability. Results show that LLMs exhibit strong potential in workflow development. Gemini 2.5 Flash produced the most accurate and user-friendly workflows in Galaxy, while DeepSeek-V3 excelled in Nextflow pipeline generation. GPT-4o performed well with structured prompts. Prompting strategy significantly influenced output quality, with role-based and chain-of-thought prompts enhancing correctness and completeness. Overall, LLMs can reduce the cognitive and technical barriers to workflow development, making SWSs more accessible to both novice and expert users. This work highlights the practical utility of LLMs and provides actionable insights for integrating them into real-world bioinformatics workflow design.
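The tiered prompting strategy in the abstract can be sketched as a simple escalation loop. This is a minimal illustration, not the authors' actual harness: `call_llm` and `is_valid_workflow` are hypothetical stand-ins for a real model API call and the paper's correctness/completeness checks against community baselines.

```python
# Minimal sketch of the tiered prompting strategy: instruction-only ->
# role-based -> chain-of-thought, escalating until the output validates.
# `call_llm` and `is_valid_workflow` are hypothetical stand-ins.

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call (e.g. GPT-4o, Gemini 2.5 Flash, DeepSeek-V3)."""
    return f"workflow generated for: {prompt}"

def is_valid_workflow(workflow: str) -> bool:
    """Stand-in validity check; the paper evaluates correctness, completeness,
    tool appropriateness, and executability against curated baselines."""
    return "step by step" in workflow  # placeholder criterion for illustration

def generate_workflow(task: str) -> tuple[str, str]:
    """Escalate through prompt tiers until the generated workflow validates."""
    tiers = [
        ("instruction-only", f"Generate a {task} workflow."),
        ("role-based",
         f"You are a bioinformatics workflow expert. Generate a {task} workflow."),
        ("chain-of-thought",
         f"Think step by step about tool selection, inputs, and outputs, "
         f"then generate a {task} workflow."),
    ]
    for tier_name, prompt in tiers:
        workflow = call_llm(prompt)
        if is_valid_workflow(workflow):
            return tier_name, workflow
    return tier_name, workflow  # fall back to the last attempt

tier, wf = generate_workflow("RNA-seq")
print(tier)  # which prompting tier first produced a valid workflow
```

With the placeholder check above, the loop escalates to the chain-of-thought tier before succeeding, mirroring the paper's observation that richer prompts improve correctness and completeness.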
Related papers
- TraceLLM: Leveraging Large Language Models with Prompt Engineering for Enhanced Requirements Traceability [4.517933493143603]
This paper introduces TraceLLM, a framework for enhancing requirements traceability through prompt engineering and demonstration selection. We assess prompt generalization and robustness using eight state-of-the-art LLMs on four benchmark datasets.
arXiv Detail & Related papers (2026-02-01T14:29:13Z) - Large Language Model Agent for User-friendly Chemical Process Simulations [0.0]
A large language model (LLM) agent is integrated with AVEVA Process Simulation via the Model Context Protocol (MCP), enabling natural-language-driven simulations. Two case studies assess the framework across different task complexities and interaction modes. The framework benefits both educational users, by translating technical concepts, and experienced practitioners, by automating data extraction and speeding routine tasks. While current limitations such as oversimplification, calculation errors, and technical hiccups mean expert oversight is still needed, the framework suggests LLM-based agents can become valuable collaborators.
arXiv Detail & Related papers (2026-01-15T12:18:45Z) - Towards LLM-Powered Task-Aware Retrieval of Scientific Workflows for Galaxy [5.3326639738035055]
We propose a task-aware, two-stage retrieval framework that integrates dense vector search with large language model (LLM)-based reranking. Our system first retrieves candidate workflows using state-of-the-art embedding models and then reranks them using instruction-tuned generative LLMs. We conduct a comprehensive comparison of lexical, dense, and reranking models using standard IR metrics, presenting the first systematic evaluation of retrieval performance in the Galaxy ecosystem.
arXiv Detail & Related papers (2025-11-03T17:12:03Z) - Making a Pipeline Production-Ready: Challenges and Lessons Learned in the Healthcare Domain [2.0905671861214894]
SPIRA is a project whose goal is to create an ML-Enabled System (MLES) to pre-diagnose respiratory insufficiency via speech analysis. This paper presents an overview of the architecture of the MLES, then compares three versions of its Continuous Training subsystem. The paper shares challenges and lessons learned, offering insights for researchers and practitioners seeking to productionize their pipelines.
arXiv Detail & Related papers (2025-06-07T23:00:13Z) - ComfyGPT: A Self-Optimizing Multi-Agent System for Comprehensive ComfyUI Workflow Generation [71.31634636156384]
We introduce ComfyGPT, the first self-optimizing multi-agent system designed to automatically generate ComfyUI workflows from task descriptions. ComfyGPT comprises four specialized agents: ReformatAgent, FlowAgent, RefineAgent, and ExecuteAgent. FlowDataset is a large-scale dataset containing 13,571 workflow-description pairs, and FlowBench is a benchmark for evaluating workflow generation systems.
arXiv Detail & Related papers (2025-03-22T06:48:50Z) - WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models [105.46456444315693]
We present WorkflowLLM, a data-centric framework to enhance the capability of large language models in workflow orchestration.
It first constructs WorkflowBench, a large-scale fine-tuning dataset with 106,763 samples, covering 1,503 APIs from 83 applications across 28 categories.
The resulting model, WorkflowLlama, demonstrates a strong capacity to orchestrate complex APIs, while also achieving notable generalization performance.
arXiv Detail & Related papers (2024-11-08T09:58:02Z) - Benchmarking Agentic Workflow Generation [80.74757493266057]
We introduce WorfBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. We also present WorfEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms. We observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference.
arXiv Detail & Related papers (2024-10-10T12:41:19Z) - Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering.
Spider2-V features real-world tasks in authentic computer environments and incorporates 20 enterprise-level professional applications.
These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z) - Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z) - MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows [58.56005277371235]
We introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of Scientific Workflows.
MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years.
We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset.
arXiv Detail & Related papers (2024-06-10T15:19:09Z) - Large Language Models to the Rescue: Reducing the Complexity in Scientific Workflow Development Using ChatGPT [11.410608233274942]
Scientific workflow systems are increasingly popular for expressing and executing complex data analysis pipelines over large datasets.
However, implementing workflows is difficult due to the involvement of many black-box tools and the deep infrastructure stack necessary for their execution.
We investigate the efficiency of Large Language Models, specifically ChatGPT, to support users when dealing with scientific workflows.
arXiv Detail & Related papers (2023-11-03T10:28:53Z) - Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming [77.38174112525168]
We present Nemo, an end-to-end interactive weak supervision (WS) system that improves the overall productivity of the WS learning pipeline by an average of 20% (and up to 47% in one task) compared to the prevailing WS approach.
arXiv Detail & Related papers (2022-03-02T19:57:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.