Analyzing GitHub Issues and Pull Requests in nf-core Pipelines: Insights into nf-core Pipeline Repositories
- URL: http://arxiv.org/abs/2601.09612v1
- Date: Wed, 14 Jan 2026 16:34:00 GMT
- Title: Analyzing GitHub Issues and Pull Requests in nf-core Pipelines: Insights into nf-core Pipeline Repositories
- Authors: Khairul Alam, Banani Roy
- Abstract summary: Nextflow's nf-core community curates standardized, peer-reviewed pipelines that follow strict testing, documentation, and governance guidelines. This paper presents an empirical study of 25,173 issues and pull requests from these pipelines to uncover recurring challenges, management practices, and perceived difficulties. We identify 13 key challenges, including pipeline development and integration, bug fixing, integrating genomic data, managing CI configurations, and handling version updates.
- Score: 4.902956965439
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scientific Workflow Management Systems (SWfMSs) such as Nextflow have become essential software frameworks for conducting reproducible, scalable, and portable computational analyses in data-intensive fields like genomics, transcriptomics, and proteomics. Building on Nextflow, the nf-core community curates standardized, peer-reviewed pipelines that follow strict testing, documentation, and governance guidelines. Despite its broad adoption, little is known about the challenges users face during the development and maintenance of these pipelines. This paper presents an empirical study of 25,173 issues and pull requests from these pipelines to uncover recurring challenges, management practices, and perceived difficulties. Using BERTopic modeling, we identify 13 key challenges, including pipeline development and integration, bug fixing, integrating genomic data, managing CI configurations, and handling version updates. We then examine issue resolution dynamics, showing that 89.38% of issues and pull requests are eventually closed, with half resolved within three days. Statistical analysis reveals that the presence of labels (large effect, $δ$ = 0.94) and code snippets (medium effect, $δ$ = 0.50) significantly improves resolution likelihood. Further analysis reveals that tool development and repository maintenance pose the most significant challenges, followed by testing pipelines and CI configurations, and debugging containerized pipelines. Overall, this study provides actionable insights into the collaborative development and maintenance of nf-core pipelines, highlighting opportunities to enhance their usability, sustainability, and reproducibility.
Related papers
- CoDA: Agentic Systems for Collaborative Data Visualization [57.270599188947294]
Deep research has revolutionized data analysis, yet data scientists still devote substantial time to manually crafting visualizations. Existing approaches, including simple single- or multi-agent systems, often oversimplify the task. We introduce CoDA, a multi-agent system that employs specialized LLM agents for metadata analysis, task planning, code generation, and self-reflection.
arXiv Detail & Related papers (2025-10-03T17:30:16Z)
- Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z)
- Benchmarking Deep Search over Heterogeneous Enterprise Data [73.55304268238474]
We present a new benchmark for evaluating a form of retrieval-augmented generation (RAG) that requires source-aware, multi-hop reasoning over diverse, sparse, but related sources. We build it using a synthetic data pipeline that simulates business workflows across product planning, development, and support stages.
arXiv Detail & Related papers (2025-06-29T08:34:59Z)
- Text-to-Pipeline: Bridging Natural Language and Data Preparation Pipelines [18.75611679837171]
We introduce Text-to-Pipeline, a new task that translates natural language (NL) data preparation (DP) instructions into DP pipelines. PARROT is a large-scale benchmark to support systematic evaluation. PARROT is built by mining transformation patterns from production pipelines and instantiating them on 23,009 real-world tables.
arXiv Detail & Related papers (2025-05-21T15:40:53Z)
- Purifying, Labeling, and Utilizing: A High-Quality Pipeline for Small Object Detection [83.90563802153707]
PLUSNet is a high-quality small object detection framework. It comprises three components: the Hierarchical Feature Purifier (HFP) for purifying upstream features, the Multiple Criteria Label Assignment (MCLA) for improving the quality of midstream training samples, and the Frequency Decoupled Head (FDHead) for more effectively exploiting information to accomplish downstream tasks.
arXiv Detail & Related papers (2025-04-29T10:11:03Z)
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
- Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios [31.749442120603774]
Python execution errors during the issue resolution phase correlate with lower resolution rates and increased reasoning overheads. We have identified the most prevalent errors -- such as ModuleNotFoundError and TypeError -- and highlighted particularly challenging errors like OSError and database-related issues.
arXiv Detail & Related papers (2025-03-16T06:24:51Z)
- Data Pipeline Quality: Influencing Factors, Root Causes of Data-related Issues, and Processing Problem Areas for Developers [4.473327661758546]
This article first introduces a taxonomy of 41 factors that influence the ability of data pipelines to provide quality data.
Data, infrastructure, life cycle management, development & deployment, and processing were found to be the main influencing themes.
arXiv Detail & Related papers (2023-09-13T16:28:10Z)
- Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines [77.45213180689952]
Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy.
We introduce a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines.
We obtain an increased throughput of 3x to 13x compared to an untuned system.
arXiv Detail & Related papers (2022-02-17T14:31:58Z)
- MLCask: Efficient Management of Component Evolution in Collaborative Data Analytics Pipelines [29.999324319722508]
We address two main challenges that arise during the deployment of machine learning pipelines through the design of versioning for MLCask, an end-to-end analytics system.
We define and accelerate the metric-driven merge operation by pruning the pipeline search tree using reusable history records and pipeline compatibility information.
The effectiveness of MLCask is evaluated through an extensive study over several real-world deployment cases.
arXiv Detail & Related papers (2020-10-17T13:34:48Z)
- Rethinking Learning-based Demosaicing, Denoising, and Super-Resolution Pipeline [86.01209981642005]
We study the effects of pipelines on the mixture problem of learning-based DN, DM, and SR, in both sequential and joint solutions.
Our suggested pipeline DN$\to$SR$\to$DM yields consistently better performance than other sequential pipelines.
We propose an end-to-end Trinity Pixel Enhancement NETwork (TENet) that achieves state-of-the-art performance for the mixture problem.
arXiv Detail & Related papers (2019-05-07T13:19:05Z)