Related papers: An Empirical Investigation on the Challenges in Scientific Workflow Systems Development

An Empirical Investigation on the Challenges in Scientific Workflow Systems Development

URL: http://arxiv.org/abs/2411.10890v1
Date: Sat, 16 Nov 2024 21:14:11 GMT
Title: An Empirical Investigation on the Challenges in Scientific Workflow Systems Development
Authors: Khairul Alam, Chanchal Roy, Banani Roy, Kartik Mittal,
Abstract summary: This study examines interactions between developers and researchers on Stack Overflow (SO) and GitHub. By analyzing issues, we identified 13 topics (e.g., Errors and Bug Fixing, Documentation, Dependencies) and discovered that data structures and operations is the most difficult. We also found common topics between SO and GitHub, such as data structures and operations, task management, and workflow scheduling.
Score: 2.704899832646869
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scientific Workflow Systems (SWSs) are advanced software frameworks that drive modern research by orchestrating complex computational tasks and managing extensive data pipelines. These systems offer a range of essential features, including modularity, abstraction, interoperability, workflow composition tools, resource management, error handling, and comprehensive documentation. Utilizing these frameworks accelerates the development of scientific computing, resulting in more efficient and reproducible research outcomes. However, developing a user-friendly, efficient, and adaptable SWS poses several challenges. This study explores these challenges through an in-depth analysis of interactions on Stack Overflow (SO) and GitHub, key platforms where developers and researchers discuss and resolve issues. In particular, we leverage topic modeling (BERTopic) to understand the topics SWSs developers discuss on these platforms. We identified 10 topics developers discuss on SO (e.g., Workflow Creation and Scheduling, Data Structures and Operations, Workflow Execution) and found that workflow execution is the most challenging. By analyzing GitHub issues, we identified 13 topics (e.g., Errors and Bug Fixing, Documentation, Dependencies) and discovered that data structures and operations is the most difficult. We also found common topics between SO and GitHub, such as data structures and operations, task management, and workflow scheduling. Additionally, we categorized each topic by type (How, Why, What, and Others). We observed that the How type consistently dominates across all topics, indicating a need for procedural guidance among developers. The dominance of the How type is also evident in domains like Chatbots and Mobile development. Our study will guide future research in proposing tools and techniques to help the community overcome the challenges developers face when developing SWSs.

Related papers

Issue Tracking Ecosystems: Context and Best Practices [0.0]
GitHub and Jira are popular tools that support Software Engineering (SE) organisations through the management of issues''<n>An Issue Tracking Ecosystem (ITE) is the aggregate of the central ITS and the related SE artefacts, stakeholders, and processes.<n>I undertake the challenge of understanding ITEs at a broader level, addressing these questions regarding complexity and diversity.
arXiv Detail & Related papers (2025-07-09T09:57:13Z)
SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving [90.32201622392137]
We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs)<n>Unlike traditional static benchmarks, SwingArena models the collaborative process of software by pairing LLMs as iterations, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines.
arXiv Detail & Related papers (2025-05-29T18:28:02Z)
Developer Challenges on Large Language Models: A Study of Stack Overflow and OpenAI Developer Forum Posts [2.704899832646869]
Large Language Models (LLMs) have gained widespread popularity due to their exceptional capabilities across various domains. This study investigates developers' challenges by analyzing community interactions on Stack Overflow and OpenAI Developer Forum.
arXiv Detail & Related papers (2024-11-16T19:38:27Z)
BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data [61.936320820180875]
Large language models (LLMs) have become increasingly pivotal across various domains. BabelBench is an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution. Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement.
arXiv Detail & Related papers (2024-10-01T15:11:24Z)
MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows [58.56005277371235]
We introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of ScientificAspects. MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years. We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset.
arXiv Detail & Related papers (2024-06-10T15:19:09Z)
A Survey of Neural Code Intelligence: Paradigms, Advances and Beyond [84.95530356322621]
This survey presents a systematic review of the advancements in code intelligence. It covers over 50 representative models and their variants, more than 20 categories of tasks, and an extensive coverage of over 680 related works. Building on our examination of the developmental trajectories, we further investigate the emerging synergies between code intelligence and broader machine intelligence.
arXiv Detail & Related papers (2024-03-21T08:54:56Z)
Mining Issue Trackers: Concepts and Techniques [6.99674326582747]
Internal and external stakeholders report, manage, and discuss "issues" An issue tracker is a software tool used by organisations to interact with users and manage various aspects of the software development lifecycle. This chapter discusses four major use cases for algorithmically analysing issue data.
arXiv Detail & Related papers (2024-03-08T23:02:41Z)
An Empirical Study of Challenges in Machine Learning Asset Management [15.07444988262748]
Despite existing research, a significant knowledge gap remains in operational challenges like model versioning, data traceability, and collaboration. Our study aims to address this gap by analyzing 15,065 posts from developer forums and platforms. We uncover 133 topics related to asset management challenges, grouped into 16 macro-topics, with software dependency, model deployment, and model training being the most discussed.
arXiv Detail & Related papers (2024-02-25T05:05:52Z)
On the Interaction between Software Engineers and Data Scientists when building Machine Learning-Enabled Systems [1.2184324428571227]
Machine Learning (ML) components have been increasingly integrated into the core systems of organizations. One of the key challenges is the effective interaction between actors with different backgrounds who need to work closely together. This paper presents an exploratory case study to understand the current interaction and collaboration dynamics between these roles in ML projects.
arXiv Detail & Related papers (2024-02-08T00:27:56Z)
Revolutionizing API Documentation through Summarization [0.0]
API documentation can be lengthy and challenging to navigate, prompting developers to seek unofficial sources such as Stack Overflow. We employ BERTopic and extractive summarization to automatically generate concise and informative API summaries. These summaries encompass key insights like general usage, common developer issues, and potential solutions, sourced from the wealth of knowledge on Stack Overflow.
arXiv Detail & Related papers (2024-01-21T01:18:08Z)
Reusability Challenges of Scientific Workflows: A Case Study for Galaxy [56.78572674167333]
This study examined the reusability of existing and exposed several challenges. The challenges preventing reusability include tool upgrading, tool support, design flaws, incomplete, failure to load a workflow, etc.
arXiv Detail & Related papers (2023-09-13T20:17:43Z)
The GitHub Development Workflow Automation Ecosystems [47.818229204130596]
Large-scale software development has become a highly collaborative endeavour. This chapter explores the ecosystems of development bots and GitHub Actions. It provides an extensive survey of the state-of-the-art in this domain.
arXiv Detail & Related papers (2023-05-08T15:24:23Z)
KILT: a Benchmark for Knowledge Intensive Language Tasks [102.33046195554886]
We present a benchmark for knowledge-intensive language tasks (KILT) All tasks in KILT are grounded in the same snapshot of Wikipedia. We find that a shared dense vector index coupled with a seq2seq model is a strong baseline.
arXiv Detail & Related papers (2020-09-04T15:32:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.