Ecosystem-wide influences on pull request decisions: insights from NPM
- URL: http://arxiv.org/abs/2410.14695v2
- Date: Mon, 10 Mar 2025 07:29:00 GMT
- Title: Ecosystem-wide influences on pull request decisions: insights from NPM
- Authors: Willem Meijer, Mirela Riveni, Ayushi Rastogi
- Abstract summary: We collect a dataset of approximately 1.8 million pull requests and 2.1 million issues from 20,052 GitHub projects within the NPM ecosystem. We find that developers with ecosystem experience make different contributions than users without. We find that combining ecosystem-wide factors with features studied in previous work to predict the outcome of pull requests reached an overall F1 score of 0.92.
- Score: 1.7205106391379021
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The pull-based development model facilitates global collaboration within open-source software projects. However, whereas it is increasingly common for software to depend on other projects in their ecosystem, most research on the pull request decision-making process explored factors within projects, not the broader software ecosystem they comprise. We uncover ecosystem-wide factors that influence pull request acceptance decisions. We collected a dataset of approximately 1.8 million pull requests and 2.1 million issues from 20,052 GitHub projects within the NPM ecosystem. Of these, 98% depend on another project in the dataset, enabling studying collaboration across dependent projects. We employed social network analysis to create a collaboration network in the ecosystem, and mixed effects logistic regression and random forest techniques to measure the impact and predictive strength of the tested features. We find that gaining experience within the software ecosystem through active participation in issue-tracking systems, submitting pull requests, and collaborating with pull request integrators and experienced developers benefits all open-source contributors, especially project newcomers. These results are complemented with an exploratory qualitative analysis of 538 pull requests. We find that developers with ecosystem experience make different contributions than users without. Zooming in on a subset of 111 pull requests with clear ecosystem involvement, we find 3 overarching and 10 specific reasons why developers involve ecosystem projects in their pull requests. The results show that combining ecosystem-wide factors with features studied in previous work to predict the outcome of pull requests reached an overall F1 score of 0.92. However, the outcomes of pull requests submitted by newcomers are harder to predict.
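For readers unfamiliar with the F1 score reported in the abstract, the following is a minimal sketch of how F1 is computed for a binary "merged vs. not merged" prediction. The labels and the function name `f1_score` are hypothetical and for illustration only; the abstract does not state how the paper's overall F1 is averaged, so this shows only the standard positive-class definition.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (1 = merged), computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly predicted merges
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted merge, actually not merged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed merges
    if tp == 0:
        return 0.0  # precision and recall are both zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical outcomes for eight pull requests (not data from the paper).
y_true = [1, 1, 1, 0, 1, 0, 1, 1]
y_pred = [1, 1, 0, 0, 1, 1, 1, 1]
score = f1_score(y_true, y_pred)
```

In this toy example there are 5 true positives, 1 false positive, and 1 false negative, so precision and recall are both 5/6 and the F1 score equals 5/6 as well.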
Related papers
- Why Authors and Maintainers Link (or Don't Link) Their PyPI Libraries to Code Repositories and Donation Platforms [83.16077040470975]
Metadata of libraries on the Python Package Index (PyPI) plays a critical role in supporting the transparency, trust, and sustainability of open-source libraries. This paper presents a large-scale empirical study combining two targeted surveys sent to 50,000 PyPI authors and maintainers. We analyze more than 1,400 responses using large language model (LLM)-based topic modeling to uncover key motivations and barriers related to linking repositories and donation platforms.
arXiv Detail & Related papers (2026-01-21T16:13:57Z) - OpenOneRec Technical Report [99.17075873619352]
The OneRec series has successfully unified the fragmented recommendation pipeline into an end-to-end generative framework. OneRec Foundation (1.7B and 8B) is a family of models establishing new state-of-the-art (SOTA) results across all tasks in RecIF-Bench. When transferred to the Amazon benchmark, our models surpass the strongest baselines with an average 26.8% improvement in Recall@10 across 10 diverse datasets.
arXiv Detail & Related papers (2025-12-31T10:15:53Z) - Who Do You Think You Are? Creating RSE Personas from GitHub Interactions [0.0]
We describe an approach combining software repository mining and data-driven personas applied to research software (RS) development. This allows individuals and RS project teams to understand their contributions, impact and repository dynamics. We demonstrate how the RSE personas method successfully characterises a sample of 115,174 repository contributors across 1,284 RS repositories on GitHub.
arXiv Detail & Related papers (2025-10-06T21:35:05Z) - Benchmarking Deep Search over Heterogeneous Enterprise Data [73.55304268238474]
We present a new benchmark for evaluating a form of retrieval-augmented generation (RAG). RAG requires source-aware, multi-hop reasoning over diverse, sparse, but related sources. We build it using a synthetic data pipeline that simulates business across product planning, development, and support stages.
arXiv Detail & Related papers (2025-06-29T08:34:59Z) - Open Source Software Lifecycle Classification: Developing Wrangling Techniques for Complex Sociotechnical Systems [0.0]
This paper reviews previous attempts to classify open source software and other organizational ecosystems.
It examines the divergent and sometimes conflicting purposes that may exist for classifying open source projects and how these competing interests impede our progress in developing a comprehensive understanding of how open source software projects and companies operate.
arXiv Detail & Related papers (2025-04-23T12:37:53Z) - Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models [83.65386456026441]
Data-Juicer 2.0 is a data processing system backed by 100+ data processing operators spanning text, image, video, and audio modalities. It supports more critical tasks including data analysis, synthesis, annotation, and foundation model post-training. The system is publicly available and has been widely adopted in diverse research fields and real-world products such as Alibaba Cloud PAI.
arXiv Detail & Related papers (2024-12-23T08:29:57Z) - Empirical Analysis of Pull Requests for Google Summer of Code [0.0]
The Google Summer of Code (GSoC) is a global initiative that matches students or new contributors with experienced mentors to work on open-source projects.
This study presents an empirical analysis of pull requests created by interns during the GSoC program.
arXiv Detail & Related papers (2024-12-17T17:42:43Z) - Lingma SWE-GPT: An Open Development-Process-Centric Language Model for Automated Software Improvement [62.94719119451089]
The Lingma SWE-GPT series learns from and simulates real-world code submission activities.
Lingma SWE-GPT 72B resolves 30.20% of GitHub issues, marking a significant improvement in automatic issue resolution.
arXiv Detail & Related papers (2024-11-01T14:27:16Z) - Characterising Open Source Co-opetition in Company-hosted Open Source Software Projects: The Cases of PyTorch, TensorFlow, and Transformers [5.2337753974570616]
Companies, including market rivals, have long collaborated on the development of open source software (OSS).
This results in a tangle of co-operation and competition known as "open source co-opetition".
arXiv Detail & Related papers (2024-10-23T19:35:41Z) - CoPS: Empowering LLM Agents with Provable Cross-Task Experience Sharing [70.25689961697523]
We propose a generalizable algorithm that enhances sequential reasoning by cross-task experience sharing and selection.
Our work bridges the gap between existing sequential reasoning paradigms and validates the effectiveness of leveraging cross-task experiences.
arXiv Detail & Related papers (2024-10-22T03:59:53Z) - DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z) - Multi-domain Knowledge Graph Collaborative Pre-training and Prompt Tuning for Diverse Downstream Tasks [48.102084345907095]
Knowledge graph pre-training (KGP) aims to pre-train neural networks on large-scale knowledge graphs (KGs).
MuDoK is a plug-and-play prompt learning approach that can be adapted to different downstream task backbones.
Our framework brings significant performance gains, along with its generality, efficiency, and transferability.
arXiv Detail & Related papers (2024-05-21T08:22:14Z) - Promises and Perils of Mining Software Package Ecosystem Data [10.787686237395816]
Third-party packages have led to the emergence of large software package ecosystems with a maze of inter-dependencies.
Understanding the infrastructure and dynamics of package ecosystems has given rise to approaches for better code reuse, automated updates, and the avoidance of vulnerabilities.
In this chapter, we review promises and perils of mining the rich data related to software package ecosystems available to software engineering researchers.
arXiv Detail & Related papers (2023-05-29T03:09:48Z) - The GitHub Development Workflow Automation Ecosystems [47.818229204130596]
Large-scale software development has become a highly collaborative endeavour.
This chapter explores the ecosystems of development bots and GitHub Actions.
It provides an extensive survey of the state-of-the-art in this domain.
arXiv Detail & Related papers (2023-05-08T15:24:23Z) - Studying the Characteristics of AIOps Projects on GitHub [14.58848716249407]
We conduct an in-depth analysis of open-source AIOps projects to understand the characteristics of AIOps in practice.
We identify a set of AIOps projects from GitHub and analyze their repository metrics.
Finally, we assess the quality of these projects using different quality metrics, such as the number of bugs.
arXiv Detail & Related papers (2022-12-26T18:24:45Z) - Outsourcing Training without Uploading Data via Efficient Collaborative Open-Source Sampling [49.87637449243698]
Traditional outsourcing requires uploading device data to the cloud server.
We propose to leverage widely available open-source data, which is a massive dataset collected from public and heterogeneous sources.
We develop a novel strategy called Efficient Collaborative Open-source Sampling (ECOS) to construct a proximal proxy dataset from open-source data for cloud training.
arXiv Detail & Related papers (2022-10-23T00:12:18Z) - Code Recommendation for Open Source Software Developers [32.181023933552694]
CODER is a novel graph-based code recommendation framework for open source software developers.
Our framework achieves superior performance under various experimental settings, including intra-project, cross-project, and cold-start recommendation.
arXiv Detail & Related papers (2022-10-15T16:40:36Z) - Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming [77.38174112525168]
We present Nemo, an end-to-end interactive weak supervision (WS) system that improves the overall productivity of the WS learning pipeline by an average of 20% (and up to 47% in one task) compared to the prevailing WS approach.
arXiv Detail & Related papers (2022-03-02T19:57:32Z) - Knowledge Graph Question Answering Leaderboard: A Community Resource to Prevent a Replication Crisis [61.740077541531726]
We provide a new central and open leaderboard for any KGQA benchmark dataset as a focal point for the community.
Our analysis highlights existing problems during the evaluation of KGQA systems.
arXiv Detail & Related papers (2022-01-20T13:46:01Z) - Estimating Fund-Raising Performance for Start-up Projects from a Market Graph Perspective [58.353799280109904]
We propose a Graph-based Market Environment (GME) model for predicting the fund-raising performance of the unpublished project by exploiting the market environment.
arXiv Detail & Related papers (2021-05-27T02:39:30Z) - Enabling collaborative data science development with the Ballet framework [9.424574945499844]
We present a novel conceptual framework and ML programming model to address challenges to scaling data science collaborations.
We instantiate these ideas in Ballet, a lightweight software framework for collaborative open-source data science.
arXiv Detail & Related papers (2020-12-14T18:51:23Z) - Representation of Developer Expertise in Open Source Software [12.583969739954526]
We use the World of Code infrastructure to extract the complete set of APIs in the files changed by open source developers.
We then employ Doc2Vec embeddings for vector representations of APIs, developers, and projects.
We evaluate if these embeddings reflect the postulated topology of the Skill Space.
arXiv Detail & Related papers (2020-05-20T16:36:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.