Related papers: Big Data = Big Insights? Operationalising Brooks' Law in a Massive GitHub Data Set

URL: http://arxiv.org/abs/2201.04588v1
Date: Wed, 12 Jan 2022 17:25:30 GMT
Title: Big Data = Big Insights? Operationalising Brooks' Law in a Massive GitHub Data Set
Authors: Christoph Gote, Pavlin Mavrodiev, Frank Schweitzer, Ingo Scholtes
Abstract summary: We study challenges that can explain the disagreement between recent studies of developer productivity in massive repository data. We provide, to the best of our knowledge, the largest, curated corpus of GitHub projects tailored to investigate the influence of team size and collaboration patterns on individual and collective productivity.
Score: 1.1470070927586014
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Massive data from software repositories and collaboration tools are widely used to study social aspects in software development. One question that several recent works have addressed is how a software project's size and structure influence team productivity, a question famously considered in Brooks' law. Recent studies using massive repository data suggest that developers in larger teams tend to be less productive than smaller teams. Despite using similar methods and data, other studies argue for a positive linear or even super-linear relationship between team size and productivity, thus contesting the view of software economics that software projects are diseconomies of scale. In our work, we study challenges that can explain the disagreement between recent studies of developer productivity in massive repository data. We further provide, to the best of our knowledge, the largest, curated corpus of GitHub projects tailored to investigate the influence of team size and collaboration patterns on individual and collective productivity. Our work contributes to the ongoing discussion on the choice of productivity metrics in the operationalisation of hypotheses about determinants of successful software projects. It further highlights general pitfalls in big data analysis and shows that the use of bigger data sets does not automatically lead to more reliable insights.

Related papers

Is Compression Really Linear with Code Intelligence? [60.123628177110206]
Format Annealing is a lightweight, transparent training methodology designed to assess the intrinsic capabilities of pre-trained models equitably. Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and bits-per-character (BPC). Our work provides a more nuanced understanding of compression's role in developing code intelligence and contributes a robust evaluation framework in the code domain.
arXiv Detail & Related papers (2025-05-16T16:59:14Z)
Does the Tool Matter? Exploring Some Causes of Threats to Validity in Mining Software Repositories [9.539825294372786]
We use two tools to extract and analyse ten large software projects. Despite similar trends, even simple metrics such as the numbers of commits and developers may differ by up to 500%. We find that such substantial differences are often caused by minor technical details.
arXiv Detail & Related papers (2025-01-25T07:42:56Z)
The Impact of Generative AI on Collaborative Open-Source Software Development: Evidence from GitHub Copilot [4.8256226973915455]
Using GitHub's proprietary Copilot usage data, we find that Copilot use increases project-level code contributions by 5.9%. This gain is driven by a 2.1% increase in individual code contributions and a 3.4% rise in developer coding participation. While AI expands who can contribute and how much they contribute, it slows coordination in collective development efforts.
arXiv Detail & Related papers (2024-10-02T23:26:10Z)
Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework. Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z)
Impact of the Availability of ChatGPT on Software Development: A Synthetic Difference in Differences Estimation using GitHub Data [49.1574468325115]
ChatGPT is an AI tool that enhances software production efficiency. We estimate ChatGPT's effects on the number of git pushes, repositories, and unique developers per 100,000 people. These results suggest that AI tools like ChatGPT can substantially boost developer productivity, though further analysis is needed to address potential downsides such as low quality code and privacy concerns.
arXiv Detail & Related papers (2024-06-16T19:11:15Z)
DevBench: A Comprehensive Benchmark for Software Development [72.24266814625685]
DevBench is a benchmark that evaluates large language models (LLMs) across various stages of the software development lifecycle. Empirical studies show that current LLMs, including GPT-4-Turbo, fail to solve the challenges presented within DevBench. Our findings offer actionable insights for the future development of LLMs toward real-world programming applications.
arXiv Detail & Related papers (2024-03-13T15:13:44Z)
Guiding Effort Allocation in Open-Source Software Projects Using Bus Factor Analysis [1.0878040851638]
The Bus Factor (BF) of a project is defined as 'the number of key developers who would need to be incapacitated to make a project unable to proceed'. We propose using other metrics, such as lines of code changes (LOCC) and cosine difference of lines of code (change-size-cos), to calculate the BF.
arXiv Detail & Related papers (2024-01-06T20:55:40Z)
Towards a Structural Equation Model of Open Source Blockchain Software Health [0.0]
This work uses exploratory factor analysis to identify latent constructs that are representative of general public interest or popularity in software. We find that interest is a combination of stars, forks, and text mentions in the GitHub repository, while a second factor for robustness is composed of a criticality score. A structural model of software health is proposed such that general interest positively influences developer engagement, which, in turn, positively predicts software robustness.
arXiv Detail & Related papers (2023-10-31T08:47:41Z)
On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies. Machine and deep learning algorithms depend heavily on the data used during their development. We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
Revisiting Sentiment Analysis for Software Engineering in the Era of Large Language Models [11.388023221294686]
This study investigates bigger large language models (bLLMs) in addressing the labeled data shortage that hampers fine-tuned smaller large language models (sLLMs) in software engineering tasks. We conduct a comprehensive empirical study using five established datasets to assess three open-source bLLMs in zero-shot and few-shot scenarios. Our experimental findings demonstrate that bLLMs exhibit state-of-the-art performance on datasets marked by limited training data and imbalanced distributions.
arXiv Detail & Related papers (2023-10-17T09:53:03Z)
Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow [49.724842920942024]
Industries such as finance, meteorology, and energy generate vast amounts of data daily. We propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests.
arXiv Detail & Related papers (2023-06-12T16:12:56Z)
The Dimensions of Data Labor: A Road Map for Researchers, Activists, and Policymakers to Empower Data Producers [14.392208044851976]
Data producers have little say in what data is captured, how it is used, or who it benefits. Organizations with the ability to access and process this data, e.g. OpenAI and Google, possess immense power in shaping the technology landscape. By synthesizing related literature that reconceptualizes the production of data for computing as "data labor", we outline opportunities for researchers, policymakers, and activists to empower data producers.
arXiv Detail & Related papers (2023-05-22T17:11:22Z)
Kubric: A scalable dataset generator [73.78485189435729]
Kubric is a Python framework that interfaces with PyBullet and Blender to generate photo-realistic scenes, with rich annotations, and seamlessly scales to large jobs distributed over thousands of machines. We demonstrate the effectiveness of Kubric by presenting a series of 13 different generated datasets for tasks ranging from studying 3D NeRF models to optical flow estimation.
arXiv Detail & Related papers (2022-03-07T18:13:59Z)
Organizational Artifacts of Code Development [10.863006516392831]
We study social effects of country by measuring differences in software repositories associated with different countries. We propose a novel approach of modeling repositories based on their sequence of development activities as a sequence embedding task. We conduct a case study on repos from well-known corporations and find that country can describe the differences in development better than the company affiliation itself.
arXiv Detail & Related papers (2021-05-30T22:04:09Z)
Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious. We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.