Big Data = Big Insights? Operationalising Brooks' Law in a Massive
GitHub Data Set
- URL: http://arxiv.org/abs/2201.04588v1
- Date: Wed, 12 Jan 2022 17:25:30 GMT
- Title: Big Data = Big Insights? Operationalising Brooks' Law in a Massive
GitHub Data Set
- Authors: Christoph Gote, Pavlin Mavrodiev, Frank Schweitzer, Ingo Scholtes
- Abstract summary: We study challenges that can explain the disagreement between recent studies of developer productivity in massive repository data.
We provide, to the best of our knowledge, the largest, curated corpus of GitHub projects tailored to investigate the influence of team size and collaboration patterns on individual and collective productivity.
- Score: 1.1470070927586014
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Massive data from software repositories and collaboration tools are widely
used to study social aspects in software development. One question that several
recent works have addressed is how a software project's size and structure
influence team productivity, a question famously considered in Brooks' law.
Recent studies using massive repository data suggest that developers in larger
teams tend to be less productive than smaller teams. Despite using similar
methods and data, other studies argue for a positive linear or even
super-linear relationship between team size and productivity, thus contesting
the view of software economics that software projects are diseconomies of
scale. In our work, we study challenges that can explain the disagreement
between recent studies of developer productivity in massive repository data. We
further provide, to the best of our knowledge, the largest, curated corpus of
GitHub projects tailored to investigate the influence of team size and
collaboration patterns on individual and collective productivity. Our work
contributes to the ongoing discussion on the choice of productivity metrics in
the operationalisation of hypotheses about determinants of successful software
projects. It further highlights general pitfalls in big data analysis and shows
that the use of bigger data sets does not automatically lead to more reliable
insights.
Related papers
- Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z) - Impact of the Availability of ChatGPT on Software Development: A Synthetic Difference in Differences Estimation using GitHub Data [49.1574468325115]
ChatGPT is an AI tool that enhances software production efficiency.
We estimate ChatGPT's effects on the number of git pushes, repositories, and unique developers per 100,000 people.
These results suggest that AI tools like ChatGPT can substantially boost developer productivity, though further analysis is needed to address potential downsides such as low quality code and privacy concerns.
arXiv Detail & Related papers (2024-06-16T19:11:15Z) - DevBench: A Comprehensive Benchmark for Software Development [72.24266814625685]
DevBench is a benchmark that evaluates large language models (LLMs) across various stages of the software development lifecycle.
Empirical studies show that current LLMs, including GPT-4-Turbo, fail to solve the challenges presented within DevBench.
Our findings offer actionable insights for the future development of LLMs toward real-world programming applications.
arXiv Detail & Related papers (2024-03-13T15:13:44Z) - Guiding Effort Allocation in Open-Source Software Projects Using Bus
Factor Analysis [1.0878040851638]
Bus Factor (BF) of a project defined as 'the number of key developers who would need to be incapacitated to make a project unable to proceed'
We propose using other metrics like lines of code changes (LOCC) and cosine difference of lines of code (change-size-cos) to calculate the BF.
arXiv Detail & Related papers (2024-01-06T20:55:40Z) - Towards a Structural Equation Model of Open Source Blockchain Software
Health [0.0]
This work uses exploratory factor analysis to identify latent constructs that are representative of general public interest or popularity in software.
We find that interest is a combination of stars, forks, and text mentions in the GitHub repository, while a second factor for robustness is composed of a criticality score.
A structural model of software health is proposed such that general interest positively influences developer engagement, which, in turn, positively predicts software robustness.
arXiv Detail & Related papers (2023-10-31T08:47:41Z) - On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z) - Revisiting Sentiment Analysis for Software Engineering in the Era of Large Language Models [11.388023221294686]
This study investigates bigger large language models (bLLMs) in addressing the labeled data shortage that hampers fine-tuned smaller large language models (sLLMs) in software engineering tasks.
We conduct a comprehensive empirical study using five established datasets to assess three open-source bLLMs in zero-shot and few-shot scenarios.
Our experimental findings demonstrate that bLLMs exhibit state-of-the-art performance on datasets marked by limited training data and imbalanced distributions.
arXiv Detail & Related papers (2023-10-17T09:53:03Z) - The Dimensions of Data Labor: A Road Map for Researchers, Activists, and
Policymakers to Empower Data Producers [14.392208044851976]
Data producers have little say in what data is captured, how it is used, or who it benefits.
Organizations with the ability to access and process this data, e.g. OpenAI and Google, possess immense power in shaping the technology landscape.
By synthesizing related literature that reconceptualizes the production of data for computing as data labor'', we outline opportunities for researchers, policymakers, and activists to empower data producers.
arXiv Detail & Related papers (2023-05-22T17:11:22Z) - Kubric: A scalable dataset generator [73.78485189435729]
Kubric is a Python framework that interfaces with PyBullet and Blender to generate photo-realistic scenes, with rich annotations, and seamlessly scales to large jobs distributed over thousands of machines.
We demonstrate the effectiveness of Kubric by presenting a series of 13 different generated datasets for tasks ranging from studying 3D NeRF models to optical flow estimation.
arXiv Detail & Related papers (2022-03-07T18:13:59Z) - Organizational Artifacts of Code Development [10.863006516392831]
We study social effects of country by measuring differences in software repositories associated with different countries.
We propose a novel approach of modeling repositories based on their sequence of development activities as a sequence embedding task.
We conduct a case study on repos from well-known corporations and find that country can describe the differences in development better than the company affiliation itself.
arXiv Detail & Related papers (2021-05-30T22:04:09Z) - Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.