On the Creation of Representative Samples of Software Repositories
- URL: http://arxiv.org/abs/2410.00639v2
- Date: Wed, 2 Oct 2024 07:18:32 GMT
- Title: On the Creation of Representative Samples of Software Repositories
- Authors: June Gorostidi, Adem Ait, Jordi Cabot, Javier Luis Cánovas Izquierdo
- Abstract summary: With the emergence of social coding platforms such as GitHub, researchers now have access to millions of software repositories to use as source data for their studies.
Current sampling methods are often based on random selection or rely on variables which may not be related to the research study.
We present a methodology for creating representative samples of software repositories, where such representativeness is properly aligned with both the characteristics of the population of repositories and the requirements of the empirical study.
- Score: 1.8599311233727087
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Software repositories are one of the sources of data in Empirical Software Engineering, primarily in the Mining Software Repositories field, which aims at extracting knowledge from the dynamics and practice of software projects. With the emergence of social coding platforms such as GitHub, researchers now have access to millions of software repositories to use as source data for their studies. With this massive amount of data, sampling techniques are needed to create more manageable datasets. The creation of these datasets is a crucial step, and researchers have to carefully select the repositories to create representative samples according to a set of variables of interest. However, current sampling methods are often based on random selection or rely on variables which may not be related to the research study (e.g., popularity or activity). In this paper, we present a methodology for creating representative samples of software repositories, where such representativeness is properly aligned with both the characteristics of the population of repositories and the requirements of the empirical study. We illustrate our approach with use cases based on Hugging Face repositories.
Related papers
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- How to Understand Whole Software Repository? [64.19431011897515]
An excellent understanding of the whole repository will be the critical path to Automatic Software Engineering (ASE).
We develop a novel method named RepoUnderstander by guiding agents to comprehensively understand the whole repositories.
To better utilize the repository-level knowledge, we guide the agents to summarize, analyze, and plan.
arXiv Detail & Related papers (2024-06-03T15:20:06Z)
- NLP-based Relation Extraction Methods in RE [4.856095570023289]
Mobile app repositories have been largely used in scientific research as large-scale, highly adaptive crowdsourced information systems.
We present MApp-KG, a combination of software resources and data artefacts to support extended knowledge generation tasks.
Our contribution aims to provide a framework for automatically constructing a knowledge graph modelling a domain-specific catalogue of mobile apps.
arXiv Detail & Related papers (2024-01-22T16:14:27Z)
- Fingerprinting and Building Large Reproducible Datasets [3.2873782624127843]
We propose a tool-supported approach facilitating the creation of large tailored datasets while ensuring their provenance.
We propose a way to define a unique fingerprint to characterize a dataset which, when provided to the extraction process, ensures that the same dataset will be extracted.
arXiv Detail & Related papers (2023-06-20T08:59:33Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
- DATED: Guidelines for Creating Synthetic Datasets for Engineering Design Applications [3.463438487417909]
This study proposes comprehensive guidelines for generating, annotating, and validating synthetic datasets.
The study underscores the importance of thoughtful sampling methods to ensure the appropriate size, diversity, utility, and realism of a dataset.
Overall, this paper offers valuable insights for researchers intending to create and publish synthetic datasets for engineering design.
arXiv Detail & Related papers (2023-05-15T21:00:09Z)
- Creating Synthetic Datasets for Collaborative Filtering Recommender Systems using Generative Adversarial Networks [1.290382979353427]
Research and education in machine learning need diverse, representative, and open datasets to handle the necessary training, validation, and testing tasks.
To feed this research variety, it is necessary and convenient to reinforce the existing datasets with synthetic ones.
This paper proposes a Generative Adversarial Network (GAN)-based method to generate collaborative filtering datasets.
arXiv Detail & Related papers (2023-03-02T14:23:27Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- Federated Learning under Importance Sampling [49.17137296715029]
We study the effect of importance sampling and devise schemes for sampling agents and data non-uniformly guided by a performance measure.
We find that in schemes involving sampling without replacement, the performance of the resulting architecture is controlled by two factors related to data variability at each agent.
arXiv Detail & Related papers (2020-12-14T10:08:55Z)
- Empirical Study on the Software Engineering Practices in Open Source ML Package Repositories [6.2894222252929985]
Modern Machine Learning technologies require considerable technical expertise and resources to develop, train and deploy such models.
Such discovery and reuse by practitioners and researchers are being addressed by public ML package repositories.
This paper conducts an exploratory study that analyzes the structure and contents of two popular ML package repositories.
arXiv Detail & Related papers (2020-12-02T18:52:56Z)