Related papers: On the Creation of Representative Samples of Software Repositories

On the Creation of Representative Samples of Software Repositories

URL: http://arxiv.org/abs/2410.00639v2
Date: Wed, 2 Oct 2024 07:18:32 GMT
Title: On the Creation of Representative Samples of Software Repositories
Authors: June Gorostidi, Adem Ait, Jordi Cabot, Javier Luis Cánovas Izquierdo,
Abstract summary: With the emergence of social coding platforms such as GitHub, researchers have now access to millions of software repositories to use as source data for their studies. Current sampling methods are often based on random selection or rely on variables which may not be related to the research study. We present a methodology for creating representative samples of software repositories, where such representativeness is properly aligned with both the characteristics of the population of repositories and the requirements of the empirical study.
Score: 1.8599311233727087
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Software repositories is one of the sources of data in Empirical Software Engineering, primarily in the Mining Software Repositories field, aimed at extracting knowledge from the dynamics and practice of software projects. With the emergence of social coding platforms such as GitHub, researchers have now access to millions of software repositories to use as source data for their studies. With this massive amount of data, sampling techniques are needed to create more manageable datasets. The creation of these datasets is a crucial step, and researchers have to carefully select the repositories to create representative samples according to a set of variables of interest. However, current sampling methods are often based on random selection or rely on variables which may not be related to the research study (e.g., popularity or activity). In this paper, we present a methodology for creating representative samples of software repositories, where such representativeness is properly aligned with both the characteristics of the population of repositories and the requirements of the empirical study. We illustrate our approach with use cases based on Hugging Face repositories.

Related papers

OpenDORS: A dataset of openly referenced open research software [1.0026496861838448]
We present a dataset of 134,352 unique open research software projects and 134,154 source code repositories referenced in open access literature.<n>Each dataset record identifies the referencing publication and lists source code repositories of the software project.<n>For 122,425 source code repositories, the dataset provides metadata on latest versions, license information, programming languages and descriptive metadata files.
arXiv Detail & Related papers (2025-12-01T11:45:50Z)
Who Do You Think You Are? Creating RSE Personas from GitHub Interactions [0.0]
We describe an approach combining software repository mining and data-driven personas applied to research software (RS) development.<n>This allows individuals and RS project teams to understand their contributions, impact and repository dynamics.<n>We demonstrate how the RSE personas method successfully characterises a sample of 115,174 repository contributors across 1,284 RS repositories on GitHub.
arXiv Detail & Related papers (2025-10-06T21:35:05Z)
Scaling Generalist Data-Analytic Agents [95.05161133349242]
DataMind is a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents.<n>DataMind tackles three key challenges in building open-source data-analytic agents.
arXiv Detail & Related papers (2025-09-29T17:23:08Z)
SnipGen: A Mining Repository Framework for Evaluating LLMs for Code [51.07471575337676]
Language Models (LLMs) are trained on extensive datasets that include code repositories. evaluating their effectiveness poses significant challenges due to the potential overlap between the datasets used for training and those employed for evaluation. We introduce SnipGen, a comprehensive repository mining framework designed to leverage prompt engineering across various downstream tasks for code generation.
arXiv Detail & Related papers (2025-02-10T21:28:15Z)
Making Sense of Data in the Wild: Data Analysis Automation at Scale [0.1747623282473278]
We propose a novel approach that combines intelligent agents with retrieval augmented generation to automate data analysis, dataset curation and indexing at scale. We demonstrate that our approach results in more detailed dataset descriptions, higher hit rates and greater diversity in dataset retrieval tasks.
arXiv Detail & Related papers (2025-01-27T10:04:10Z)
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering. Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
How to Understand Whole Software Repository? [64.19431011897515]
An excellent understanding of the whole repository will be the critical path to Automatic Software Engineering (ASE) We develop a novel method named RepoUnderstander by guiding agents to comprehensively understand the whole repositories. To better utilize the repository-level knowledge, we guide the agents to summarize, analyze, and plan.
arXiv Detail & Related papers (2024-06-03T15:20:06Z)
NLP-based Relation Extraction Methods in RE [4.856095570023289]
Mobile app repositories have been largely used in scientific research as large-scale, highly adaptive crowdsourced information systems. We present MApp-KG, a combination of software resources and data artefacts to support extended knowledge generation tasks. Our contribution aims to provide a framework for automatically constructing a knowledge graph modelling a domain-specific catalogue of mobile apps.
arXiv Detail & Related papers (2024-01-22T16:14:27Z)
Fingerprinting and Building Large Reproducible Datasets [3.2873782624127843]
We propose a tool-supported approach facilitating the creation of large tailored datasets while ensuring their provenance. We propose a way to define a unique fingerprint to characterize a dataset which, when provided to the extraction process, ensures that the same dataset will be extracted.
arXiv Detail & Related papers (2023-06-20T08:59:33Z)
infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization. infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information. In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations. We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
DATED: Guidelines for Creating Synthetic Datasets for Engineering Design Applications [3.463438487417909]
This study proposes comprehensive guidelines for generating, annotating, and validating synthetic datasets. The study underscores the importance of thoughtful sampling methods to ensure the appropriate size, diversity, utility, and realism of a dataset. Overall, this paper offers valuable insights for researchers intending to create and publish synthetic datasets for engineering design.
arXiv Detail & Related papers (2023-05-15T21:00:09Z)
Creating Synthetic Datasets for Collaborative Filtering Recommender Systems using Generative Adversarial Networks [1.290382979353427]
Research and education in machine learning needs diverse, representative, and open datasets to handle the necessary training, validation, and testing tasks. To feed this research variety, it is necessary and convenient to reinforce the existing datasets with synthetic ones. This paper proposes a Generative Adversarial Network (GAN)-based method to generate collaborative filtering datasets.
arXiv Detail & Related papers (2023-03-02T14:23:27Z)
TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets. We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
Federated Learning under Importance Sampling [49.17137296715029]
We study the effect of importance sampling and devise schemes for sampling agents and data non-uniformly guided by a performance measure. We find that in schemes involving sampling without replacement, the performance of the resulting architecture is controlled by two factors related to data variability at each agent.
arXiv Detail & Related papers (2020-12-14T10:08:55Z)
Empirical Study on the Software Engineering Practices in Open Source ML Package Repositories [6.2894222252929985]
Modern Machine Learning technologies require considerable technical expertise and resources to develop, train and deploy such models. Such discovery and reuse by practitioners and researchers are being addressed by public ML package repositories. This paper conducts an exploratory study that analyzes the structure and contents of two popular ML package repositories.
arXiv Detail & Related papers (2020-12-02T18:52:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.