The OCEAN mailing list data set: Network analysis spanning mailing lists
and code repositories
- URL: http://arxiv.org/abs/2204.00603v1
- Date: Fri, 1 Apr 2022 17:50:15 GMT
- Title: The OCEAN mailing list data set: Network analysis spanning mailing lists
and code repositories
- Authors: Melanie Warrick, Samuel F. Rosenblatt, Jean-Gabriel Young, Amanda
Casari, Laurent Hébert-Dufresne, James Bagrow
- Abstract summary: We combine and standardize mailing lists of the Python community, resulting in 954,287 messages from 1995 to the present.
To showcase the usefulness of these data, we focus on the CPython repository and merge the technical layer with the social layer.
We discuss how these data provide a laboratory to test theories from standard organizational science in large open source projects.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Communication surrounding the development of an open source project largely
occurs outside the software repository itself. Historically, large communities
often used a collection of mailing lists to discuss the different aspects of
their projects. Multimodal tool use, with software development and
communication happening on different channels, complicates the study of open
source projects as a sociotechnical system. Here, we combine and standardize
mailing lists of the Python community, resulting in 954,287 messages from 1995
to the present. We share all scraping and cleaning code to facilitate
reproduction of this work, as well as smaller datasets for the Golang (122,721
messages), Angular (20,041 messages) and Node.js (12,514 messages) communities.
To showcase the usefulness of these data, we focus on the CPython repository
and merge the technical layer (which GitHub account works on what file and with
whom) with the social layer (messages from unique email addresses) by
identifying 33% of GitHub contributors in the mailing list data. We then
explore correlations between the valence of social messaging and the structure
of the collaboration network. We discuss how these data provide a laboratory to
test theories from standard organizational science in large open source
projects.
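To make the merging step concrete, the sketch below joins a hypothetical social layer (mailing-list messages) with a hypothetical technical layer (CPython commit records) by email matching and builds a co-editing network. The file names, column names, and exact-email heuristic are illustrative assumptions, not the paper's released pipeline.

```python
import pandas as pd
import networkx as nx

# Hypothetical inputs; the released dataset's actual files and columns may differ.
# messages.csv: one row per mailing-list message (sender_email, valence)
# commits.csv:  one row per (github_login, author_email, file) edit in CPython
messages = pd.read_csv("messages.csv")
commits = pd.read_csv("commits.csv")

# Social layer joined to technical layer by matching normalized email addresses.
messages["email"] = messages["sender_email"].str.strip().str.lower()
commits["email"] = commits["author_email"].str.strip().str.lower()
matched = commits.merge(messages, on="email", how="inner")

share = matched["github_login"].nunique() / commits["github_login"].nunique()
print(f"GitHub contributors found on the mailing lists: {share:.0%}")

# Technical layer as a collaboration network: contributors who touched the
# same file are connected.
G = nx.Graph()
for _, group in commits.groupby("file"):
    logins = list(group["github_login"].unique())
    G.add_edges_from(
        (a, b) for i, a in enumerate(logins) for b in logins[i + 1:]
    )

# Attach each matched contributor's mean message valence as a node attribute
# so correlations with network structure (e.g. degree) can be explored.
valence = matched.groupby("github_login")["valence"].mean()
nx.set_node_attributes(G, valence.to_dict(), "mean_valence")
```

Exact email matching is only one possible linking heuristic and will miss contributors who use different addresses in the two venues; the paper reports identifying about 33% of GitHub contributors in the mailing list data, so any such join should be treated as partial.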
Related papers
- Repository-level Code Search with Neural Retrieval Methods [25.222964965449286]
We define the task of repository-level code search as retrieving the set of files from the current state of a code repository that are most relevant to addressing a user's question or bug.
The proposed approach combines BM25-based retrieval over commit messages with neural reranking using CodeBERT to identify the most pertinent files.
Experiments on a new dataset created from 7 popular open-source repositories demonstrate substantial improvements of up to 80% in MAP, MRR and P@1 over the BM25 baseline.
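As a rough illustration of the retrieve-then-rerank idea described in that paper, the sketch below scores files by BM25 over their commit messages using the third-party rank_bm25 package; the toy corpus and the choice of library are assumptions, and the CodeBERT reranking stage is only indicated in a comment.

```python
from rank_bm25 import BM25Okapi  # third-party BM25 implementation (assumed choice)

# Toy corpus: commit messages grouped per file in a repository (illustration only).
commit_messages = {
    "Lib/asyncio/tasks.py": ["fix race in task cancellation", "speed up gather"],
    "Doc/library/asyncio.rst": ["document task cancellation semantics"],
}

files = list(commit_messages)
corpus = [" ".join(msgs).lower().split() for msgs in commit_messages.values()]
bm25 = BM25Okapi(corpus)

query = "bug when cancelling asyncio tasks".lower().split()
scores = bm25.get_scores(query)

# Keep the top candidates; in the approach above these would then be reranked
# with a neural model such as CodeBERT before being returned to the user.
ranked = sorted(zip(files, scores), key=lambda pair: pair[1], reverse=True)
print(ranked[:1])
```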
arXiv Detail & Related papers (2025-02-10T21:59:01Z)
- SEART Data Hub: Streamlining Large-Scale Source Code Mining and Pre-Processing [13.717170962455526]
We present the SEART Data Hub, a web application that lets researchers easily build and pre-process large-scale datasets featuring code mined from public GitHub repositories.
Through a simple web interface, researchers can specify a set of mining criteria as well as specific pre-processing steps they want to perform.
After submitting the request, the user will receive an email with a download link for the required dataset within a few hours.
arXiv Detail & Related papers (2024-09-27T11:42:19Z)
- Long Code Arena: a Set of Benchmarks for Long-Context Code Models [75.70507534322336]
Long Code Arena is a suite of six benchmarks for code processing tasks that require project-wide context.
These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization.
For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions.
arXiv Detail & Related papers (2024-06-17T14:58:29Z)
- Causal-learn: Causal Discovery in Python [53.17423883919072]
Causal discovery aims at revealing causal relations from observational data.
causal-learn is an open-source Python library for causal discovery.
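A minimal usage sketch, assuming the PC-algorithm entry point documented for causal-learn; the synthetic chain data stands in for real observations.

```python
import numpy as np
from causallearn.search.ConstraintBased.PC import pc  # causal-learn's PC algorithm

# Synthetic observational data with a simple chain X0 -> X1 -> X2 (illustration only).
rng = np.random.default_rng(0)
x0 = rng.normal(size=1000)
x1 = 0.8 * x0 + rng.normal(size=1000)
x2 = 0.8 * x1 + rng.normal(size=1000)
data = np.column_stack([x0, x1, x2])

cg = pc(data)   # constraint-based causal discovery
print(cg.G)     # estimated causal graph over the three variables
```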
arXiv Detail & Related papers (2023-07-31T05:00:35Z)
- Knowledge-Grounded Conversational Data Augmentation with Generative Conversational Networks [76.11480953550013]
We take a step towards automatically generating conversational data using Generative Conversational Networks.
We evaluate our approach on conversations with and without knowledge on the Topical Chat dataset.
arXiv Detail & Related papers (2022-07-22T22:37:14Z)
- PyTorch Geometric Signed Directed: A Software Package on Graph Neural Networks for Signed and Directed Graphs [20.832917829426098]
PyTorch Geometric Signed Directed (PyGSD) is a software package for signed and directed networks.
PyGSD consists of easy-to-use GNN models, synthetic and real-world data, as well as task-specific evaluation metrics and loss functions.
As an extension library for PyG, our proposed software is maintained with open-source releases, detailed documentation, continuous integration, unit tests and code coverage checks.
arXiv Detail & Related papers (2022-02-22T10:25:59Z)
- LAGOON: An Analysis Tool for Open Source Communities [7.3861897382622015]
LAGOON is an open source platform for understanding the ecosystems of Open Source Software (OSS) communities.
LAGOON ingests artifacts from several common sources, including source code repositories, issue trackers, mailing lists, and content scraped from websites.
A user interface is provided for visualization and exploration of an OSS project's complete sociotechnical graph.
arXiv Detail & Related papers (2022-01-26T18:52:11Z)
- Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z)
- Datasets: A Community Library for Natural Language Processing [55.48866401721244]
datasets is a community library for contemporary NLP.
The library includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects.
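For a sense of the library's interface, loading a hosted dataset takes a couple of lines; the dataset name used below is just a common public example.

```python
from datasets import load_dataset  # the Hugging Face `datasets` library

# Download (and cache) a hosted dataset by name; splits are Arrow-backed tables.
squad = load_dataset("squad", split="train")
print(len(squad), "training examples")
print(squad[0]["question"])
```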
arXiv Detail & Related papers (2021-09-07T03:59:22Z)
- LabelGit: A Dataset for Software Repositories Classification using Attributed Dependency Graphs [11.523471275501857]
We create a new dataset of GitHub projects called LabelGit.
Our dataset uses information taken directly from the source code, such as the dependency graph and neural representations of source code identifiers.
We hope to aid the development of solutions that do not rely on proxies but use the entire source code to perform classification.
arXiv Detail & Related papers (2021-03-16T07:28:58Z)
- Distributed Learning in the Non-Convex World: From Batch to Streaming Data, and Beyond [73.03743482037378]
Distributed learning has become a critical direction of the massively connected world envisioned by many.
This article discusses four key elements of scalable distributed processing and real-time data computation problems.
Practical issues and future research will also be discussed.
arXiv Detail & Related papers (2020-01-14T14:11:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.