The Software Heritage Open Science Ecosystem
- URL: http://arxiv.org/abs/2310.10295v1
- Date: Mon, 16 Oct 2023 11:32:03 GMT
- Title: The Software Heritage Open Science Ecosystem
- Authors: Roberto Di Cosmo (UPCité), Stefano Zacchiroli (IP Paris, LTCI)
- Abstract summary: Software Heritage is the largest public archive of software source code and associated development history.
It has archived more than 16 billion unique source code files coming from more than 250 million collaborative development projects.
It supports empirical research on software by materializing in a single Merkle directed acyclic graph the development history of public code.
It ensures availability and guarantees integrity of the source code of software artifacts used in any field that relies on software to conduct experiments.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Software Heritage is the largest public archive of software source code and
associated development history, as captured by modern version control systems.
As of July 2023, it has archived more than 16 billion unique source code files
coming from more than 250 million collaborative development projects. In this
chapter, we describe the Software Heritage ecosystem, focusing on research and
open science use cases. On the one hand, Software Heritage supports empirical
research on software by materializing in a single Merkle directed acyclic graph
the development history of public code. This giant graph of source code
artifacts (files, directories, and commits) can be used, and has been used, to
study repository forks, open source contributors, vulnerability propagation,
software provenance tracking, source code indexing, and more. On the other hand,
Software Heritage ensures availability and guarantees integrity of the source
code of software artifacts used in any field that relies on software to conduct
experiments, contributing to making research reproducible. The source code used
in scientific experiments can be archived (e.g., via integration with
open-access repositories), referenced using persistent identifiers that allow
downstream integrity checks, and linked to/from other scholarly digital
artifacts.
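As a rough illustration of the persistent identifiers mentioned in the abstract: a Software Heritage identifier (SWHID) for file content embeds the same SHA-1 that Git computes for a blob, so anyone holding the bytes can recompute the identifier and compare. The sketch below is a minimal Python example covering only this content case; the function names `content_swhid` and `matches_swhid` are hypothetical and are not taken from any Software Heritage library.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute the SWHID of a file's content.

    Assumption: only the content ("cnt") object type is handled here.
    The hash is the Git blob hash: SHA-1 over the header
    b"blob <length>\\0" followed by the raw bytes.
    """
    header = b"blob %d\0" % len(data)
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


def matches_swhid(data: bytes, swhid: str) -> bool:
    """Integrity check: do these bytes correspond to the cited identifier?"""
    return content_swhid(data) == swhid


if __name__ == "__main__":
    source = b"hello world\n"
    swhid = content_swhid(source)
    print(swhid)
    print(matches_swhid(source, swhid))           # unchanged bytes
    print(matches_swhid(source + b"#", swhid))    # tampered bytes
```

Because the identifier is derived from the content itself, a reader of a paper that cites it can verify a retrieved file without trusting the server it came from. Directories and commits are hashed over the identifiers of their children, which is what makes the archive a Merkle graph; that part is not shown here.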
Related papers
- Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z)
- An Overview and Catalogue of Dependency Challenges in Open Source Software Package Registries [52.23798016734889]
This article provides a catalogue of dependency-related challenges that come with relying on OSS packages or libraries.
The catalogue is based on the scientific literature on empirical research that has been conducted to understand, quantify and overcome these challenges.
arXiv Detail & Related papers (2024-09-27T16:20:20Z)
- Knowledge Islands: Visualizing Developers Knowledge Concentration [0.0]
Knowledge Islands is a tool that visualizes the concentration of knowledge in a software repository using a state-of-the-art knowledge model.
It enables practitioners to analyze GitHub projects, determine where knowledge is concentrated, and implement measures to maintain project health.
arXiv Detail & Related papers (2024-08-16T13:32:49Z)
- How to Understand Whole Software Repository? [64.19431011897515]
An excellent understanding of the whole repository will be the critical path to Automatic Software Engineering (ASE).
We develop a novel method named RepoUnderstander by guiding agents to comprehensively understand whole repositories.
To better utilize the repository-level knowledge, we guide the agents to summarize, analyze, and plan.
arXiv Detail & Related papers (2024-06-03T15:20:06Z)
- Source Code Archiving to the Rescue of Reproducible Deployment [2.53740603524637]
We describe our work connecting Guix with Software Heritage, the universal source code archive, making Guix the first free software distribution and tool backed by a stable archive.
Our contribution is twofold: first, we explain the rationale and present the design and implementation we came up with; second, we report on the archival coverage for package source code with data collected over five years and discuss remaining challenges.
arXiv Detail & Related papers (2024-05-24T13:00:28Z)
- Dataset: Copy-based Reuse in Open Source Software [5.917654223291073]
In Open Source Software, the source code and any other resources available in a project can be viewed or reused by anyone subject to often permissive licensing restrictions.
This dataset seeks to encourage the studies of OSS-wide copy-based reuse by providing copying activity data that captures whole-file reuse in nearly all OSS.
arXiv Detail & Related papers (2023-12-14T22:08:09Z)
- SciCat: A Curated Dataset of Scientific Software Repositories [4.77982299447395]
We introduce the SciCat dataset -- a comprehensive collection of Free-Libre Open Source Software (FLOSS) projects.
Our approach involves selecting projects from a pool of 131 million deforked repositories from the World of Code data source.
Our classification focuses on software designed for scientific purposes, research-related projects, and research support software.
arXiv Detail & Related papers (2023-12-11T13:46:33Z)
- Collaborative, Code-Proximal Dynamic Software Visualization within Code Editors [55.57032418885258]
This paper introduces the design and proof-of-concept implementation for a software visualization approach that can be embedded into code editors.
Our contribution differs from related work in that we use dynamic analysis of a software system's runtime behavior.
Our visualization approach enhances common remote pair programming tools and is collaboratively usable by employing shared code cities.
arXiv Detail & Related papers (2023-08-30T06:35:40Z)
- RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation [96.75695811963242]
RepoCoder is a framework to streamline the repository-level code completion process.
It incorporates a similarity-based retriever and a pre-trained code language model.
It consistently outperforms the vanilla retrieval-augmented code completion approach.
arXiv Detail & Related papers (2023-03-22T13:54:46Z)
- LabelGit: A Dataset for Software Repositories Classification using Attributed Dependency Graphs [11.523471275501857]
We create a new dataset of GitHub projects called LabelGit.
Our dataset uses direct information from the source code, like the dependency graph and source code neural representations from the identifiers.
We hope to aid the development of solutions that do not rely on proxies but use the entire source code to perform classification.
arXiv Detail & Related papers (2021-03-16T07:28:58Z)
- Nine Best Practices for Research Software Registries and Repositories: A Concise Guide [63.52960372153386]
We present a set of nine best practices that can help managers define the scope, practices, and rules that govern individual registries and repositories.
These best practices were distilled from the experiences of the creators of existing resources, convened by a Task Force of the FORCE11 Software Citation Implementation Working Group during the years 2011 and 2012.
arXiv Detail & Related papers (2020-12-24T05:37:54Z)