Dataset: Copy-based Reuse in Open Source Software
- URL: http://arxiv.org/abs/2312.09370v1
- Date: Thu, 14 Dec 2023 22:08:09 GMT
- Title: Dataset: Copy-based Reuse in Open Source Software
- Authors: Mahmoud Jahanshahi, Audris Mockus
- Abstract summary: In Open Source Software, the source code and any other resources available in a project can be viewed or reused by anyone subject to often permissive licensing restrictions.
This dataset seeks to encourage the studies of OSS-wide copy-based reuse by providing copying activity data that captures whole-file reuse in nearly all OSS.
- Score: 5.917654223291073
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In Open Source Software, the source code and any other resources available in
a project can be viewed or reused by anyone subject to often permissive
licensing restrictions. In contrast to some studies of dependency-based reuse
supported via package managers, no studies of OSS-wide copy-based reuse exist.
This dataset seeks to encourage the studies of OSS-wide copy-based reuse by
providing copying activity data that captures whole-file reuse in nearly all
OSS. To accomplish that, we develop approaches to detect copy-based reuse by
developing an efficient algorithm that exploits World of Code infrastructure: a
curated and cross referenced collection of nearly all open source repositories.
We expect this data to enable future research and tool development that support
such reuse and minimize associated risks.
Related papers
- An Overview and Catalogue of Dependency Challenges in Open Source Software Package Registries [52.23798016734889]
This article provides a catalogue of dependency-related challenges that come with relying on OSS packages or libraries.
The catalogue is based on the scientific literature on empirical research that has been conducted to understand, quantify and overcome these challenges.
arXiv Detail & Related papers (2024-09-27T16:20:20Z) - Beyond Dependencies: The Role of Copy-Based Reuse in Open Source Software Development [5.412781090113212]
In Open Source Software, resources of any project are open for reuse by introducing dependencies or copying the resource itself.
Our aim is to enable future research and tool development to increase efficiency and reduce the risks of copy-based reuse.
arXiv Detail & Related papers (2024-09-07T13:50:40Z) - OSS License Identification at Scale: A Comprehensive Dataset Using World of Code [4.954816514146113]
We employ an exhaustive approach, scanning all files containing license'' in their filepath, and apply the winnowing algorithm for robust text matching.
Our method identifies and matches over 5.5 million distinct license blobs across millions of OSS projects, creating a detailed project-to-license (P2L) map.
arXiv Detail & Related papers (2024-09-07T13:34:55Z) - How to Understand Whole Software Repository? [64.19431011897515]
An excellent understanding of the whole repository will be the critical path to Automatic Software Engineering (ASE)
We develop a novel method named RepoUnderstander by guiding agents to comprehensively understand the whole repositories.
To better utilize the repository-level knowledge, we guide the agents to summarize, analyze, and plan.
arXiv Detail & Related papers (2024-06-03T15:20:06Z) - Source Code Archiving to the Rescue of Reproducible Deployment [2.53740603524637]
We describe our work connecting Guix with Software Heritage, the universal source code archive, making Guix the first free software distribution and tool backed by a stable archive.
Our contribution is twofold: we explain the rationale and present the design and implementation we came up with; second, we report on the archival coverage for package source code with data collected over five years and discuss remaining challenges.
arXiv Detail & Related papers (2024-05-24T13:00:28Z) - The Software Heritage Open Science Ecosystem [0.0]
Software Heritage is the largest public archive of software source code and associated development history.
It has archived more than 16 billion unique source code files coming from more than 250 million collaborative development projects.
It supports empirical research on software by materializing in a single Merkle direct acyclic graph the development history of public code.
It ensures availability and guarantees integrity of the source code of software artifacts used in any field that relies on software to conduct experiments.
arXiv Detail & Related papers (2023-10-16T11:32:03Z) - RepoCoder: Repository-Level Code Completion Through Iterative Retrieval
and Generation [96.75695811963242]
RepoCoder is a framework to streamline the repository-level code completion process.
It incorporates a similarity-based retriever and a pre-trained code language model.
It consistently outperforms the vanilla retrieval-augmented code completion approach.
arXiv Detail & Related papers (2023-03-22T13:54:46Z) - Repro: An Open-Source Library for Improving the Reproducibility and
Usability of Publicly Available Research Code [74.28810048824519]
Repro is an open-source library which aims at improving the usability of research code.
It provides a lightweight Python API for running software released by researchers within Docker containers.
arXiv Detail & Related papers (2022-04-29T01:54:54Z) - Defining the role of open source software in research reproducibility [0.0]
I make a new proposal for the role of open source software.
I look for explanation of its success from the perspectives of connectivism.
I contend that engenders trust, which we routinely build in community via conversations.
arXiv Detail & Related papers (2022-04-26T19:52:47Z) - Nine Best Practices for Research Software Registries and Repositories: A
Concise Guide [63.52960372153386]
We present a set of nine best practices that can help managers define the scope, practices, and rules that govern individual registries and repositories.
These best practices were distilled from the experiences of the creators of existing resources, convened by a Task Force of the FORCE11 Software Implementation Working Group during the years 2011 and 2012.
arXiv Detail & Related papers (2020-12-24T05:37:54Z) - Universal Source-Free Domain Adaptation [57.37520645827318]
We propose a novel two-stage learning process for domain adaptation.
In the Procurement stage, we aim to equip the model for future source-free deployment, assuming no prior knowledge of the upcoming category-gap and domain-shift.
In the Deployment stage, the goal is to design a unified adaptation algorithm capable of operating across a wide range of category-gaps.
arXiv Detail & Related papers (2020-04-09T07:26:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.