Related papers: Dataset: Copy-based Reuse in Open Source Software

Dataset: Copy-based Reuse in Open Source Software

URL: http://arxiv.org/abs/2312.09370v1
Date: Thu, 14 Dec 2023 22:08:09 GMT
Title: Dataset: Copy-based Reuse in Open Source Software
Authors: Mahmoud Jahanshahi, Audris Mockus
Abstract summary: In Open Source Software, the source code and any other resources available in a project can be viewed or reused by anyone subject to often permissive licensing restrictions. This dataset seeks to encourage the studies of OSS-wide copy-based reuse by providing copying activity data that captures whole-file reuse in nearly all OSS.
Score: 5.917654223291073
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: In Open Source Software, the source code and any other resources available in a project can be viewed or reused by anyone subject to often permissive licensing restrictions. In contrast to some studies of dependency-based reuse supported via package managers, no studies of OSS-wide copy-based reuse exist. This dataset seeks to encourage the studies of OSS-wide copy-based reuse by providing copying activity data that captures whole-file reuse in nearly all OSS. To accomplish that, we develop approaches to detect copy-based reuse by developing an efficient algorithm that exploits World of Code infrastructure: a curated and cross referenced collection of nearly all open source repositories. We expect this data to enable future research and tool development that support such reuse and minimize associated risks.

Related papers

A Retrieval-Augmented Generation Approach to Extracting Algorithmic Logic from Neural Networks [48.83701310501069]
We introduce NN-RAG, a retrieval-augmented generation system that converts large, heterogeneous PyTorchs into a searchable library of validated neural modules.<n>Applying to 19 major repositories, the pipeline extracted 1,289 candidate blocks, validated 941 (73.0%), and demonstrated that over 80% are structurally unique.
arXiv Detail & Related papers (2025-12-03T23:28:30Z)
Open Source at a Crossroads: The Future of Licensing Driven by Monetization [11.149764135999437]
Open Source Software Licenses (OSS licenses) ensure that software can be sold or distributed as part of aggregate programs from various sources without requiring a royalty or fee. We argue that open source is at a crossroads, with a growing need to redefine its licensing models and support communities and critical software.
arXiv Detail & Related papers (2025-03-04T17:44:01Z)
Making Software FAIR: A machine-assisted workflow for the research software lifecycle [2.682583873311538]
SoFAIR will extend the capabilities of widely used open scholarly infrastructures. It will deliver and deploy an effective solution for the management of the research software lifecycle.
arXiv Detail & Related papers (2025-01-08T14:17:26Z)
An Overview and Catalogue of Dependency Challenges in Open Source Software Package Registries [52.23798016734889]
This article provides a catalogue of dependency-related challenges that come with relying on OSS packages or libraries. The catalogue is based on the scientific literature on empirical research that has been conducted to understand, quantify and overcome these challenges.
arXiv Detail & Related papers (2024-09-27T16:20:20Z)
Beyond Dependencies: The Role of Copy-Based Reuse in Open Source Software Development [5.412781090113212]
In Open Source Software, resources of any project are open for reuse by introducing dependencies or copying the resource itself. Our aim is to enable future research and tool development to increase efficiency and reduce the risks of copy-based reuse.
arXiv Detail & Related papers (2024-09-07T13:50:40Z)
OSS License Identification at Scale: A Comprehensive Dataset Using World of Code [4.954816514146113]
We employ an exhaustive approach, scanning all files containing license'' in their filepath, and apply the winnowing algorithm for robust text matching. Our method identifies and matches over 5.5 million distinct license blobs across millions of OSS projects, creating a detailed project-to-license (P2L) map.
arXiv Detail & Related papers (2024-09-07T13:34:55Z)
How to Understand Whole Software Repository? [64.19431011897515]
An excellent understanding of the whole repository will be the critical path to Automatic Software Engineering (ASE) We develop a novel method named RepoUnderstander by guiding agents to comprehensively understand the whole repositories. To better utilize the repository-level knowledge, we guide the agents to summarize, analyze, and plan.
arXiv Detail & Related papers (2024-06-03T15:20:06Z)
Source Code Archiving to the Rescue of Reproducible Deployment [2.53740603524637]
We describe our work connecting Guix with Software Heritage, the universal source code archive, making Guix the first free software distribution and tool backed by a stable archive. Our contribution is twofold: we explain the rationale and present the design and implementation we came up with; second, we report on the archival coverage for package source code with data collected over five years and discuss remaining challenges.
arXiv Detail & Related papers (2024-05-24T13:00:28Z)
ReGAL: Refactoring Programs to Discover Generalizable Abstractions [59.05769810380928]
Generalizable Abstraction Learning (ReGAL) is a method for learning a library of reusable functions via codeization. We find that the shared function libraries discovered by ReGAL make programs easier to predict across diverse domains. For CodeLlama-13B, ReGAL results in absolute accuracy increases of 11.5% on LOGO, 26.1% on date understanding, and 8.1% on TextCraft, outperforming GPT-3.5 in two of three domains.
arXiv Detail & Related papers (2024-01-29T18:45:30Z)
The Software Heritage Open Science Ecosystem [0.0]
Software Heritage is the largest public archive of software source code and associated development history. It has archived more than 16 billion unique source code files coming from more than 250 million collaborative development projects. It supports empirical research on software by materializing in a single Merkle direct acyclic graph the development history of public code. It ensures availability and guarantees integrity of the source code of software artifacts used in any field that relies on software to conduct experiments.
arXiv Detail & Related papers (2023-10-16T11:32:03Z)
RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation [96.75695811963242]
RepoCoder is a framework to streamline the repository-level code completion process. It incorporates a similarity-based retriever and a pre-trained code language model. It consistently outperforms the vanilla retrieval-augmented code completion approach.
arXiv Detail & Related papers (2023-03-22T13:54:46Z)
Repro: An Open-Source Library for Improving the Reproducibility and Usability of Publicly Available Research Code [74.28810048824519]
Repro is an open-source library which aims at improving the usability of research code. It provides a lightweight Python API for running software released by researchers within Docker containers.
arXiv Detail & Related papers (2022-04-29T01:54:54Z)
Defining the role of open source software in research reproducibility [0.0]
I make a new proposal for the role of open source software. I look for explanation of its success from the perspectives of connectivism. I contend that engenders trust, which we routinely build in community via conversations.
arXiv Detail & Related papers (2022-04-26T19:52:47Z)
Nine Best Practices for Research Software Registries and Repositories: A Concise Guide [63.52960372153386]
We present a set of nine best practices that can help managers define the scope, practices, and rules that govern individual registries and repositories. These best practices were distilled from the experiences of the creators of existing resources, convened by a Task Force of the FORCE11 Software Implementation Working Group during the years 2011 and 2012.
arXiv Detail & Related papers (2020-12-24T05:37:54Z)
Universal Source-Free Domain Adaptation [57.37520645827318]
We propose a novel two-stage learning process for domain adaptation. In the Procurement stage, we aim to equip the model for future source-free deployment, assuming no prior knowledge of the upcoming category-gap and domain-shift. In the Deployment stage, the goal is to design a unified adaptation algorithm capable of operating across a wide range of category-gaps.
arXiv Detail & Related papers (2020-04-09T07:26:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.