OSS License Identification at Scale: A Comprehensive Dataset Using World of Code
- URL: http://arxiv.org/abs/2409.04824v1
- Date: Sat, 7 Sep 2024 13:34:55 GMT
- Title: OSS License Identification at Scale: A Comprehensive Dataset Using World of Code
- Authors: Mahmoud Jahanshahi, David Reid, Adam McDaniel, Audris Mockus
- Abstract summary: We employ an exhaustive approach, scanning all files containing ``license'' in their filepath, and apply the winnowing algorithm for robust text matching.
Our method identifies and matches over 5.5 million distinct license blobs across millions of OSS projects, creating a detailed project-to-license (P2L) map.
- Score: 4.954816514146113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The proliferation of open source software (OSS) has led to a complex landscape of licensing practices, making accurate license identification crucial for legal and compliance purposes. This study presents a comprehensive analysis of OSS licenses using the World of Code (WoC) infrastructure. We employ an exhaustive approach, scanning all files containing ``license'' in their filepath, and apply the winnowing algorithm for robust text matching. Our method identifies and matches over 5.5 million distinct license blobs across millions of OSS projects, creating a detailed project-to-license (P2L) map. We verify the accuracy of our approach through stratified sampling and manual review, achieving a final accuracy of 92.08%, with precision of 87.14%, recall of 95.45%, and an F1 score of 91.11%. This work enhances the understanding of OSS licensing practices and provides a valuable resource for developers, researchers, and legal professionals. Future work will expand the scope of license detection to include code files and references to licenses in project documentation.
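The core matching step described in the abstract combines path filtering (files whose path contains ``license'') with winnowing fingerprints. Below is a minimal, illustrative Python sketch of that idea; the k-gram size, window size, and helper names are assumptions for illustration, not the authors' actual configuration or code.
```python
# Illustrative sketch of winnowing-based license matching
# (assumed parameters k=25, w=50; not the authors' configuration).
import hashlib
import re

def normalize(text: str) -> str:
    # Lowercase and strip punctuation/whitespace so cosmetic edits
    # do not change the fingerprint.
    return re.sub(r"[^a-z0-9]+", "", text.lower())

def kgram_hashes(text: str, k: int = 25) -> list:
    # Hash every k-character substring of the normalized text.
    return [
        int.from_bytes(hashlib.sha1(text[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(text) - k + 1)
    ]

def winnow(hashes: list, w: int = 50) -> set:
    # Winnowing selection rule: keep the minimum hash in every window
    # of w consecutive k-gram hashes, giving a compact fingerprint set.
    if not hashes:
        return set()
    fingerprints = set()
    for i in range(max(len(hashes) - w + 1, 1)):
        fingerprints.add(min(hashes[i:i + w]))
    return fingerprints

def similarity(blob_text: str, reference_text: str) -> float:
    # Jaccard overlap of the two fingerprint sets; values near 1.0
    # suggest the blob is a (near-)verbatim copy of the reference license.
    fa = winnow(kgram_hashes(normalize(blob_text)))
    fb = winnow(kgram_hashes(normalize(reference_text)))
    return len(fa & fb) / len(fa | fb) if fa | fb else 0.0

# A project-to-license (P2L) row could then record, for each blob whose
# path contains "license", the reference text with the highest similarity.
```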
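As a quick plausibility check, the reported F1 score follows directly from the reported precision and recall via the standard harmonic-mean formula:
```python
# F1 is the harmonic mean of precision and recall: F1 = 2PR / (P + R).
precision, recall = 0.8714, 0.9545
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9111, i.e. the 91.11% reported in the abstract
```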
Related papers
- OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks, and agent systems.
We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an ``open cookbook'' for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z)
- Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z)
- Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z)
- An Exploratory Investigation into Code License Infringements in Large Language Model Training Datasets [13.134215997081157]
We assess the current trends in the field and the importance of incorporating code into the training of large language models.
We examine publicly available datasets to see whether these models can be trained on them without the risk of legal issues in the future.
arXiv Detail & Related papers (2024-03-22T14:23:21Z)
- Catch the Butterfly: Peeking into the Terms and Conflicts among SPDX Licenses [16.948633594354412]
The use of third-party libraries (TPLs) in software development has accelerated the creation of modern software.
Developers may inadvertently violate the licenses of TPLs, leading to legal issues.
There is a need for a high-quality license dataset that encompasses a broad range of mainstream licenses.
arXiv Detail & Related papers (2024-01-19T11:27:34Z)
- Zero-Shot Detection of Machine-Generated Codes [83.0342513054389]
This work proposes a training-free approach for the detection of LLMs-generated codes.
We find that existing training-based or zero-shot text detectors are ineffective in detecting code.
Our method exhibits robustness against revision attacks and generalizes well to Java codes.
arXiv Detail & Related papers (2023-10-08T10:08:21Z)
- LiSum: Open Source Software License Summarization with Multi-Task Learning [16.521420821183995]
Open source software (OSS) licenses regulate the conditions under which users can reuse, modify, and distribute the software legally.
There exist various OSS licenses in the community, written in a formal language, which are typically long and complicated to understand.
Motivated by the user study and the fast growth of licenses in the community, we propose the first study towards automated license summarization.
arXiv Detail & Related papers (2023-09-10T16:43:51Z)
- The Software Heritage License Dataset (2022 Edition) [0.0]
The dataset consists of 6.9 million unique license files. Additional metadata about shipped license files is also provided.
The dataset can be used to conduct empirical studies on open source licensing, to train automated license classifiers, and to perform natural language processing (NLP) analyses of legal texts.
arXiv Detail & Related papers (2023-08-22T08:01:07Z)
- LiResolver: License Incompatibility Resolution for Open Source Software [13.28021004336228]
LiResolver is a fine-grained, scalable, and flexible tool to resolve license incompatibility issues for open source software.
Comprehensive experiments demonstrate the effectiveness of LiResolver, with 4.09% false positive (FP) rate and 0.02% false negative (FN) rate for incompatibility issue localization.
arXiv Detail & Related papers (2023-06-26T13:16:09Z)
- LLMDet: A Third Party Large Language Models Generated Text Detection Tool [119.0952092533317]
Text generated by large language models (LLMs) is remarkably close to high-quality human-authored text.
Existing detection tools can only differentiate between machine-generated and human-authored text.
We propose LLMDet, a model-specific, secure, efficient, and extendable detection tool.
arXiv Detail & Related papers (2023-05-24T10:45:16Z)
- FAT Forensics: A Python Toolbox for Implementing and Deploying Fairness, Accountability and Transparency Algorithms in Predictive Systems [69.24490096929709]
We developed an open source Python package called FAT Forensics.
It can inspect important fairness, accountability and transparency aspects of predictive algorithms.
Our toolbox can evaluate all elements of a predictive pipeline.
arXiv Detail & Related papers (2022-09-08T13:25:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.