The Software Heritage License Dataset (2022 Edition)
- URL: http://arxiv.org/abs/2308.11258v1
- Date: Tue, 22 Aug 2023 08:01:07 GMT
- Title: The Software Heritage License Dataset (2022 Edition)
- Authors: Jesús M. González-Barahona (URJC), Sergio Montes-Leon (URJC),
Gregorio Robles (URJC), Stefano Zacchiroli (IP Paris, LTCI)
- Abstract summary: The dataset consists of 6.9 million unique license files. Additional metadata about shipped license files is also provided.
The dataset can be used to conduct empirical studies on open source licensing, training of automated license classifiers, and natural language processing (NLP) analyses of legal texts.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Context: When software is released publicly, it is common to include with it
either the full text of the license or licenses under which it is published, or
a detailed reference to them. Therefore public licenses, including FOSS (free,
open source software) licenses, are usually publicly available in source code
repositories.
Objective: To compile a dataset containing as many documents as
possible that contain the text of software licenses, or references to the
license terms. Once compiled, characterize the dataset so that it can be used
for further research, or practical purposes related to license analysis.
Method: Retrieve from Software Heritage, the largest publicly available archive
of FOSS source code, all versions of all files whose names are commonly used to
convey licensing terms. All retrieved documents are characterized in various
ways, using automated and manual analyses.
Results: The dataset consists of 6.9
million unique license files. Additional metadata about shipped license files
is also provided, making the dataset ready to use in various contexts,
including: file length measures, MIME type, SPDX license (detected using
ScanCode), and oldest appearance. The results of a manual analysis of 8102
documents are also included, providing a ground truth for further analysis. The
dataset is released as open data as an archive file containing all deduplicated
license files, plus several portable CSV files with metadata, referencing files
via cryptographic checksums.
Conclusions: Thanks to the extensive coverage of
Software Heritage, the dataset presented in this paper covers a very large
fraction of all software licenses for public code. We have assembled a large
body of software licenses, characterized it quantitatively and qualitatively,
and validated that it is mostly composed of licensing information and includes
almost all known license texts. The dataset can be used to conduct empirical
studies on open source licensing, training of automated license classifiers,
natural language processing (NLP) analyses of legal texts, as well as
historical and phylogenetic studies on FOSS licensing. It can also be used in
practice to improve tools detecting licenses in source code.
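Because the metadata CSVs reference license files via cryptographic checksums, consumers of the dataset can verify the integrity of each deduplicated blob against its recorded digest. A minimal sketch in Python, assuming a hypothetical CSV with a `sha256` column and blobs stored in a directory under their checksum (the actual dataset's column names and layout may differ):

```python
import csv
import hashlib
from pathlib import Path

def verify_license_file(csv_path: str, files_dir: str) -> list[str]:
    """Check each archived license file against the checksum listed in the
    metadata CSV; return the checksums of any files whose content does not
    match. Column name 'sha256' and the layout are assumptions for this
    sketch, not the dataset's documented schema."""
    mismatches = []
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            blob = Path(files_dir, row["sha256"])
            if not blob.exists():
                continue  # blob not extracted locally; skip
            digest = hashlib.sha256(blob.read_bytes()).hexdigest()
            if digest != row["sha256"]:
                mismatches.append(row["sha256"])
    return mismatches
```

Content-addressing by checksum is what makes the deduplication in the dataset safe: two files with the same digest are byte-identical, so each unique license text is stored exactly once.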
Related papers
- OSS License Identification at Scale: A Comprehensive Dataset Using World of Code [4.954816514146113]
We employ an exhaustive approach, scanning all files containing "license" in their filepath, and apply the winnowing algorithm for robust text matching.
Our method identifies and matches over 5.5 million distinct license blobs across millions of OSS projects, creating a detailed project-to-license (P2L) map.
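The winnowing algorithm referenced above (Schleimer et al., 2003) selects a resilient subset of k-gram hashes as a document fingerprint, so near-duplicate license texts still share fingerprints despite small edits. A minimal sketch, not the paper's implementation; `k` and `w` here are illustrative values:

```python
import hashlib

def winnow(text: str, k: int = 5, w: int = 4) -> set[int]:
    """Winnowing fingerprints: hash every k-gram of the normalized text,
    then keep the minimum hash in each sliding window of w consecutive
    hashes. Shared fingerprints indicate shared text."""
    text = "".join(text.lower().split())  # strip whitespace, fold case
    if len(text) < k:
        return set()
    hashes = [
        int.from_bytes(hashlib.sha1(text[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(text) - k + 1)
    ]
    fingerprints = set()
    for i in range(max(1, len(hashes) - w + 1)):
        fingerprints.add(min(hashes[i:i + w]))
    return fingerprints

def similarity(a: str, b: str) -> float:
    """Jaccard similarity of two documents' winnowing fingerprints."""
    fa, fb = winnow(a), winnow(b)
    return len(fa & fb) / len(fa | fb) if fa | fb else 0.0
```

Taking the window minimum rather than every hash keeps fingerprints sparse while guaranteeing that any sufficiently long shared substring contributes at least one common fingerprint.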
arXiv Detail & Related papers (2024-09-07T13:34:55Z)
- An Exploratory Investigation into Code License Infringements in Large Language Model Training Datasets [13.134215997081157]
We assess the current trends in the field and the importance of incorporating code into the training of large language models.
We examine publicly available datasets to see whether these models can be trained on them without the risk of legal issues in the future.
arXiv Detail & Related papers (2024-03-22T14:23:21Z)
- Catch the Butterfly: Peeking into the Terms and Conflicts among SPDX Licenses [16.948633594354412]
The use of third-party libraries (TPLs) in software development has accelerated the creation of modern software.
Developers may inadvertently violate the licenses of TPLs, leading to legal issues.
There is a need for a high-quality license dataset that encompasses a broad range of mainstream licenses.
arXiv Detail & Related papers (2024-01-19T11:27:34Z)
- mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model [73.38800189095173]
This work focuses on strengthening the multi-modal diagram analysis ability of Multimodal LLMs.
By parsing Latex source files of high-quality papers, we carefully build a multi-modal diagram understanding dataset M-Paper.
M-Paper is the first dataset to support joint comprehension of multiple scientific diagrams, including figures and tables in the format of images or Latex codes.
arXiv Detail & Related papers (2023-11-30T04:43:26Z)
- LILO: Learning Interpretable Libraries by Compressing and Documenting Code [71.55208585024198]
We introduce LILO, a neurosymbolic framework that iteratively synthesizes, compresses, and documents code.
LILO combines LLM-guided program synthesis with recent algorithmic advances in automated refactoring from Stitch.
We find that AutoDoc boosts performance by helping LILO's synthesizer to interpret and deploy learned abstractions.
arXiv Detail & Related papers (2023-10-30T17:55:02Z)
- The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI [41.32981860191232]
Legal and machine learning experts systematically audit and trace 1800+ text datasets.
Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets.
We find frequent miscategorization of licenses on widely used dataset hosting sites, with license omission rates of 70%+ and error rates of 50%+.
arXiv Detail & Related papers (2023-10-25T17:20:26Z)
- Source Attribution for Large Language Model-Generated Data [57.85840382230037]
It is imperative to be able to perform source attribution by identifying the data provider who contributed to the generation of a synthetic text.
We show that this problem can be tackled by watermarking.
We propose a source attribution framework that satisfies these key properties due to our algorithmic designs.
arXiv Detail & Related papers (2023-10-01T12:02:57Z)
- LiSum: Open Source Software License Summarization with Multi-Task Learning [16.521420821183995]
Open source software (OSS) licenses regulate the conditions under which users can reuse, modify, and distribute the software legally.
There exist various OSS licenses in the community, written in a formal language, which are typically long and complicated to understand.
Motivated by the user study and the fast growth of licenses in the community, we propose the first study towards automated license summarization.
arXiv Detail & Related papers (2023-09-10T16:43:51Z)
- Document Layout Annotation: Database and Benchmark in the Domain of Public Affairs [62.38140271294419]
We propose a procedure to semi-automatically annotate digital documents with different layout labels.
We collect a novel database for DLA in the public affairs domain using a set of 24 data sources from the Spanish Administration.
The results of our experiments validate the proposed text labeling procedure with accuracy up to 99%.
arXiv Detail & Related papers (2023-06-12T08:21:50Z)
- Can I use this publicly available dataset to build commercial AI software? Most likely not [8.853674186565934]
We propose a new approach to assess the potential license compliance violations if a given publicly available dataset were to be used for building commercial AI software.
Our results show that there are risks of license violations on 5 of these 6 studied datasets if they were used for commercial purposes.
arXiv Detail & Related papers (2021-11-03T17:44:06Z)
- Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.