Related papers: An Exploratory Investigation into Code License Infringements in Large Language Model Training Datasets

An Exploratory Investigation into Code License Infringements in Large Language Model Training Datasets

URL: http://arxiv.org/abs/2403.15230v1
Date: Fri, 22 Mar 2024 14:23:21 GMT
Title: An Exploratory Investigation into Code License Infringements in Large Language Model Training Datasets
Authors: Jonathan Katzy, Răzvan-Mihai Popescu, Arie van Deursen, Maliheh Izadi,
Abstract summary: We assess the current trends in the field and the importance of incorporating code into the training of large language models. We examine publicly available datasets to see whether these models can be trained on them without the risk of legal issues in the future.
Score: 13.134215997081157
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Does the training of large language models potentially infringe upon code licenses? Furthermore, are there any datasets available that can be safely used for training these models without violating such licenses? In our study, we assess the current trends in the field and the importance of incorporating code into the training of large language models. Additionally, we examine publicly available datasets to see whether these models can be trained on them without the risk of legal issues in the future. To accomplish this, we compiled a list of 53 large language models trained on file-level code. We then extracted their datasets and analyzed how much they overlap with a dataset we created, consisting exclusively of strong copyleft code. Our analysis revealed that every dataset we examined contained license inconsistencies, despite being selected based on their associated repository licenses. We analyzed a total of 514 million code files, discovering 38 million exact duplicates present in our strong copyleft dataset. Additionally, we examined 171 million file-leading comments, identifying 16 million with strong copyleft licenses and another 11 million comments that discouraged copying without explicitly mentioning a license. Based on the findings of our study, which highlights the pervasive issue of license inconsistencies in large language models trained on code, our recommendation for both researchers and the community is to prioritize the development and adoption of best practices for dataset creation and management.

Related papers

Hey, That's My Data! Label-Only Dataset Inference in Large Language Models [63.35066172530291]
CatShift is a label-only dataset-inference framework.<n>It capitalizes on catastrophic forgetting: the tendency of an LLM to overwrite previously learned knowledge when exposed to new data.
arXiv Detail & Related papers (2025-06-06T13:02:59Z)
Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training [6.00143998001152]
We introduce Common Corpus, the largest open dataset for language model pre-training.<n>The dataset contains a wide variety of languages, ranging from the main European languages to low-resource ones rarely present in pre-training datasets.
arXiv Detail & Related papers (2025-06-02T14:43:15Z)
Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models [52.439289085318634]
We show how to identify training data known to proprietary large language models (LLMs) by using information-guided probes. Our work builds on a key observation: text passages with high surprisal are good search material for memorization probes.
arXiv Detail & Related papers (2025-03-15T10:19:15Z)
The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models [13.134215997081157]
We release The Heap, a large multilingual dataset covering 57 programming languages. This dataset enables researchers to conduct fair evaluations of large language models without significant data cleaning overhead.
arXiv Detail & Related papers (2025-01-16T16:48:41Z)
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems. While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited. We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z)
SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation. Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
StarCoder 2 and The Stack v2: The Next Generation [105.93298676368798]
We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens. We thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. Our large model, StarCoder2- 15B, significantly outperforms other models of comparable size.
arXiv Detail & Related papers (2024-02-29T13:53:35Z)
Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data [89.2410799619405]
We introduce the Quantitative Reasoning with Data benchmark to evaluate Large Language Models' capability in statistical and causal reasoning with real-world data. The benchmark comprises a dataset of 411 questions accompanied by data sheets from textbooks, online learning materials, and academic papers. To compare models' quantitative reasoning abilities on data and text, we enrich the benchmark with an auxiliary set of 290 text-only questions, namely QRText.
arXiv Detail & Related papers (2024-02-27T16:15:03Z)
Trained Without My Consent: Detecting Code Inclusion In Language Models Trained on Code [13.135962181354465]
Code auditing ensures that developed code adheres to standards, regulations, and copyright protection. The recent advent of Large Language Models (LLMs) as coding assistants in the software development process poses new challenges for code auditing. We propose TraWiC; a model-agnostic and interpretable method for detecting code inclusion in an LLM's training dataset.
arXiv Detail & Related papers (2024-02-14T16:41:35Z)
Traces of Memorisation in Large Language Models for Code [16.125924759649106]
Large language models for code are commonly trained on large unsanitised corpora of source code scraped from the internet. We compare the rate of memorisation with large language models trained on natural language. We find that large language models for code are vulnerable to data extraction attacks, like their natural language counterparts.
arXiv Detail & Related papers (2023-12-18T19:12:58Z)
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI [41.32981860191232]
Legal and machine learning experts to systematically audit and trace 1800+ text datasets. Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets. frequent miscategorization of licenses on widely used dataset hosting sites, with license omission of 70%+ and error rates of 50%+.
arXiv Detail & Related papers (2023-10-25T17:20:26Z)
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed. We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z)
The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation [5.2510537676167335]
We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages. Our evaluations show that when fine-tuning Code Large Language Models on The Vault, such models outperform the same models trained on other datasets such as CodeSearchNet.
arXiv Detail & Related papers (2023-05-09T09:35:03Z)
The (ab)use of Open Source Code to Train Large Language Models [0.8122270502556374]
We discuss the security, privacy, and licensing implications of memorization. We argue why the use of copyleft code to train LLMs is a legal and ethical dilemma.
arXiv Detail & Related papers (2023-02-27T11:34:53Z)
Deduplicating Training Data Makes Language Models Better [50.22588162039083]
Existing language modeling datasets contain many near-duplicate examples and long repetitives. Over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets.
arXiv Detail & Related papers (2021-07-14T06:06:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.