The (ab)use of Open Source Code to Train Large Language Models
- URL: http://arxiv.org/abs/2302.13681v2
- Date: Tue, 28 Feb 2023 10:47:48 GMT
- Title: The (ab)use of Open Source Code to Train Large Language Models
- Authors: Ali Al-Kaswan and Maliheh Izadi
- Abstract summary: We discuss the security, privacy, and licensing implications of memorization.
We argue why the use of copyleft code to train LLMs is a legal and ethical dilemma.
- Score: 0.8122270502556374
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, Large Language Models (LLMs) have gained significant
popularity due to their ability to generate human-like text and their potential
applications in various fields, such as Software Engineering. LLMs for Code are
commonly trained on large unsanitized corpora of source code scraped from the
Internet. The content of these datasets is memorized and emitted by the models,
often in a verbatim manner. In this work, we will discuss the security,
privacy, and licensing implications of memorization. We argue why the use of
copyleft code to train LLMs is a legal and ethical dilemma. Finally, we provide
four actionable recommendations to address this issue.
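The memorization claim above can be made concrete with a small probe: prompt a code LLM with the opening lines of a widely mirrored open-source file and check whether greedy decoding reproduces the original continuation verbatim. The sketch below is illustrative only and is not this paper's methodology; the model name is a placeholder and the prefix is a generic GPL header.

```python
# Illustrative memorization probe (not this paper's methodology).
# The model name is a placeholder; substitute any open causal code LLM.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some-org/open-code-llm"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Prefix resembling a copyleft-licensed file that may have been scraped into training data.
prefix = (
    "/* This program is free software: you can redistribute it and/or modify\n"
    " * it under the terms of the GNU General Public License as published by\n"
)

inputs = tokenizer(prefix, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)  # greedy decoding
continuation = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# A continuation matching the original file token-for-token is direct evidence
# that licensed training data is being emitted verbatim.
print(continuation)
```

Published extraction attacks scale this idea up by sampling many prefixes and comparing continuations against the scraped corpus.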
Related papers
- Code Readability in the Age of Large Language Models: An Industrial Case Study from Atlassian [6.2250765474961405]
We conduct a survey to explore perspectives on code readability in the age of large language models (LLMs).
We evaluate HULA, our LLM-based software development agent framework, by comparing its generated code with human-written code in real-world scenarios.
Overall, the findings underscore that readability remains a critical aspect of software development.
arXiv Detail & Related papers (2025-01-20T04:11:21Z)
- A Survey on Large Language Models for Code Generation [9.555952109820392]
Large Language Models (LLMs) have achieved remarkable advances across diverse code-related tasks.
This survey aims to bridge the gap between academia and practical development by providing a comprehensive and up-to-date literature review.
arXiv Detail & Related papers (2024-06-01T17:48:15Z)
- Assessing LLMs in Malicious Code Deobfuscation of Real-world Malware Campaigns [7.776434991976473]
This paper studies the deobfuscation capabilities of large language models (LLMs).
We evaluate four LLMs on real-world malicious scripts used in the notorious Emotet malware campaign.
Our results indicate that, while not yet perfectly accurate, some LLMs can efficiently deobfuscate such payloads.
arXiv Detail & Related papers (2024-04-30T17:06:27Z)
- IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators [49.903001442804594]
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
arXiv Detail & Related papers (2024-03-06T17:52:08Z)
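As a hedged illustration of the source-to-IR pairing that the SLTrans/IRCoder entry above describes (not the authors' pipeline), one can emit textual LLVM IR for a self-contained C file with Clang and store the pair for later continued language-model training. File names below are placeholders, and clang must be on PATH.

```python
# Hedged sketch: build one (source, LLVM IR) pair of the kind SLTrans collects at scale.
# Not the IRCoder pipeline; file names are placeholders and clang is assumed available.
import subprocess
from pathlib import Path

src = Path("example.c")
src.write_text("int add(int a, int b) { return a + b; }\n")

# "clang -S -emit-llvm" produces human-readable LLVM IR instead of machine code.
subprocess.run(
    ["clang", "-S", "-emit-llvm", "-O1", str(src), "-o", "example.ll"],
    check=True,
)

pair = {"source": src.read_text(), "ir": Path("example.ll").read_text()}
# Pairs like this, gathered across many source languages, expose a Code-LM to a
# shared IR "language" during continued causal language modelling.
print(pair["ir"].splitlines()[0])
```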
- Assured LLM-Based Software Engineering [51.003878077888686]
This paper is an outline of the content of the keynote by Mark Harman at the International Workshop on Interpretability, Robustness, and Benchmarking in Neural Software Engineering, Monday 15th April 2024, Lisbon, Portugal.
arXiv Detail & Related papers (2024-02-06T20:38:46Z)
- If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code).
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
- CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model [58.127534002232096]
This paper introduces CodeFuse-13B, an open-sourced pre-trained code LLM.
It is specifically designed for code-related tasks with both English and Chinese prompts.
CodeFuse achieves its effectiveness by utilizing a high-quality pre-training dataset.
arXiv Detail & Related papers (2023-10-10T02:38:44Z)
- On the Safety of Open-Sourced Large Language Models: Does Alignment Really Prevent Them From Being Misused? [49.99955642001019]
We show that open-sourced, aligned large language models can be easily misguided into generating undesired content.
Our key idea is to directly manipulate the generation process of open-sourced LLMs to misguide them into generating such content.
arXiv Detail & Related papers (2023-10-02T19:22:01Z)
- Calculating Originality of LLM Assisted Source Code [0.0]
We propose a neural network-based tool to determine the original effort put in by students (and the LLM's contribution) when writing source code.
Our tool is motivated by minimum description length measures such as Kolmogorov complexity.
arXiv Detail & Related papers (2023-07-10T11:30:46Z)
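To make the minimum-description-length intuition behind the originality entry above concrete, a crude, hedged proxy (not the paper's neural tool) approximates Kolmogorov complexity with a general-purpose compressor and asks how much extra description a student's submission needs once a given LLM output is already known. The code strings below are made-up examples.

```python
# Hedged sketch: compression-based proxy for the MDL intuition (not the paper's neural tool).
import zlib


def c(data: str) -> int:
    """Approximate description length by compressed size in bytes."""
    return len(zlib.compress(data.encode("utf-8"), level=9))


def conditional_length(target: str, context: str) -> int:
    """Extra bytes needed to describe `target` once `context` is already known."""
    return c(context + target) - c(context)


# Made-up example inputs.
llm_output = "def add(a, b):\n    return a + b\n"
student_code = "def add(a, b):\n    # handle missing values defensively\n    return (a or 0) + (b or 0)\n"

extra = conditional_length(student_code, llm_output)
baseline = c(student_code)
# Near 1.0: mostly original effort; near 0.0: mostly explained by the LLM output.
originality = extra / baseline
print(f"originality proxy: {originality:.2f}")
```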
- WizardCoder: Empowering Code Large Language Models with Evol-Instruct [67.24653703564492]
We introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning.
Our model surpasses all other open-source Code LLMs by a substantial margin.
arXiv Detail & Related papers (2023-06-14T15:18:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.