Whose Text Is It Anyway? Exploring BigCode, Intellectual Property, and Ethics
- URL: http://arxiv.org/abs/2304.02839v1
- Date: Thu, 6 Apr 2023 03:09:26 GMT
- Title: Whose Text Is It Anyway? Exploring BigCode, Intellectual Property, and Ethics
- Authors: Madiha Zahrah Choksi and David Goedicke
- Abstract summary: This position paper probes the copyright interests of open data sets used to train large language models (LLMs).
Our paper asks, how do LLMs trained on open data sets circumvent the copyright interests of the used data?
- Score: 1.933681537640272
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Intelligent or generative writing tools rely on large language models that
recognize, summarize, translate, and predict content. This position paper
probes the copyright interests of open data sets used to train large language
models (LLMs). Our paper asks, how do LLMs trained on open data sets circumvent
the copyright interests of the used data? We start by defining software
copyright and tracing its history. We rely on GitHub Copilot as a modern case
study challenging software copyright. Our conclusion outlines obstacles that
generative writing assistants create for copyright, and offers a practical road
map for copyright analysis for developers, software law experts, and general
users to consider in the context of intelligent LLM-powered writing tools.
Related papers
- Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data? [62.72729485995075]
We investigate the effectiveness of watermarking as a deterrent against the generation of copyrighted texts.
We find that watermarking adversely affects the success rate of Membership Inference Attacks (MIAs).
We propose an adaptive technique to improve the success rate of a recent MIA under watermarking.
arXiv Detail & Related papers (2024-07-24T16:53:09Z)
- SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation [24.644101178288476]
Large Language Models (LLMs) have transformed machine learning but raised significant legal concerns due to their potential to produce text that infringes on copyrights.
This paper introduces a curated dataset to evaluate methods, test attack strategies, and propose real-time defenses to prevent the generation of copyrighted text.
arXiv Detail & Related papers (2024-06-18T18:00:03Z)
- LLMs and Memorization: On Quality and Specificity of Copyright Compliance [0.0]
Memorization in large language models (LLMs) is a growing concern.
LLMs have been shown to easily reproduce parts of their training data, including copyrighted work.
This is an important problem to solve, as it may violate existing copyright laws as well as the European AI Act.
arXiv Detail & Related papers (2024-05-28T18:01:52Z)
- Copyright Protection in Generative AI: A Technical Perspective [58.84343394349887]
Generative AI has witnessed rapid advancement in recent years, expanding its capabilities to create synthesized content such as text, images, audio, and code.
The high fidelity and authenticity of content generated by these Deep Generative Models (DGMs) have sparked significant copyright concerns.
This work delves into this issue by providing a comprehensive overview of copyright protection from a technical perspective.
arXiv Detail & Related papers (2024-02-04T04:00:33Z)
- A Dataset and Benchmark for Copyright Infringement Unlearning from Text-to-Image Diffusion Models [52.49582606341111]
Copyright law confers on creators the exclusive rights to reproduce, distribute, and monetize their creative works.
Recent progress in text-to-image generation has introduced formidable challenges to copyright enforcement.
We introduce a novel pipeline that harmonizes CLIP, ChatGPT, and diffusion models to curate a dataset.
arXiv Detail & Related papers (2024-01-04T11:14:01Z)
- Copyright Violations and Large Language Models [10.251605253237491]
This work explores the issue of copyright violations and large language models through the lens of verbatim memorization.
We present experiments with a range of language models over a collection of popular books and coding problems.
Overall, this research highlights the need for further examination and the potential impact on future developments in natural language processing to ensure adherence to copyright regulations.
arXiv Detail & Related papers (2023-10-20T19:14:59Z)
- WASA: WAtermark-based Source Attribution for Large Language Model-Generated Data [60.759755177369364]
Large language models (LLMs) generate synthetic texts with embedded watermarks that contain information about their source(s).
We propose a WAtermarking for Source Attribution (WASA) framework that satisfies key properties due to our algorithmic designs.
Our framework achieves effective source attribution and data provenance.
arXiv Detail & Related papers (2023-10-01T12:02:57Z)
- LLMDet: A Third Party Large Language Models Generated Text Detection Tool [119.0952092533317]
Texts generated by large language models (LLMs) are remarkably close to high-quality human-authored text.
Existing detection tools can only differentiate between machine-generated and human-authored text.
We propose LLMDet, a model-specific, secure, efficient, and extendable detection tool.
arXiv Detail & Related papers (2023-05-24T10:45:16Z)
- Are You Copying My Model? Protecting the Copyright of Large Language Models for EaaS via Backdoor Watermark [58.60940048748815]
Companies have begun to offer Embedding as a Service (EaaS) based on large language models (LLMs).
EaaS is vulnerable to model extraction attacks, which can cause significant losses for the owners of LLMs.
We propose an Embedding Watermark method called EmbMarker that implants backdoors on embeddings.
arXiv Detail & Related papers (2023-05-17T08:28:54Z)