Copyright Detection in Large Language Models: An Ethical Approach to Generative AI Development
- URL: http://arxiv.org/abs/2511.20623v1
- Date: Tue, 25 Nov 2025 18:46:14 GMT
- Title: Copyright Detection in Large Language Models: An Ethical Approach to Generative AI Development
- Authors: David Szczecina, Senan Gaffori, Edmond Li
- Abstract summary: This paper introduces an open-source copyright detection platform that enables content creators to verify whether their work was used in training datasets. With an intuitive user interface and scalable backend, this framework contributes to increasing transparency in AI development and ethical compliance.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The widespread use of Large Language Models (LLMs) raises critical concerns regarding the unauthorized inclusion of copyrighted content in training data. Existing detection frameworks, such as DE-COP, are computationally intensive and largely inaccessible to independent creators. As legal scrutiny increases, there is a pressing need for a scalable, transparent, and user-friendly solution. This paper introduces an open-source copyright detection platform that enables content creators to verify whether their work was used in LLM training datasets. Our approach enhances existing methodologies by facilitating ease of use, improving similarity detection, optimizing dataset validation, and reducing computational overhead by 10-30% with efficient API calls. With an intuitive user interface and scalable backend, this framework contributes to increasing transparency in AI development and ethical compliance, laying the foundation for further research in responsible AI development and copyright enforcement.
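The abstract does not specify how its similarity detection works; as an illustration only, a minimal membership-style check can measure how much of a creator's passage a model reproduces verbatim in its completions. The function names and the choice of word-level 8-gram containment below are assumptions for this sketch, not the authors' implementation.

```python
def ngrams(text: str, n: int = 8) -> set:
    """Lowercased word n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_score(original: str, model_output: str, n: int = 8) -> float:
    """Fraction of the original's n-grams reproduced verbatim in the model output.

    A score near 1.0 means the model can regurgitate the passage, which is
    (weak) evidence that the passage appeared in its training data.
    """
    source = ngrams(original, n)
    if not source:
        return 0.0
    return len(source & ngrams(model_output, n)) / len(source)
```

In practice such a score would be computed over many prompted continuations and compared against a baseline from texts known to postdate the model's training cutoff, since fluent paraphrase evades verbatim n-gram matching entirely.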
Related papers
- Algorithmic Identity Based on Metaparameters: A Path to Reliability, Auditability, and Traceability [0.0]
The use of algorithms is increasing across various fields such as healthcare, justice, finance, and education. This article explores the potential of the Digital Object Identifier (DOI) to identify algorithms. The use of DOIs facilitates tracking the origin of algorithms, enables audits, prevents biases, promotes research, and strengthens ethical considerations.
arXiv Detail & Related papers (2026-01-21T07:35:14Z) - Executable Knowledge Graphs for Replicating AI Research [65.41207324831583]
Executable Knowledge Graphs (xKG) is a modular and pluggable knowledge base that automatically integrates technical insights, code snippets, and domain-specific knowledge extracted from scientific literature. Code will be released at https://github.com/zjunlp/xKG.
arXiv Detail & Related papers (2025-10-20T17:53:23Z) - ACT: Bridging the Gap in Code Translation through Synthetic Data Generation & Adaptive Training [1.4709455282157278]
Auto-Train for Code Translation (ACT) aims to improve code translation capabilities by enabling in-house finetuning of open-source Large Language Models (LLMs). ACT's automated pipeline significantly boosts the performance of these models, narrowing the gap between open-source accessibility and the high performance of closed-source solutions. Our results demonstrate that ACT consistently enhances the effectiveness of open-source models, offering businesses and developers a secure and reliable alternative.
arXiv Detail & Related papers (2025-07-22T11:35:35Z) - Does Machine Unlearning Truly Remove Knowledge? [80.83986295685128]
We introduce a comprehensive auditing framework for unlearning evaluation comprising three benchmark datasets, six unlearning algorithms, and five prompt-based auditing methods. We evaluate the effectiveness and robustness of different unlearning strategies.
arXiv Detail & Related papers (2025-05-29T09:19:07Z) - Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining [67.87810796668981]
Iris introduces Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL). Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations. These improvements translate to significant gains in both web and OS agent downstream tasks.
arXiv Detail & Related papers (2024-12-13T18:40:10Z) - CAP: Detecting Unauthorized Data Usage in Generative Models via Prompt Generation [1.6141139250981018]
Copyright Audit via Prompts generation (CAP) is a framework for automatically testing whether an ML model has been trained with unauthorized data.
Specifically, we devise an approach to generate suitable keys inducing the model to reveal copyrighted contents.
To prove its effectiveness, we conducted an extensive evaluation campaign on measurements collected in four IoT scenarios.
arXiv Detail & Related papers (2024-10-08T08:49:41Z) - Evaluating Copyright Takedown Methods for Language Models [100.38129820325497]
Language models (LMs) derive their capabilities from extensive training on diverse data, including potentially copyrighted material.
This paper introduces the first evaluation of the feasibility and side effects of copyright takedowns for LMs.
We examine several strategies, including adding system prompts, decoding-time filtering interventions, and unlearning approaches.
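Of the takedown strategies listed, decoding-time filtering is the most mechanical: at each generation step, any token that would extend a protected n-gram is masked out before sampling. A minimal sketch, assuming a tokenized blocklist of copyrighted n-grams (the function name and the 5-gram size are illustrative, not from the paper):

```python
def forbidden_next_tokens(prefix: list, blocklist: set, n: int = 5) -> set:
    """Tokens that would complete a blocked n-gram given the current prefix.

    `blocklist` holds n-token tuples drawn from protected text; a decoder
    would set the logits of the returned tokens to -inf before sampling.
    """
    if len(prefix) < n - 1:
        return set()
    context = tuple(prefix[-(n - 1):])
    return {ng[-1] for ng in blocklist if ng[:-1] == context}
```

A filter like this blocks only verbatim reproduction; paraphrased output passes through untouched, which is one reason such interventions have measurable side effects on utility.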
arXiv Detail & Related papers (2024-06-26T18:09:46Z) - Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts. Existing approaches require re-training models on different data subsets, which is computationally intensive. This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z) - JAMDEC: Unsupervised Authorship Obfuscation using Constrained Decoding over Small Language Models [53.83273575102087]
We propose an unsupervised inference-time approach to authorship obfuscation.
We introduce JAMDEC, a user-controlled, inference-time algorithm for authorship obfuscation.
Our approach builds on small language models such as GPT2-XL in order to help avoid disclosing the original content to proprietary LLMs' APIs.
arXiv Detail & Related papers (2024-02-13T19:54:29Z) - A Dataset and Benchmark for Copyright Infringement Unlearning from Text-to-Image Diffusion Models [52.49582606341111]
Copyright law confers creators the exclusive rights to reproduce, distribute, and monetize their creative works.
Recent progress in text-to-image generation has introduced formidable challenges to copyright enforcement.
We introduce a novel pipeline that harmonizes CLIP, ChatGPT, and diffusion models to curate a dataset.
arXiv Detail & Related papers (2024-01-04T11:14:01Z) - Digger: Detecting Copyright Content Mis-usage in Large Language Model Training [23.99093718956372]
We introduce a framework designed to detect and assess the presence of content from potentially copyrighted books within the training datasets of Large Language Models (LLMs).
This framework also provides a confidence estimation for the likelihood of each content sample's inclusion.
arXiv Detail & Related papers (2024-01-01T06:04:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.