Copyright in AI Pre-Training Data Filtering: Regulatory Landscape and Mitigation Strategies
- URL: http://arxiv.org/abs/2512.02047v1
- Date: Wed, 26 Nov 2025 14:02:45 GMT
- Title: Copyright in AI Pre-Training Data Filtering: Regulatory Landscape and Mitigation Strategies
- Authors: Mariia Kyrychenko, Mykyta Mudryi, Markiyan Chaklosh
- Abstract summary: The rapid advancement of general-purpose AI models has increased concerns about copyright infringement in training data. This paper examines the regulatory landscape of AI training data governance in major jurisdictions, including the EU, the United States, and the Asia-Pacific region. It also identifies critical gaps in enforcement mechanisms that threaten both creator rights and the sustainability of AI development.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of general-purpose AI models has increased concerns about copyright infringement in training data, yet current regulatory frameworks remain predominantly reactive rather than proactive. This paper examines the regulatory landscape of AI training data governance in major jurisdictions, including the EU, the United States, and the Asia-Pacific region. It also identifies critical gaps in enforcement mechanisms that threaten both creator rights and the sustainability of AI development. Through analysis of major cases, we identify critical gaps in pre-training data filtering. Existing solutions such as transparency tools, perceptual hashing, and access control mechanisms address only specific aspects of the problem and cannot prevent initial copyright violations. We identify two fundamental challenges: pre-training license collection and content filtering, which faces the impossibility of comprehensive copyright management at scale, and verification mechanisms, which lack tools to confirm that filtering prevented infringement. We propose a multilayered filtering pipeline that combines access control, content verification, machine learning classifiers, and continuous database cross-referencing to shift copyright protection from post-training detection to pre-training prevention. This approach offers a pathway toward protecting creator rights while enabling continued AI innovation.
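The multilayered pipeline described in the abstract can be sketched as a chain of rejection stages, each of which can exclude a document before it reaches the training corpus. The stage functions, license allowlist, and heuristics below are illustrative assumptions, not the paper's actual implementation; in particular, a production system would use perceptual hashing and a trained classifier where this sketch uses exact hashing and a keyword check.

```python
import hashlib

# Hypothetical license allowlist for the access-control stage.
ALLOWED_LICENSES = {"cc0", "cc-by-4.0", "public-domain"}

def stage_access_control(doc: dict) -> bool:
    """Reject documents without an explicitly permissive license tag."""
    return doc.get("license", "").lower() in ALLOWED_LICENSES

def stage_content_verification(doc: dict, known_hashes: set) -> bool:
    """Reject exact duplicates of registered copyrighted works.
    (A real system would use perceptual hashing for near-duplicates.)"""
    digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
    return digest not in known_hashes

def stage_classifier(doc: dict) -> bool:
    """Stand-in for an ML classifier scoring infringement risk;
    here just a keyword heuristic for rights-reserved notices."""
    return "all rights reserved" not in doc["text"].lower()

def filter_corpus(docs, known_hashes):
    """Keep only documents that pass every filtering stage."""
    kept = []
    for doc in docs:
        if (stage_access_control(doc)
                and stage_content_verification(doc, known_hashes)
                and stage_classifier(doc)):
            kept.append(doc)
    return kept
```

The design point the sketch illustrates is that stages are ordered from cheapest (a license-tag lookup) to most expensive (classification), so most rejections happen before costly checks run, and the continuous database cross-referencing the paper proposes would correspond to refreshing `known_hashes` over time.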
Related papers
- Global AI Governance Overview: Understanding Regulatory Requirements Across Global Jurisdictions [0.0]
The rapid advancement of general-purpose AI models has increased concerns about copyright infringement in training data. This paper examines the regulatory landscape of AI training data governance in major jurisdictions, including the EU, the United States, and the Asia-Pacific region. It also identifies critical gaps in enforcement mechanisms that threaten both creator rights and the sustainability of AI development.
arXiv Detail & Related papers (2025-11-26T13:59:11Z) - MAIF: Enforcing AI Trust and Provenance with an Artifact-Centric Agentic Paradigm [0.5495755145898128]
Current AI systems operate on opaque data structures that lack the audit trails, provenance tracking, or explainability required by emerging regulations like the EU AI Act. We propose an artifact-centric AI agent paradigm where behavior is driven by persistent, verifiable data artifacts rather than ephemeral tasks. A production-ready implementation demonstrates ultra-high-speed streaming (2,720.7 MB/s), optimized video processing (1,342 MB/s), and enterprise-grade security.
arXiv Detail & Related papers (2025-11-19T04:10:32Z) - SWAP: Towards Copyright Auditing of Soft Prompts via Sequential Watermarking [58.475471437150674]
We propose sequential watermarking for soft prompts (SWAP). SWAP encodes watermarks through a specific order of defender-specified out-of-distribution classes. Experiments on 11 datasets demonstrate SWAP's effectiveness, harmlessness, and robustness against potential adaptive attacks.
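The verification step suggested by the SWAP summary above can be sketched as follows: the defender probes a suspect model with chosen inputs and checks whether the predicted out-of-distribution classes appear in the defender-specified order. The stub model interface and the match threshold are illustrative assumptions; SWAP's actual watermark encoding and statistical test are more involved.

```python
def verify_sequential_watermark(model, probe_inputs, expected_classes,
                                threshold=1.0):
    """Return True if enough probes reproduce the defender's class
    sequence, position by position."""
    matches = sum(
        model(x) == c for x, c in zip(probe_inputs, expected_classes)
    )
    return matches / len(expected_classes) >= threshold
```

A threshold below 1.0 would tolerate a few mismatched probes, trading false negatives against robustness to the adaptive attacks the paper evaluates.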
arXiv Detail & Related papers (2025-11-05T13:48:48Z) - Anti-Regulatory AI: How "AI Safety" is Leveraged Against Regulatory Oversight [0.9883261192383612]
AI companies increasingly develop and deploy privacy-enhancing technologies, bias-constraining measures, evaluation frameworks, and alignment techniques. This paper examines the ulterior function of these technologies as mechanisms of legal influence.
arXiv Detail & Related papers (2025-09-26T19:35:10Z) - Rethinking Data Protection in the (Generative) Artificial Intelligence Era [138.07763415496288]
We propose a four-level taxonomy that captures the diverse protection needs arising in modern (generative) AI models and systems. Our framework offers a structured understanding of the trade-offs between data utility and control, spanning the entire AI pipeline.
arXiv Detail & Related papers (2025-07-03T02:45:51Z) - CoTGuard: Using Chain-of-Thought Triggering for Copyright Protection in Multi-Agent LLM Systems [55.57181090183713]
We introduce CoTGuard, a novel framework for copyright protection that leverages trigger-based detection within Chain-of-Thought reasoning. Specifically, by embedding specific trigger queries into agent prompts, we can activate specific CoT segments and monitor intermediate reasoning steps for unauthorized content reproduction. This approach enables fine-grained, interpretable detection of copyright violations in collaborative agent scenarios.
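The monitoring idea in the CoTGuard summary above can be sketched simply: after a trigger query activates a reasoning trace, each intermediate step is scanned for long verbatim spans of registered protected text. The substring check and minimum-length cutoff are simplifying assumptions for illustration, not CoTGuard's actual detection mechanism.

```python
def scan_reasoning_trace(reasoning_steps, protected_snippets, min_len=20):
    """Return (step_index, snippet) pairs where a protected span of at
    least min_len characters is reproduced verbatim in a reasoning step."""
    hits = []
    for i, step in enumerate(reasoning_steps):
        for snippet in protected_snippets:
            if len(snippet) >= min_len and snippet in step:
                hits.append((i, snippet))
    return hits
```

Reporting the step index, rather than only a binary flag, is what makes the detection fine-grained and interpretable in the sense the summary describes: it localizes which reasoning step reproduced the content.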
arXiv Detail & Related papers (2025-05-26T01:42:37Z) - Governing AI Beyond the Pretraining Frontier [0.0]
This year, jurisdictions worldwide, including the United States, the European Union, the United Kingdom, and China, are set to enact or revise laws governing frontier AI. Yet growing evidence suggests that this "pretraining paradigm" may be hitting a wall, and major AI companies are turning to alternative approaches. This essay seeks to identify these challenges and point to new paths forward for regulation.
arXiv Detail & Related papers (2025-01-27T16:25:03Z) - Position: Mind the Gap-the Growing Disconnect Between Established Vulnerability Disclosure and AI Security [56.219994752894294]
We argue that adapting existing processes for AI security reporting is doomed to fail due to fundamental mismatches with the distinctive characteristics of AI systems. Based on our proposal to address these shortcomings, we discuss an approach to AI security reporting and how the new AI paradigm, AI agents, will further reinforce the need for specialized AI security incident reporting.
arXiv Detail & Related papers (2024-12-19T13:50:26Z) - Uncertain Boundaries: Multidisciplinary Approaches to Copyright Issues in Generative AI [2.2780130786778665]
Generative AI models generating near-replicas of copyrighted material highlight the need to adapt current legal frameworks. Most existing research on copyright in AI takes a purely computer science or law-based approach. This survey adopts a comprehensive approach, synthesizing insights from law, policy, economics, and computer science.
arXiv Detail & Related papers (2024-03-31T22:10:01Z) - A Survey and Comparative Analysis of Security Properties of CAN Authentication Protocols [92.81385447582882]
The Controller Area Network (CAN) bus leaves in-vehicle communications inherently insecure.
This paper reviews and compares the 15 most prominent authentication protocols for the CAN bus.
We evaluate protocols based on essential operational criteria that contribute to ease of implementation.
arXiv Detail & Related papers (2024-01-19T14:52:04Z) - The risks of risk-based AI regulation: taking liability seriously [46.90451304069951]
The development and regulation of AI seems to have reached a critical stage.
Some experts are calling for a moratorium on the training of AI systems more powerful than GPT-4.
This paper analyses the most advanced legal proposal, the European Union's AI Act.
arXiv Detail & Related papers (2023-11-03T12:51:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.