PETLP: A Privacy-by-Design Pipeline for Social Media Data in AI Research
- URL: http://arxiv.org/abs/2508.09232v2
- Date: Thu, 16 Oct 2025 07:38:09 GMT
- Title: PETLP: A Privacy-by-Design Pipeline for Social Media Data in AI Research
- Authors: Nick Oh, Giorgos D. Vrakas, Siân J. M. Brooke, Sasha Morinière, Toju Duke
- Abstract summary: PETLP (Privacy-by-design Extract, Transform, Load, and Present) is a compliance framework that embeds legal safeguards directly into extended pipelines. We demonstrate how extraction rights fundamentally differ between qualifying research organisations. We show why true anonymisation remains unachievable for social media data.
- Score: 2.185322080975722
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Social media data presents AI researchers with overlapping obligations under the GDPR, copyright law, and platform terms -- yet existing frameworks fail to integrate these regulatory domains, leaving researchers without unified guidance. We introduce PETLP (Privacy-by-design Extract, Transform, Load, and Present), a compliance framework that embeds legal safeguards directly into extended ETL pipelines. Central to PETLP is treating Data Protection Impact Assessments as living documents that evolve from pre-registration through dissemination. Through systematic Reddit analysis, we demonstrate how extraction rights fundamentally differ between qualifying research organisations (who can invoke DSM Article 3 to override platform restrictions) and commercial entities (bound by terms of service), whilst GDPR obligations apply universally. We demonstrate why true anonymisation remains unachievable for social media data and expose the legal gap between permitted dataset creation and uncertain model distribution. By structuring compliance decisions into practical workflows and simplifying institutional data management plans, PETLP enables researchers to navigate regulatory complexity with confidence, bridging the gap between legal requirements and research practice.
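To make the extended ETL structure concrete, below is a minimal sketch of how a PETLP-style pipeline might be organised in code, with the Data Protection Impact Assessment threaded through every stage as a living record. The stage functions, log entries, and DPIARecord type are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a PETLP-style pipeline; all names and log entries
# are illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass, field


@dataclass
class DPIARecord:
    """The DPIA as a living document, updated at every stage."""
    entries: list[tuple[str, str]] = field(default_factory=list)

    def log(self, stage: str, decision: str) -> None:
        self.entries.append((stage, decision))


def extract(query: str, dpia: DPIARecord) -> list[dict]:
    # Record the legal basis for extraction, e.g. DSM Article 3 for a
    # qualifying research organisation vs. platform terms otherwise.
    dpia.log("extract", f"collected posts for {query!r} under DSM Art. 3")
    return [{"id": "t3_abc", "text": "example post", "author": "someone"}]


def transform(posts: list[dict], dpia: DPIARecord) -> list[dict]:
    # Pseudonymise rather than claim anonymisation, which the paper
    # argues is unachievable for social media data.
    dpia.log("transform", "dropped author field; hashed post IDs")
    return [{"id": hash(p["id"]), "text": p["text"]} for p in posts]


def load(posts: list[dict], dpia: DPIARecord) -> list[dict]:
    dpia.log("load", "stored in access-controlled repository")
    return posts


def present(posts: list[dict], dpia: DPIARecord) -> None:
    # Dissemination choices (dataset vs. model release) are logged too.
    dpia.log("present", "released aggregate statistics only")


dpia = DPIARecord()
present(load(transform(extract("climate", dpia), dpia), dpia), dpia)
print(dpia.entries)  # the audit trail, from extraction to presentation
```

Passing the DPIA through each stage as an ordinary value means no stage can run without updating the compliance record, so the audit trail becomes a by-product of running the pipeline.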
Related papers
- Compliance Management for Federated Data Processing [1.3836910960262496]
Federated data processing (FDP) offers a promising approach for enabling collaborative analysis of sensitive data without centralizing raw datasets. We present a framework for compliance-aware FDP that integrates policy-as-code, workflow orchestration, and large language model (LLM)-assisted compliance management.
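The policy-as-code idea can be sketched as rules declared as plain data and checked before a federated job runs; the rule names and job fields below are hypothetical, not the paper's schema.

```python
# Hypothetical policy-as-code gate for a federated data-processing job:
# policies are plain data, and a job runs only if every rule passes.
POLICIES = [
    {"rule": "purpose_limitation", "allowed_purposes": {"research"}},
    {"rule": "no_raw_export", "allowed_outputs": {"aggregate", "model_update"}},
]


def check(job: dict) -> list[str]:
    """Return a list of violations; an empty list means the job may run."""
    violations = []
    for policy in POLICIES:
        if (policy["rule"] == "purpose_limitation"
                and job["purpose"] not in policy["allowed_purposes"]):
            violations.append("purpose not permitted")
        if (policy["rule"] == "no_raw_export"
                and job["output"] not in policy["allowed_outputs"]):
            violations.append("job would export raw records")
    return violations


print(check({"purpose": "research", "output": "aggregate"}) or "compliant")
```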
arXiv Detail & Related papers (2026-02-22T22:10:25Z)
- LegalOne: A Family of Foundation Models for Reliable Legal Reasoning [54.57434222018289]
We present LegalOne, a family of foundational models specifically tailored for the Chinese legal domain. LegalOne is developed through a comprehensive three-phase pipeline designed to master legal reasoning. We publicly release the LegalOne weights and the LegalKit evaluation framework to advance the field of Legal AI.
arXiv Detail & Related papers (2026-01-31T10:18:32Z)
- Global AI Governance Overview: Understanding Regulatory Requirements Across Global Jurisdictions [0.0]
The rapid advancement of general-purpose AI models has increased concerns about copyright infringement in training data. This paper examines the regulatory landscape of AI training data governance in major jurisdictions, including the EU, the United States, and the Asia-Pacific region. It also identifies critical gaps in enforcement mechanisms that threaten both creator rights and the sustainability of AI development.
arXiv Detail & Related papers (2025-11-26T13:59:11Z)
- Analyzing and Internalizing Complex Policy Documents for LLM Agents [53.14898416858099]
Large Language Model (LLM)-based agentic systems rely on in-context policy documents encoding diverse business rules. This motivates developing internalization methods that embed policy documents into model priors while preserving performance. We introduce CC-Gen, an agentic benchmark generator with Controllable Complexity across four levels.
arXiv Detail & Related papers (2025-10-13T16:30:07Z)
- Policy-Driven AI in Dataspaces: Taxonomy, Explainability, and Pathways for Compliant Innovation [1.6766200616088744]
This paper provides a comprehensive review of privacy-preserving and policy-aware AI techniques. We propose a novel taxonomy to classify these techniques based on privacy levels, impacts, and compliance complexity. By connecting technical, ethical, and regulatory perspectives, this work lays the groundwork for developing trustworthy, efficient, and compliant AI systems in dataspaces.
arXiv Detail & Related papers (2025-07-26T17:07:01Z)
- The Accountability Paradox: How Platform API Restrictions Undermine AI Transparency Mandates [0.0]
API restrictions on major social media platforms challenge compliance with the EU Digital Services Act [20], which mandates data access for algorithmic transparency. We develop a structured audit framework to assess the growing misalignment between regulatory requirements and platform implementations. We propose targeted policy interventions aligned with the AI Risk Management Framework of the National Institute of Standards and Technology.
arXiv Detail & Related papers (2025-05-16T14:30:20Z)
- Lawful and Accountable Personal Data Processing with GDPR-based Access and Usage Control in Distributed Systems [0.0]
This paper proposes a case-generic method for automated normative reasoning that establishes legal arguments for the lawfulness of data processing activities. The arguments are established on the basis of case-specific legal qualifications made by privacy experts, bringing the human in the loop. The resulting system is designed and critically assessed in reference to requirements extracted from the GDPR.
arXiv Detail & Related papers (2025-03-10T10:49:34Z)
- Demystifying Legalese: An Automated Approach for Summarizing and Analyzing Overlaps in Privacy Policies and Terms of Service [0.6240153531166704]
Our work seeks to alleviate this issue by developing language models that provide automated, accessible summaries and scores for such documents.
We compared transformer-based and conventional models during training on our dataset, and RoBERTa performed better overall with a remarkable 0.74 F1-score.
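For orientation, scoring a clause with a RoBERTa sequence classifier via the Hugging Face transformers library looks roughly like the snippet below; the checkpoint, two-label scheme, and example clause are placeholders rather than the authors' released model or dataset.

```python
# Generic sketch of scoring a policy clause with a RoBERTa classifier
# (placeholder checkpoint and labels; not the authors' released model).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2  # e.g. benign vs. concerning clause
)

clause = "We may share your personal data with third parties."
inputs = tokenizer(clause, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities for this clause
```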
arXiv Detail & Related papers (2024-04-17T19:53:59Z)
- Federated Learning Priorities Under the European Union Artificial Intelligence Act [68.44894319552114]
We perform a first-of-its-kind interdisciplinary analysis (legal and ML) of the impact the AI Act may have on Federated Learning.
We explore data governance issues and the concern for privacy.
Most noteworthy are the opportunities to defend against data bias and enhance private and secure computation.
arXiv Detail & Related papers (2024-02-05T19:52:19Z)
- Auditing and Generating Synthetic Data with Controllable Trust Trade-offs [54.262044436203965]
We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models.
It focuses on preventing bias and discrimination, ensuring fidelity to the source data, and assessing utility, robustness, and privacy preservation.
We demonstrate the framework's effectiveness by auditing various generative models across diverse use cases.
arXiv Detail & Related papers (2023-04-21T09:03:18Z)
- Is Vertical Logistic Regression Privacy-Preserving? A Comprehensive Privacy Analysis and Beyond [57.10914865054868]
We consider vertical logistic regression (VLR) trained with mini-batch gradient descent.
We provide a comprehensive and rigorous privacy analysis of VLR in a class of open-source Federated Learning frameworks.
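To see which intermediate values such an analysis scrutinises, here is a plain NumPy sketch of one mini-batch gradient-descent step of vertical logistic regression with no cryptographic protection; in deployed frameworks the exchanged quantities would be encrypted or masked.

```python
# Plain NumPy sketch of one mini-batch step of vertical logistic
# regression. Two parties hold disjoint feature columns of the same
# samples; the exchanged intermediates (partial logits, residuals) are
# what a privacy analysis of VLR scrutinises. No protection applied here.
import numpy as np

rng = np.random.default_rng(0)
n, dA, dB = 32, 3, 2                      # batch size, feature splits
XA, XB = rng.normal(size=(n, dA)), rng.normal(size=(n, dB))
y = rng.integers(0, 2, size=n)            # labels held by party A
wA, wB, lr = np.zeros(dA), np.zeros(dB), 0.1

zA = XA @ wA                              # party A's partial logit
zB = XB @ wB                              # party B's partial logit, sent to A
p = 1.0 / (1.0 + np.exp(-(zA + zB)))      # combined prediction
residual = p - y                          # sent back to B for its gradient
wA -= lr * XA.T @ residual / n
wB -= lr * XB.T @ residual / n            # B updates using the shared residual
```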
arXiv Detail & Related papers (2022-07-19T05:47:30Z)
- Having your Privacy Cake and Eating it Too: Platform-supported Auditing of Social Media Algorithms for Public Interest [70.02478301291264]
Social media platforms curate access to information and opportunities, and so play a critical role in shaping public discourse.
Prior studies have used black-box methods to show that these algorithms can lead to biased or discriminatory outcomes.
We propose a new method for platform-supported auditing that can meet the goals of the proposed legislation.
arXiv Detail & Related papers (2022-07-18T17:32:35Z)
- Distributed Machine Learning and the Semblance of Trust [66.1227776348216]
Federated Learning (FL) allows the data owner to maintain data governance and perform model training locally without having to share their data.
FL and related techniques are often described as privacy-preserving.
We explain why this term is not appropriate and outline the risks associated with over-reliance on protocols that were not designed with formal definitions of privacy in mind.
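The concern is easiest to see in a minimal federated-averaging round (hypothetical, NumPy only): raw data never leaves the clients, yet the model updates they share carry information about that data and come with no formal privacy guarantee.

```python
# Minimal federated-averaging round in NumPy (illustrative only).
# Raw data stays local, but the model updates that clients share can
# still leak information -- no formal privacy guarantee is provided.
import numpy as np

def local_update(w: np.ndarray, X: np.ndarray, y: np.ndarray, lr=0.1):
    grad = X.T @ (X @ w - y) / len(y)     # local least-squares gradient
    return w - lr * grad                  # only this update leaves the client

rng = np.random.default_rng(1)
w_global = np.zeros(4)
clients = [(rng.normal(size=(20, 4)), rng.normal(size=20)) for _ in range(3)]

updates = [local_update(w_global, X, y) for X, y in clients]
w_global = np.mean(updates, axis=0)       # server sees updates, never raw data
```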
arXiv Detail & Related papers (2021-12-21T08:44:05Z)
- Learning to Limit Data Collection via Scaling Laws: Data Minimization Compliance in Practice [62.44110411199835]
We build on literature in machine learning and law to propose a framework for limiting collection, based on an interpretation of data minimization that ties data collection to system performance.
We formalize a data minimization criterion based on performance curve derivatives and provide an effective and interpretable piecewise power law technique.
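A toy version of such a stopping rule, under the assumed form L(n) = a·n^(-b) + c, fits a power law to observed performance and stops collecting once the fitted curve's derivative falls below a threshold; the functional form, data, and threshold below are illustrative, not the paper's exact procedure.

```python
# Toy sketch of a data-minimisation stopping rule: fit a power law to
# error vs. data size and stop collecting once the fitted curve's slope
# falls below a threshold. Illustrative assumptions throughout.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** (-b) + c              # error decays as a power of n

sizes = np.array([100, 200, 400, 800, 1600])
errors = np.array([0.30, 0.24, 0.20, 0.17, 0.155])  # held-out performance

(a, b, c), _ = curve_fit(power_law, sizes, errors, p0=(1.0, 0.5, 0.1))
slope = lambda n: -a * b * n ** (-b - 1)  # derivative of the fitted curve

n = 1600
while abs(slope(n)) > 1e-5:               # marginal benefit still material?
    n *= 2                                # otherwise, collect more data
print(f"stop collecting at roughly n = {n}")
```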
arXiv Detail & Related papers (2021-07-16T19:59:01Z)