The whos, whats, and whys of issues related to personal data and data protection in open-source projects on GitHub
- URL: http://arxiv.org/abs/2304.06367v2
- Date: Tue, 14 Oct 2025 15:30:01 GMT
- Title: The whos, whats, and whys of issues related to personal data and data protection in open-source projects on GitHub
- Authors: Anne Henning, Lukas Schulte, Steffen Herbold, Oksana Kulyk, Peter Mayer,
- Abstract summary: We use inductive coding to analyze 652 issues from Open Source GitHub projects. We observed a significant increase in reporting when data protection regulations came into effect. All in all, our findings indicate that data protection regulations effectively start discussions about privacy within the software development community.
- Score: 6.733786687734259
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data protection regulations such as the General Data Protection Regulation (GDPR) in the European Union and the California Consumer Privacy Act (CCPA) in the US affect how software may handle the personal data of its users. Prior literature focused on how data protection regulations are discussed for software in operation, or how this topic is discussed in various channels outside of the software development process. Yet, what is missing, is a perspective on the impact of such regulations on the software development process. In our work, we address this gap, and explore how discussions during the development of software are impacted by regulations, who reports and discusses issues related to personal data and data protection, and how developers react to those issues. To that end, we used inductive coding to analyze 652 issues from Open Source GitHub projects and used the codes to quantitatively analyze the relation between the roles, resolutions, and data protection issues to understand correlations and predict resolutions of issues. Most notably we observed a significant increase in reporting when GDPR came into effect. The most common issue types were feature requests for privacy enhancement, which were mainly reported and discussed by frequent reporters and frequent committers. But especially issues regarding privacy enhancement were also frequently reported by one-time reporters. Most of the requests were solved without opposing votes. All in all, our findings indicate that data protection regulations effectively start discussions about privacy within the software development community.
Related papers
- Analyzing developer discussions on EU and US privacy legislation compliance in GitHub repositories [12.041470749136488]
The EU General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) have forced the community to focus on users' data privacy. Despite the vast number of developer issues available in GitHub repositories, there is a lack of empirical evidence on the issues developers of Open Source Software face in complying with privacy legislation. We devised 24 discussion categories placed in six clusters: features/bugs, consent-related, documentation, data/sharing, storing, and general compliance.
arXiv Detail & Related papers (2025-12-11T13:16:20Z) - "I need to learn better searching tactics for privacy policy laws." Investigating Software Developers' Behavior When Using Sources on Privacy Issues [8.662963983664223]
Our study highlights major shortcomings in existing support for privacy-related development tasks. Based on our findings, we discuss the need for more accessible, understandable, and actionable privacy resources for developers.
arXiv Detail & Related papers (2025-11-11T09:58:06Z) - Evolution of repositories and privacy laws: commit activities in the GDPR and CCPA era [2.1331883629523634]
We analyzed 37,213 commits from 12,391 repositories since 2016, while 594 commits from the 70 most popular repositories were manually analyzed. We observe that most commits were made in the year the law came into effect and that privacy-relevant terms appear in the commit messages. The study showed that more educational activities on privacy user rights are needed, as well as tools for privacy recommendations.
arXiv Detail & Related papers (2025-05-28T11:10:58Z) - Protecting Privacy in Software Logs: What Should Be Anonymized? [12.980238412281471]
The presence of sensitive information in software logs poses significant privacy concerns. This study offers a comprehensive analysis of privacy in software logs from multiple perspectives. Our findings shed light on various perspectives of log privacy and reveal industry challenges.
arXiv Detail & Related papers (2024-09-17T16:12:23Z) - PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action [54.11479432110771]
PrivacyLens is a novel framework designed to extend privacy-sensitive seeds into expressive vignettes and further into agent trajectories. We instantiate PrivacyLens with a collection of privacy norms grounded in privacy literature and crowdsourced seeds. State-of-the-art LMs, like GPT-4 and Llama-3-70B, leak sensitive information in 25.68% and 38.69% of cases, even when prompted with privacy-enhancing instructions.
arXiv Detail & Related papers (2024-08-29T17:58:38Z) - An Exploratory Mixed-Methods Study on General Data Protection Regulation (GDPR) Compliance in Open-Source Software [4.2610816955137]
The European Union's General Data Protection Regulation requires software developers to meet privacy requirements when interacting with users' data.
Prior research describes the impact of such laws on development, but only for commercial software.
arXiv Detail & Related papers (2024-06-20T20:38:33Z) - A First Look at the General Data Protection Regulation (GDPR) in Open-Source Software [4.844017045823075]
This poster describes work on regulated data protection in open-source software.
We surveyed open-source developers to understand their experiences.
We call for improved policy-related compliance resources.
arXiv Detail & Related papers (2024-01-26T03:49:13Z) - PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z) - A Unified View of Differentially Private Deep Generative Modeling [60.72161965018005]
Data with privacy concerns comes with stringent regulations that frequently prohibit data access and data sharing.
Overcoming these obstacles is key for technological progress in many real-world application scenarios that involve privacy sensitive data.
Differentially private (DP) data publishing provides a compelling solution, where only a sanitized form of the data is publicly released.
arXiv Detail & Related papers (2023-09-27T14:38:16Z) - Distributed Machine Learning and the Semblance of Trust [66.1227776348216]
Federated Learning (FL) allows the data owner to maintain data governance and perform model training locally without having to share their data.
FL and related techniques are often described as privacy-preserving.
We explain why this term is not appropriate and outline the risks associated with over-reliance on protocols that were not designed with formal definitions of privacy in mind.
arXiv Detail & Related papers (2021-12-21T08:44:05Z) - Decision Making with Differential Privacy under a Fairness Lens [65.16089054531395]
The U.S. Census Bureau releases data sets and statistics about groups of individuals that are used as input to a number of critical decision processes.
To conform to privacy and confidentiality requirements, these agencies are often required to release privacy-preserving versions of the data.
This paper studies the release of differentially private data sets and analyzes their impact on some critical resource allocation tasks under a fairness perspective.
arXiv Detail & Related papers (2021-05-16T21:04:19Z) - Second layer data governance for permissioned blockchains: the privacy management challenge [58.720142291102135]
In pandemic situations, such as the COVID-19 and Ebola outbreaks, sharing health data is crucial to avoid mass infection and decrease the number of deaths.
In this sense, permissioned blockchain technology emerges to empower users to exercise their rights, providing data ownership, transparency, and security through an immutable, unified, and distributed database ruled by smart contracts.
arXiv Detail & Related papers (2020-10-22T13:19:38Z)