An Empirical Study of Sensitive Information in Logs
- URL: http://arxiv.org/abs/2409.11313v1
- Date: Tue, 17 Sep 2024 16:12:23 GMT
- Title: An Empirical Study of Sensitive Information in Logs
- Authors: Roozbeh Aghili, Heng Li, Foutse Khomh
- Abstract summary: The presence of sensitive information in software logs poses significant privacy concerns.
This study offers a comprehensive analysis of privacy in software logs from multiple perspectives.
Our findings shed light on various perspectives of log privacy and reveal industry challenges.
- Score: 12.980238412281471
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Software logs, generated during the runtime of software systems, are essential for various development and analysis activities, such as anomaly detection and failure diagnosis. However, the presence of sensitive information in these logs poses significant privacy concerns, particularly regarding Personally Identifiable Information (PII) and quasi-identifiers that could lead to re-identification risks. While general data privacy has been extensively studied, the specific domain of privacy in software logs remains underexplored, with inconsistent definitions of sensitivity and a lack of standardized guidelines for anonymization. To address this gap, this study offers a comprehensive analysis of privacy in software logs from multiple perspectives. We start by analyzing 25 publicly available log datasets to identify potentially sensitive attributes. Based on the results of this step, we focus on three perspectives: privacy regulations, research literature, and industry practices. We first analyze key data privacy regulations, such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), to understand the legal requirements concerning sensitive information in logs. Second, we conduct a systematic literature review to identify common privacy attributes and practices in log anonymization, revealing gaps in existing approaches. Finally, we survey 45 industry professionals to capture practical insights on log anonymization practices. Our findings shed light on various perspectives of log privacy and reveal industry challenges, such as technical and efficiency issues, while highlighting the need for standardized guidelines. By combining insights from regulatory, academic, and industry perspectives, our study aims to provide a clearer framework for identifying and protecting sensitive information in software logs.
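For concreteness, a minimal sketch of the kind of attribute scan the first step implies (the regular-expression patterns and sample log lines below are illustrative assumptions, not the study's actual detection rules):

```python
import re

# Illustrative PII patterns; a real log-privacy scan would need far
# broader coverage (names, quasi-identifiers, domain-specific IDs).
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def flag_sensitive(log_lines):
    """Yield (line_number, attribute, match) for each suspected PII hit."""
    for lineno, line in enumerate(log_lines, start=1):
        for attribute, pattern in PII_PATTERNS.items():
            for match in pattern.finditer(line):
                yield lineno, attribute, match.group()

sample = [
    "2024-09-17 16:12:23 INFO login ok user=alice@example.com",
    "2024-09-17 16:12:24 WARN request from 192.168.0.7 denied",
]
for hit in flag_sensitive(sample):
    print(hit)  # e.g. (1, 'email', 'alice@example.com')
```

Pattern-based scans catch direct identifiers but miss quasi-identifiers, which is part of why the paper argues for standardized guidelines.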
Related papers
- Privacy-Preserving Redaction of Diagnosis Data through Source Code Analysis [4.721903499874626]
We argue for a source code analysis approach for log redaction.
To identify a log message containing sensitive information, our method locates the corresponding log statement in the source code.
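To illustrate the general idea, a sketch assuming Python sources and the standard logging module (the paper's own tooling and target languages may differ):

```python
import ast

def find_log_statements(source: str, filename: str = "<src>"):
    """Return (line number, static message template) for logging-style calls.

    Sketch: matches any call whose method name looks like a logging level;
    a real tool would also resolve the receiver to an actual Logger.
    """
    hits = []
    for node in ast.walk(ast.parse(source, filename)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr in {"debug", "info", "warning", "error"}):
            first = node.args[0] if node.args else None
            template = first.value if isinstance(first, ast.Constant) else None
            hits.append((node.lineno, template))
    return hits

src = 'import logging\nlogging.info("user %s logged in", user_email)\n'
print(find_log_statements(src))  # [(2, 'user %s logged in')]
```

Locating the statement gives access to variable names and types that the rendered log message alone does not expose.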
arXiv Detail & Related papers (2024-09-26T04:41:55Z)
- Collection, usage and privacy of mobility data in the enterprise and public administrations [55.2480439325792]
Security measures such as anonymization are needed to protect individuals' privacy.
Within our study, we conducted expert interviews to gain insights into practices in the field.
We survey privacy-enhancing methods in use, which generally do not comply with state-of-the-art standards of differential privacy.
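As a toy example of the generalization-style anonymization such studies report (field granularities are assumptions for illustration):

```python
from datetime import datetime

def generalize_record(lat: float, lon: float, ts: datetime):
    """Coarsen a mobility record: ~1 km spatial grid, 1 h time buckets.

    Generalization shrinks the set of distinguishable records but,
    unlike differential privacy, gives no formal re-identification bound.
    """
    return round(lat, 2), round(lon, 2), ts.replace(minute=0, second=0, microsecond=0)

print(generalize_record(48.85837, 2.29448, datetime(2024, 7, 4, 8, 29, 27)))
# (48.86, 2.29, datetime.datetime(2024, 7, 4, 8, 0))
```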
arXiv Detail & Related papers (2024-07-04T08:29:27Z)
- A Summary of Privacy-Preserving Data Publishing in the Local Setting [0.6749750044497732]
Statistical Disclosure Control aims to minimize the risk of exposing confidential information by de-identifying it.
We outline the current privacy-preserving techniques employed in microdata de-identification, delve into privacy measures tailored for various disclosure scenarios, and assess metrics for information loss and predictive performance.
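One concrete disclosure-risk measure from this literature is k-anonymity; a minimal check over assumed quasi-identifier columns:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Smallest equivalence-class size over the quasi-identifier columns."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values())

rows = [
    {"zip": "75001", "age": "30-39", "diagnosis": "flu"},
    {"zip": "75001", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "75002", "age": "40-49", "diagnosis": "flu"},
]
print(k_anonymity(rows, ["zip", "age"]))  # 1: the 75002 row is unique
```

A dataset is k-anonymous when every quasi-identifier combination occurs at least k times; raising k lowers re-identification risk at the cost of information loss.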
arXiv Detail & Related papers (2023-12-19T04:23:23Z)
- PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
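A hypothetical sketch of what such positive/negative tuning data could look like (field names and wording are assumptions, not the paper's format):

```python
# Contrastive instruction-tuning pairs: the positive example answers
# while withholding PII; the negative example marks a leaky completion.
examples = [
    {
        "instruction": "Summarize this support ticket.",
        "input": "Customer Jane Doe (jane@example.com) reports a login bug.",
        "output": "A customer reports a login bug.",
        "label": "positive",  # PII withheld
    },
    {
        "instruction": "Summarize this support ticket.",
        "input": "Customer Jane Doe (jane@example.com) reports a login bug.",
        "output": "Jane Doe (jane@example.com) reports a login bug.",
        "label": "negative",  # leaks name and email
    },
]
```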
arXiv Detail & Related papers (2023-10-03T22:37:01Z)
- A Unified View of Differentially Private Deep Generative Modeling [60.72161965018005]
Data with privacy concerns comes with stringent regulations that frequently prohibit data access and data sharing.
Overcoming these obstacles is key for technological progress in many real-world application scenarios that involve privacy sensitive data.
Differentially private (DP) data publishing provides a compelling solution, where only a sanitized form of the data is publicly released.
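A minimal sketch of the standard Laplace mechanism underlying much DP publishing (the query and epsilon are illustrative):

```python
import numpy as np

def laplace_count(data, predicate, epsilon: float, sensitivity: float = 1.0):
    """Release a count under epsilon-DP via the Laplace mechanism.

    A counting query has sensitivity 1: adding or removing one record
    changes the true count by at most 1, so noise scales as 1/epsilon.
    """
    true_count = sum(1 for x in data if predicate(x))
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

ages = [23, 35, 41, 29, 52]
print(laplace_count(ages, lambda a: a >= 30, epsilon=1.0))  # true count 3, plus noise
```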
arXiv Detail & Related papers (2023-09-27T14:38:16Z)
- Security and Privacy on Generative Data in AIGC: A Survey [17.456578314457612]
We review security and privacy issues concerning generative data in AIGC.
We summarize the successes of state-of-the-art countermeasures in terms of the foundational properties of privacy, controllability, authenticity, and compliance.
arXiv Detail & Related papers (2023-09-18T02:35:24Z)
- Helping Code Reviewer Prioritize: Pinpointing Personal Data and its Processing [0.9238700679836852]
We have designed two specialized views to help code reviewers in prioritizing their work related to personal data.
Our approach, evaluated on four open-source GitHub applications, demonstrated a precision rate of 0.87 in identifying personal data flows.
This solution, designed to streamline privacy-related analysis tasks such as compiling the Record of Processing Activities (ROPA), aims to save code reviewers time and enhance their productivity.
arXiv Detail & Related papers (2023-06-20T12:30:46Z)
- Private Domain Adaptation from a Public Source [48.83724068578305]
We design differentially private discrepancy-based algorithms for adaptation from a source domain with public labeled data to a target domain with unlabeled private data.
Our solutions are based on private variants of Frank-Wolfe and Mirror-Descent algorithms.
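Roughly, a private mirror-descent step has the form below (notation assumed here, not the paper's exact algorithm): with mirror map \psi, step size \eta, gradient g_t, and noise b_t calibrated to the gradient's sensitivity and the privacy budget,

$$ w_{t+1} = \arg\min_{w \in \mathcal{W}} \; \eta \langle g_t + b_t, w \rangle + D_\psi(w, w_t), $$

where D_\psi(w, w_t) is the Bregman divergence induced by \psi; privacy comes from the noise, and the mirror map adapts the update to the geometry of the constraint set \mathcal{W}.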
arXiv Detail & Related papers (2022-08-12T06:52:55Z)
- Distributed Machine Learning and the Semblance of Trust [66.1227776348216]
Federated Learning (FL) allows data owners to maintain data governance and perform model training locally, without having to share their data.
FL and related techniques are often described as privacy-preserving.
We explain why this term is not appropriate and outline the risks associated with over-reliance on protocols that were not designed with formal definitions of privacy in mind.
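A bare-bones FedAvg round (a numpy sketch, not any specific FL framework) makes the leakage surface visible: raw data stays local, but the transmitted updates are a function of that data.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, steps=10):
    """One client's local linear-regression SGD; raw (X, y) never leave."""
    w = weights.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fedavg_round(weights, clients):
    """Server averages client models; it sees updates, not raw data."""
    return np.mean([local_update(weights, X, y) for X, y in clients], axis=0)

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(4)]
print(fedavg_round(np.zeros(3), clients))  # the updates themselves can leak (X, y)
```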
arXiv Detail & Related papers (2021-12-21T08:44:05Z)
- Reinforcement Learning on Encrypted Data [58.39270571778521]
We present a preliminary, experimental study of how a DQN agent trained on encrypted states performs in environments with discrete and continuous state spaces.
Our results highlight that the agent is still capable of learning in small state spaces even in the presence of non-deterministic encryption, but performance collapses in more complex environments.
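To see why the non-deterministic case is hard, a toy sketch (an insecure stand-in cipher, not the paper's scheme): the same state encrypts to a different ciphertext on every call, so value estimates cannot simply be memorized per observation.

```python
import os

def toy_encrypt(state: bytes, key: bytes) -> bytes:
    """Toy non-deterministic cipher: random nonce prepended, XOR keystream.

    Illustration only: insecure, and assumes len(key) >= len(state).
    """
    nonce = os.urandom(len(state))
    keystream = bytes(k ^ n for k, n in zip(key, nonce))
    return nonce + bytes(s ^ ks for s, ks in zip(state, keystream))

key = os.urandom(8)
state = b"s_42"
print(toy_encrypt(state, key) != toy_encrypt(state, key))  # True: same state, new ciphertext
```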
arXiv Detail & Related papers (2021-09-16T21:59:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.