Scalable Data Classification for Security and Privacy
- URL: http://arxiv.org/abs/2006.14109v5
- Date: Mon, 6 Jul 2020 20:03:21 GMT
- Title: Scalable Data Classification for Security and Privacy
- Authors: Paulo Tanaka, Sameet Sapra, Nikolay Laptev
- Abstract summary: This paper is about an end-to-end system built to detect sensitive semantic types within Facebook at scale.
The described system is in production achieving a 0.9+ average F2 scores across various privacy classes.
- Score: 0.06445605125467573
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Content based data classification is an open challenge. Traditional Data Loss
Prevention (DLP)-like systems solve this problem by fingerprinting the data in
question and monitoring endpoints for the fingerprinted data. With a large
number of constantly changing data assets in Facebook, this approach is both
not scalable and ineffective in discovering what data is where. This paper is
about an end-to-end system built to detect sensitive semantic types within
Facebook at scale and enforce data retention and access controls automatically.
The approach described here is our first end-to-end privacy system that
attempts to solve this problem by incorporating data signals, machine learning,
and traditional fingerprinting techniques to map out and classify all data
within Facebook. The described system is in production achieving a 0.9+ average
F2 scores across various privacy classes while handling a large number of data
assets across dozens of data stores.
Related papers
- Fingerprinting and Tracing Shadows: The Development and Impact of Browser Fingerprinting on Digital Privacy [55.2480439325792]
Browser fingerprinting is a growing technique for identifying and tracking users online without traditional methods like cookies.
This paper gives an overview by examining the various fingerprinting techniques and analyzes the entropy and uniqueness of the collected data.
arXiv Detail & Related papers (2024-11-18T20:32:31Z) - FT-PrivacyScore: Personalized Privacy Scoring Service for Machine Learning Participation [4.772368796656325]
In practice, controlled data access remains a mainstream method for protecting data privacy in many industrial and research environments.
We developed the demo prototype FT-PrivacyScore to show that it's possible to efficiently and quantitatively estimate the privacy risk of participating in a model fine-tuning task.
arXiv Detail & Related papers (2024-10-30T02:41:26Z) - Footprints of Data in a Classifier Model: The Privacy Issues and Their Mitigation through Data Obfuscation [0.9208007322096533]
embedding of footprints of training data in a prediction model is one such facet.
difference in performance quality in test and training data causes passive identification of data that have trained the model.
This research focuses on addressing the vulnerability arising from the data footprints.
arXiv Detail & Related papers (2024-07-02T13:56:37Z) - When Graph Convolution Meets Double Attention: Online Privacy Disclosure Detection with Multi-Label Text Classification [6.700420953065072]
It is important to detect such unwanted privacy disclosures to help alert people affected and the online platform.
In this paper, privacy disclosure detection is modeled as a multi-label text classification problem.
A new privacy disclosure detection model is proposed to construct an MLTC classifier for detecting online privacy disclosures.
arXiv Detail & Related papers (2023-11-27T15:25:17Z) - A Unified View of Differentially Private Deep Generative Modeling [60.72161965018005]
Data with privacy concerns comes with stringent regulations that frequently prohibited data access and data sharing.
Overcoming these obstacles is key for technological progress in many real-world application scenarios that involve privacy sensitive data.
Differentially private (DP) data publishing provides a compelling solution, where only a sanitized form of the data is publicly released.
arXiv Detail & Related papers (2023-09-27T14:38:16Z) - Privacy Side Channels in Machine Learning Systems [87.53240071195168]
We introduce privacy side channels: attacks that exploit system-level components to extract private information.
For example, we show that deduplicating training data before applying differentially-private training creates a side-channel that completely invalidates any provable privacy guarantees.
We further show that systems which block language models from regenerating training data can be exploited to exfiltrate private keys contained in the training set.
arXiv Detail & Related papers (2023-09-11T16:49:05Z) - Privacy-Preserving Graph Machine Learning from Data to Computation: A
Survey [67.7834898542701]
We focus on reviewing privacy-preserving techniques of graph machine learning.
We first review methods for generating privacy-preserving graph data.
Then we describe methods for transmitting privacy-preserved information.
arXiv Detail & Related papers (2023-07-10T04:30:23Z) - Stop Uploading Test Data in Plain Text: Practical Strategies for
Mitigating Data Contamination by Evaluation Benchmarks [70.39633252935445]
Data contamination has become prevalent and challenging with the rise of models pretrained on large automatically-crawled corpora.
For closed models, the training data becomes a trade secret, and even for open models, it is not trivial to detect contamination.
We propose three strategies that can make a difference: (1) Test data made public should be encrypted with a public key and licensed to disallow derivative distribution; (2) demand training exclusion controls from closed API holders, and protect your test data by refusing to evaluate without them; and (3) avoid data which appears with its solution on the internet, and release the web-page context of internet-derived
arXiv Detail & Related papers (2023-05-17T12:23:38Z) - A Survey on Differential Privacy with Machine Learning and Future
Outlook [0.0]
differential privacy is used to protect machine learning models from any attacks and vulnerabilities.
This survey paper presents different differentially private machine learning algorithms categorized into two main categories.
arXiv Detail & Related papers (2022-11-19T14:20:53Z) - BeeTrace: A Unified Platform for Secure Contact Tracing that Breaks Data
Silos [73.84437456144994]
Contact tracing is an important method to control the spread of an infectious disease such as COVID-19.
Current solutions do not utilize the huge volume of data stored in business databases and individual digital devices.
We propose BeeTrace, a unified platform that breaks data silos and deploys state-of-the-art cryptographic protocols to guarantee privacy goals.
arXiv Detail & Related papers (2020-07-05T10:33:45Z) - Security and Privacy Preserving Deep Learning [2.322461721824713]
Massive data collection required for deep learning presents obvious privacy issues.
Users personal, highly sensitive data such as photos and voice recordings are kept indefinitely by the companies that collect it.
Deep neural networks are susceptible to various inference attacks as they remember information about their training data.
arXiv Detail & Related papers (2020-06-23T01:53:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.