Scalable Data Classification for Security and Privacy
- URL: http://arxiv.org/abs/2006.14109v5
- Date: Mon, 6 Jul 2020 20:03:21 GMT
- Title: Scalable Data Classification for Security and Privacy
- Authors: Paulo Tanaka, Sameet Sapra, Nikolay Laptev
- Abstract summary: This paper is about an end-to-end system built to detect sensitive semantic types within Facebook at scale.
The described system is in production, achieving a 0.9+ average F2 score across various privacy classes.
- Score: 0.06445605125467573
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Content-based data classification is an open challenge. Traditional Data Loss
Prevention (DLP)-like systems solve this problem by fingerprinting the data in
question and monitoring endpoints for the fingerprinted data. With a large
number of constantly changing data assets at Facebook, this approach is
neither scalable nor effective at discovering what data is where. This paper is
about an end-to-end system built to detect sensitive semantic types within
Facebook at scale and enforce data retention and access controls automatically.
The approach described here is our first end-to-end privacy system that
attempts to solve this problem by incorporating data signals, machine learning,
and traditional fingerprinting techniques to map out and classify all data
within Facebook. The described system is in production, achieving a 0.9+ average
F2 score across various privacy classes while handling a large number of data
assets across dozens of data stores.
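To make the combination of fingerprint-style data signals and learned classification concrete, here is a minimal Python sketch of scoring a data column against sensitive semantic types, plus an F2 helper matching the metric reported above. The patterns, type names, and example values are illustrative assumptions, not the production rules described in the paper.

```python
# Hypothetical sketch: regex "fingerprints" as data signals for semantic types.
# A production system would combine such signals with ML features, per the paper.
import re

PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def classify_column(values):
    """Return, per semantic type, the fraction of values matching its pattern."""
    scores = {}
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.search(str(v)))
        scores[label] = hits / max(len(values), 1)
    return scores

def f2(precision, recall):
    """F-beta with beta=2 weighs recall over precision: 5PR / (4P + R)."""
    return 0.0 if precision + recall == 0 else 5 * precision * recall / (4 * precision + recall)

column = ["alice@example.com", "bob@example.org", "n/a"]
print(classify_column(column))  # email scores 2/3; phone and ipv4 score 0
print(round(f2(precision=0.85, recall=0.95), 3))  # 0.928
```

F2 rather than F1 is a natural choice here because missing sensitive data is presumably costlier than a false alarm, so recall is weighted higher.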
Related papers
- FT-PrivacyScore: Personalized Privacy Scoring Service for Machine Learning Participation [4.772368796656325]
In practice, controlled data access remains a mainstream method for protecting data privacy in many industrial and research environments.
We developed the demo prototype FT-PrivacyScore to show that it's possible to efficiently and quantitatively estimate the privacy risk of participating in a model fine-tuning task.
arXiv Detail & Related papers (2024-10-30T02:41:26Z)
- Masked Differential Privacy [64.32494202656801]
We propose an effective approach called masked differential privacy (DP), which allows for controlling sensitive regions where differential privacy is applied.
Our method operates selectively on the data and allows for defining non-sensitive spatio-temporal regions without DP application, or for combining differential privacy with other privacy techniques within data samples.
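A minimal sketch of the masking idea, assuming a standard Laplace mechanism and a boolean sensitivity mask; the paper's actual mechanism, data modality, and parameters may differ.

```python
# Illustrative masked-DP sketch: noise is injected only where the mask is True.
import numpy as np

def masked_laplace(data, sensitive_mask, sensitivity=1.0, epsilon=1.0):
    """Apply Laplace(sensitivity / epsilon) noise only to masked entries."""
    noisy = np.array(data, dtype=float)
    noise = np.random.laplace(scale=sensitivity / epsilon, size=noisy.shape)
    noisy[sensitive_mask] += noise[sensitive_mask]
    return noisy

data = np.arange(12.0).reshape(3, 4)
mask = np.zeros_like(data, dtype=bool)
mask[0, :2] = True                 # mark a sensitive region
print(masked_laplace(data, mask))  # only the two masked cells are perturbed
```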
arXiv Detail & Related papers (2024-10-22T15:22:53Z)
- Footprints of Data in a Classifier Model: The Privacy Issues and Their Mitigation through Data Obfuscation [0.9208007322096533]
The embedding of training-data footprints in a prediction model is one such facet.
The gap in performance between training and test data enables passive identification of the data that trained the model.
This research focuses on addressing the vulnerability arising from these data footprints.
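A minimal sketch of how that performance gap leaks membership, using the classic loss-threshold membership-inference baseline on synthetic data; the model, threshold, and dataset are illustrative assumptions, not the paper's setup.

```python
# Illustrative loss-threshold membership inference: training points tend to
# have lower loss than unseen points, so a simple threshold separates them.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

def per_sample_loss(model, X, y):
    """Negative log-likelihood of the true label for each sample."""
    p = model.predict_proba(X)[np.arange(len(y)), y]
    return -np.log(np.clip(p, 1e-12, 1.0))

tau = 0.1  # illustrative decision threshold
members = (per_sample_loss(model, X_tr, y_tr) < tau).mean()
nonmembers = (per_sample_loss(model, X_te, y_te) < tau).mean()
print(f"flagged as members: train={members:.2f}, test={nonmembers:.2f}")
```

Obfuscation defenses of the kind the paper studies aim to shrink exactly this train-test loss gap.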
arXiv Detail & Related papers (2024-07-02T13:56:37Z)
- When Graph Convolution Meets Double Attention: Online Privacy Disclosure Detection with Multi-Label Text Classification [6.700420953065072]
It is important to detect such unwanted privacy disclosures in order to alert the affected people and the online platform.
In this paper, privacy disclosure detection is modeled as a multi-label text classification problem.
A new privacy disclosure detection model is proposed to construct an MLTC classifier for detecting online privacy disclosures.
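As a minimal illustration of the multi-label framing only (not the paper's graph-convolution and double-attention architecture), a TF-IDF one-vs-rest pipeline on a hypothetical toy corpus:

```python
# Illustrative MLTC setup: each post may disclose several privacy types at once.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

posts = [
    "my new phone number is 555-0100, call me",
    "just moved to 12 Elm Street, housewarming soon",
    "what a great game last night",
    "email me at alice@example.com about the flat on Elm Street",
]
labels = [{"phone"}, {"address"}, set(), {"email", "address"}]  # toy annotations

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # one binary column per disclosure type
clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
clf.fit(posts, Y)

pred = clf.predict(["reach me at bob@example.com"])
print(mlb.inverse_transform(pred))  # disclosure labels predicted for the post
```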
arXiv Detail & Related papers (2023-11-27T15:25:17Z)
- A Unified View of Differentially Private Deep Generative Modeling [60.72161965018005]
Data with privacy concerns comes with stringent regulations that frequently prohibit data access and data sharing.
Overcoming these obstacles is key for technological progress in many real-world application scenarios that involve privacy sensitive data.
Differentially private (DP) data publishing provides a compelling solution, where only a sanitized form of the data is publicly released.
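A minimal sketch of the DP-publishing pattern, assuming the simplest case of a Laplace-sanitized histogram; the deep generative models surveyed in the paper replace this one-shot release with a trained, privately-learned generator.

```python
# Illustrative DP release: only noisy aggregate counts leave the data holder.
import numpy as np

def dp_histogram(values, bins, epsilon=1.0):
    """One person changes one bin count by at most 1 (L1 sensitivity 1),
    so Laplace(1/epsilon) noise per bin gives epsilon-DP."""
    counts, edges = np.histogram(values, bins=bins)
    noisy = counts + np.random.laplace(scale=1.0 / epsilon, size=counts.shape)
    return np.clip(np.round(noisy), 0, None), edges

ages = np.random.randint(18, 90, size=1000)  # toy sensitive attribute
sanitized, edges = dp_histogram(ages, bins=8, epsilon=0.5)
print(sanitized)  # publishable; the raw ages are never released
```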
arXiv Detail & Related papers (2023-09-27T14:38:16Z)
- Privacy Side Channels in Machine Learning Systems [87.53240071195168]
We introduce privacy side channels: attacks that exploit system-level components to extract private information.
For example, we show that deduplicating training data before applying differentially-private training creates a side-channel that completely invalidates any provable privacy guarantees.
We further show that systems which block language models from regenerating training data can be exploited to exfiltrate private keys contained in the training set.
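A toy sketch of the deduplication side channel, assuming a deduplicator that drops every record occurring more than once: removing a single raw record then changes more than one record in the training set, so the "neighboring datasets differ in one element" premise of the DP analysis no longer holds.

```python
# Illustrative only: dedup makes one record's survival depend on another's presence.
from collections import Counter

def dedup_unique_only(records):
    """Keep only records that occur exactly once (an assumed dedup policy)."""
    counts = Counter(records)
    return [r for r in records if counts[r] == 1]

with_alice = ["secret-token", "secret-token", "hello"]  # Alice and Bob hold copies
without_alice = ["secret-token", "hello"]               # drop only Alice's record

print(dedup_unique_only(with_alice))     # ['hello'] -- Bob's record vanished too
print(dedup_unique_only(without_alice))  # ['secret-token', 'hello']
```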
arXiv Detail & Related papers (2023-09-11T16:49:05Z)
- Privacy-Preserving Graph Machine Learning from Data to Computation: A Survey [67.7834898542701]
We focus on reviewing privacy-preserving techniques of graph machine learning.
We first review methods for generating privacy-preserving graph data.
Then we describe methods for transmitting privacy-preserved information.
arXiv Detail & Related papers (2023-07-10T04:30:23Z)
- Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks [70.39633252935445]
Data contamination has become prevalent and challenging with the rise of models pretrained on large automatically-crawled corpora.
For closed models, the training data becomes a trade secret, and even for open models, it is not trivial to detect contamination.
We propose three strategies that can make a difference: (1) test data made public should be encrypted with a public key and licensed to disallow derivative distribution; (2) demand training exclusion controls from closed API holders, and protect your test data by refusing to evaluate without them; and (3) avoid data which appears with its solution on the internet, and release the web-page context of internet-derived data.
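One possible realization of strategy (1), sketched with the `cryptography` package: hybrid encryption so that crawled copies of a test set are ciphertext. The key handling and file contents here are assumptions for illustration, not the paper's protocol.

```python
# Illustrative hybrid encryption of test data: RSA-OAEP wraps a Fernet key.
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

test_data = b"question: 2+2=?\tanswer: 4\n"  # toy benchmark row

sym_key = Fernet.generate_key()                  # fresh symmetric key
ciphertext = Fernet(sym_key).encrypt(test_data)  # bulk data encryption
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
wrapped_key = public_key.encrypt(sym_key, oaep)  # key wrapping

# Upload (ciphertext, wrapped_key); holders of the private key decrypt locally.
plaintext = Fernet(private_key.decrypt(wrapped_key, oaep)).decrypt(ciphertext)
assert plaintext == test_data
```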
arXiv Detail & Related papers (2023-05-17T12:23:38Z)
- A Survey on Differential Privacy with Machine Learning and Future Outlook [0.0]
Differential privacy is used to protect machine learning models from attacks and vulnerabilities.
This survey paper presents different differentially private machine learning algorithms categorized into two main categories.
arXiv Detail & Related papers (2022-11-19T14:20:53Z)
- BeeTrace: A Unified Platform for Secure Contact Tracing that Breaks Data Silos [73.84437456144994]
Contact tracing is an important method to control the spread of an infectious disease such as COVID-19.
Current solutions do not utilize the huge volume of data stored in business databases and individual digital devices.
We propose BeeTrace, a unified platform that breaks data silos and deploys state-of-the-art cryptographic protocols to guarantee privacy goals.
arXiv Detail & Related papers (2020-07-05T10:33:45Z)
- Security and Privacy Preserving Deep Learning [2.322461721824713]
The massive data collection required for deep learning presents obvious privacy issues.
Users' personal, highly sensitive data, such as photos and voice recordings, are kept indefinitely by the companies that collect them.
Deep neural networks are susceptible to various inference attacks as they remember information about their training data.
arXiv Detail & Related papers (2020-06-23T01:53:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.