A Learning oriented DLP System based on Classification Model
- URL: http://arxiv.org/abs/2312.13711v1
- Date: Thu, 21 Dec 2023 10:23:16 GMT
- Title: A Learning oriented DLP System based on Classification Model
- Authors: Kishu Gupta, Ashwani Kush
- Abstract summary: Data leakage is among the most critical issues faced by organizations.
To mitigate data leakage, organizations deploy data leakage prevention systems (DLPSs) at various levels.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data is a key asset for organizations, and data sharing is the
lifeline of organizational growth, yet sharing may also lead to data loss. Data
leakage is among the most critical issues organizations face. To mitigate data
leakage, organizations deploy data leakage prevention systems (DLPSs) at
various levels. DLPSs can protect all kinds of data, i.e., data at rest (DAR),
data in motion/in transit (DIM/DIT), and data in use (DIU). Statistical
analysis, regular expressions, and data fingerprinting are common approaches
used in DLP systems. Of these techniques, statistical analysis is the most
appropriate for the proposed DLP model of data security. This paper defines a
statistical DLP model for document classification. The model uses statistical
approaches such as TF-IDF (Term Frequency-Inverse Document Frequency), a
well-known term count/weighting function, vectorization, and gradient boosting
document classification to classify documents before allowing any access to
them. Machine learning is used to train and test the model. The proposed model
also introduces an efficient and more accurate approach, IGBCA (Improvised
Gradient Boosting Classification Algorithm), for document classification, to
prevent possible data leakage. Results depict that the proposed model can
classify documents with high accuracy, on the basis of which data can be
prevented from being lost.
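The TF-IDF weighting the abstract names can be sketched in a few lines. The snippet below is a minimal, self-contained illustration of the standard TF-IDF definition, not the paper's IGBCA; the example documents are made up. In the paper's pipeline, a gradient-boosted classifier would then be trained on the resulting weight vectors to label documents before access is granted.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    tf(t, d) = count of t in d / number of tokens in d
    idf(t)   = log(N / df(t)), where df(t) = number of docs containing t
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency: one count per doc
    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        weights.append({t: (c / total) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

# Hypothetical documents standing in for sensitive vs. public content.
docs = [
    "payroll data for all employees".split(),
    "public blog post about company culture".split(),
    "employee payroll records".split(),
]
w = tfidf(docs)
# "payroll" appears in 2 of 3 docs, so the rarer term "data" (1 of 3)
# receives a higher weight in the first document.
```

A classifier consuming these per-document weight dictionaries (after mapping them onto a fixed vocabulary vector) completes the classification stage the abstract describes.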
Related papers
- Adaptive Domain Inference Attack [6.336458796079136]
Existing model-targeted attacks assume the attacker has known the application domain or training data distribution.
Can removing the domain information from model APIs protect models from these attacks?
A proposed adaptive domain inference attack (ADI) can still successfully estimate relevant subsets of training data.
arXiv Detail & Related papers (2023-12-22T22:04:13Z)
- From Zero to Hero: Detecting Leaked Data through Synthetic Data Injection and Model Querying [10.919336198760808]
We introduce a novel methodology to detect leaked data that are used to train classification models.
LDSS involves injecting a small volume of synthetic data, characterized by local shifts in class distribution, into the owner's dataset.
This enables the effective identification of models trained on leaked data through model querying alone.
arXiv Detail & Related papers (2023-10-06T10:36:28Z)
- Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z)
- Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER).
Our method exploits self-supervised pretraining to learn good feature representations from the target data.
We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
arXiv Detail & Related papers (2022-10-11T08:24:50Z) - CAFA: Class-Aware Feature Alignment for Test-Time Adaptation [50.26963784271912]
Test-time adaptation (TTA) aims to address distribution shift by adapting a model to unlabeled data at test time.
We propose a simple yet effective feature alignment loss, termed as Class-Aware Feature Alignment (CAFA), which simultaneously encourages a model to learn target representations in a class-discriminative manner.
arXiv Detail & Related papers (2022-06-01T03:02:07Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence from labeled source data and predicts target accuracy as the fraction of unlabeled target examples whose confidence exceeds that threshold.
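The ATC idea summarized above can be sketched compactly: pick a threshold so that the fraction of source examples above it matches the source accuracy, then report the fraction of target examples above that same threshold. This is an illustrative simplification (the paper's full method uses richer confidence scores such as negative entropy); all confidences and correctness flags below are made up.

```python
def learn_threshold(source_conf, source_correct):
    """Pick t so that mean(conf > t) on the source set matches source accuracy."""
    acc = sum(source_correct) / len(source_correct)
    best_t, best_gap = 0.0, float("inf")
    for t in sorted(source_conf):            # scan observed confidences as candidates
        frac = sum(c > t for c in source_conf) / len(source_conf)
        if abs(frac - acc) < best_gap:
            best_t, best_gap = t, abs(frac - acc)
    return best_t

def predict_accuracy(target_conf, t):
    """Estimated target accuracy = fraction of target examples above threshold."""
    return sum(c > t for c in target_conf) / len(target_conf)

# Hypothetical labeled source confidences (accuracy = 4/6) and unlabeled target confidences.
source_conf = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4]
source_correct = [1, 1, 1, 1, 0, 0]
t = learn_threshold(source_conf, source_correct)   # threshold lands at 0.6
est = predict_accuracy([0.92, 0.85, 0.5, 0.3], t)  # 2 of 4 exceed it -> 0.5
```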
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- CvS: Classification via Segmentation For Small Datasets [52.821178654631254]
This paper presents CvS, a cost-effective classifier for small datasets that derives the classification labels from predicting the segmentation maps.
We evaluate the effectiveness of our framework on diverse problems showing that CvS is able to achieve much higher classification results compared to previous methods when given only a handful of examples.
arXiv Detail & Related papers (2021-10-29T18:41:15Z)
- The Word is Mightier than the Label: Learning without Pointillistic Labels using Data Programming [11.536162323162099]
Most advanced supervised Machine Learning (ML) models rely on vast amounts of point-by-point labelled training examples.
Hand-labelling vast amounts of data may be tedious, expensive, and error-prone.
arXiv Detail & Related papers (2021-08-24T19:11:28Z)
- Self-Trained One-class Classification for Unsupervised Anomaly Detection [56.35424872736276]
Anomaly detection (AD) has various applications across domains, from manufacturing to healthcare.
In this work, we focus on unsupervised AD problems whose entire training data are unlabeled and may contain both normal and anomalous samples.
To tackle this problem, we build a robust one-class classification framework via data refinement.
We show that our method outperforms the state-of-the-art one-class classification method by 6.3 points in AUC and 12.5 points in average precision.
arXiv Detail & Related papers (2021-06-11T01:36:08Z)
- Gradient-based Data Subversion Attack Against Binary Classifiers [9.414651358362391]
In this work, we focus on label contamination attack in which an attacker poisons the labels of data to compromise the functionality of the system.
We exploit the gradients of a differentiable convex loss function with respect to the predicted label as a warm-start and formulate different strategies to find a set of data instances to contaminate.
Our experiments show that the proposed approach outperforms the baselines and is computationally efficient.
arXiv Detail & Related papers (2021-05-31T09:04:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.