Deep Active Learning with Crowdsourcing Data for Privacy Policy
Classification
- URL: http://arxiv.org/abs/2008.02954v1
- Date: Fri, 7 Aug 2020 02:13:31 GMT
- Title: Deep Active Learning with Crowdsourcing Data for Privacy Policy
Classification
- Authors: Wenjun Qiu and David Lie
- Abstract summary: Active learning and crowdsourcing techniques are used to develop an automated classification tool named Calpric.
Calpric produces annotations equivalent to those of skilled human annotators with high accuracy while minimizing the labeling cost.
Our model is able to achieve the same F1 score using only 62% of the original labeling effort.
- Score: 6.5443502434659955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Privacy policies are statements that notify users of the services' data
practices. However, few users are willing to read through policy texts due to
their length and complexity. While automated tools based on machine learning
exist for privacy policy analysis, to achieve high classification accuracy,
classifiers need to be trained on a large labeled dataset. Most existing policy
corpora are labeled by skilled human annotators, requiring a significant amount
of labor and effort. In this paper, we leverage active learning and
crowdsourcing techniques to develop an automated classification tool named
Calpric (Crowdsourcing Active Learning PRIvacy Policy Classifier), which
produces annotations equivalent to those of skilled human annotators with high
accuracy while minimizing the labeling cost. Specifically, active
learning allows classifiers to proactively select the most informative segments
to be labeled. On average, our model is able to achieve the same F1 score using
only 62% of the original labeling effort. Calpric's use of active learning also
addresses the naturally occurring class imbalance in unlabeled privacy policy
datasets: there are many more statements asserting the collection of private
information than denying it. By selecting samples from
the minority class for labeling, Calpric automatically creates a more balanced
training set.
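
For concreteness, here is a minimal sketch of the pool-based, least-confidence acquisition step the abstract describes. The `predict_proba` interface, scoring rule, and batch size are assumptions for illustration, not Calpric's actual implementation:

```python
import numpy as np

def select_batch(model, pool_texts, batch_size=32):
    """Pick the unlabeled segments the classifier is least sure about,
    then send them to crowd annotators for labeling.

    `model.predict_proba` is a hypothetical interface returning an
    (n_samples, n_classes) array of class probabilities.
    """
    probs = model.predict_proba(pool_texts)        # (n, k)
    uncertainty = 1.0 - probs.max(axis=1)          # least-confidence score
    # The most uncertain segments often come from the minority class
    # ("does not collect"), so labeling them rebalances the training set.
    return np.argsort(uncertainty)[::-1][:batch_size]
```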
Related papers
- Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns weights based on the training dynamics of the classifiers to the distantly supervised labels.
By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data.
The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines.
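A rough sketch of the weighting idea follows; the function names are assumptions, and the paper derives its weights from the classifiers' training dynamics rather than the placeholder used here:

```python
import torch
import torch.nn.functional as F

def weighted_distant_loss(logits, distant_labels, importance_weights):
    """Cross-entropy over distantly supervised labels, scaled by
    per-example importance weights rather than filtered by a hard
    confidence threshold, so every automatically labeled example
    contributes in proportion to its estimated reliability."""
    per_example = F.cross_entropy(logits, distant_labels, reduction="none")
    return (importance_weights * per_example).mean()
```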
arXiv Detail & Related papers (2024-06-20T18:35:47Z)
- Calpric: Inclusive and Fine-grain Labeling of Privacy Policies with Crowdsourcing and Active Learning [5.279873919047532]
We present Calpric, which combines automatic text selection and segmentation, active learning and the use of crowdsourced annotators to generate a large, balanced training set for privacy policies at low cost.
Calpric's training process also generates a labeled data set of 16K privacy policy text segments across 9 Data categories with balanced positive and negative samples.
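One plausible aggregation step for the crowdsourced labels is shown below; this is a generic majority-vote baseline, and the agreement threshold is an assumption rather than Calpric's actual quality control:

```python
from collections import Counter

def aggregate_crowd_labels(annotations, min_agreement=0.6):
    """Majority vote over one segment's crowd annotations.

    Returns (label, accepted); segments without enough agreement can be
    re-queued for more annotators instead of entering the training set.
    """
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(annotations) >= min_agreement
```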
arXiv Detail & Related papers (2024-01-16T01:27:26Z)
- Exploring Vacant Classes in Label-Skewed Federated Learning [113.65301899666645]
Label skews, characterized by disparities in local label distribution across clients, pose a significant challenge in federated learning.
This paper introduces FedVLS, a novel approach to label-skewed federated learning that integrates vacant-class distillation and logit suppression simultaneously.
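A rough reading of the logit-suppression term: penalize logits on classes a client holds no samples of. The mask-based penalty below is an assumption for illustration; the paper's exact loss may differ:

```python
import torch

def logit_suppression_loss(logits, present_mask):
    """logits: (batch, n_classes); present_mask: bool (n_classes,)
    marking classes that appear in the client's local data."""
    vacant = ~present_mask
    if not vacant.any():
        return logits.new_zeros(())  # no vacant classes on this client
    # Mean squared magnitude of logits assigned to vacant classes.
    return (logits[:, vacant] ** 2).mean()
```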
arXiv Detail & Related papers (2024-01-04T16:06:31Z)
- XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners [71.8257151788923]
We propose a novel Explainable Active Learning framework (XAL) for low-resource text classification.
XAL encourages classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations.
Experiments on six datasets show that XAL achieves consistent improvement over 9 strong baselines.
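The acquisition idea can be sketched as ranking unlabeled examples by high predictive uncertainty and low explanation quality; the linear mix and `alpha` below are assumptions, not XAL's actual score:

```python
import numpy as np

def xal_rank(uncertainty, explanation_quality, alpha=0.5):
    """Rank unlabeled examples for annotation: prefer those the
    classifier is unsure about AND cannot explain well."""
    score = alpha * uncertainty + (1.0 - alpha) * (1.0 - explanation_quality)
    return np.argsort(score)[::-1]  # indices, most informative first
```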
arXiv Detail & Related papers (2023-10-09T08:07:04Z)
- Trustable Co-label Learning from Multiple Noisy Annotators [68.59187658490804]
Supervised deep learning depends on a massive number of accurately annotated examples.
A typical alternative is learning from multiple noisy annotators.
This paper proposes a data-efficient approach called Trustable Co-label Learning (TCL).
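A generic starting point for learning from several noisy annotators is to form soft co-labels from the vote matrix; this is a baseline sketch, not TCL's exact formulation:

```python
import numpy as np

def soft_colabels(votes, n_classes):
    """votes: (n_samples, n_annotators) ints, with -1 where an annotator
    skipped the example. Returns (n_samples, n_classes) soft labels."""
    soft = np.zeros((votes.shape[0], n_classes))
    for i, row in enumerate(votes):
        valid = row[row >= 0]
        for v in valid:
            soft[i, v] += 1.0
        if valid.size:
            soft[i] /= valid.size  # normalize votes to a distribution
    return soft
```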
arXiv Detail & Related papers (2022-03-08T16:57:00Z)
- Dominant Set-based Active Learning for Text Classification and its Application to Online Social Media [0.0]
We present a novel pool-based active learning method for training on large unlabeled corpora with minimal annotation cost.
Our proposed method does not have any parameters to be tuned, making it dataset-independent.
Our method achieves a higher performance in comparison to the state-of-the-art active learning strategies.
arXiv Detail & Related papers (2022-01-28T19:19:03Z)
- Labels, Information, and Computation: Efficient, Privacy-Preserving Learning Using Sufficient Labels [0.0]
We show that we do not always need full label information on every single training example.
We call this statistic "sufficiently-labeled data" and prove its sufficiency and efficiency.
Sufficiently-labeled data naturally preserves user privacy by storing relative, instead of absolute, information.
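One simplified reading of "relative instead of absolute" labels: store same-class/different-class indicators over pairs rather than a class per example. This is illustrative only; the paper's construction is more specific:

```python
import itertools

def to_relative_labels(examples):
    """examples: list of (x, y) with absolute labels y.
    Returns (x1, x2, same) triples that never store y itself."""
    return [(x1, x2, int(y1 == y2))
            for (x1, y1), (x2, y2) in itertools.combinations(examples, 2)]
```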
arXiv Detail & Related papers (2021-04-19T02:15:25Z)
- ORDisCo: Effective and Efficient Usage of Incremental Unlabeled Data for Semi-supervised Continual Learning [52.831894583501395]
Continual learning typically assumes the incoming data are fully labeled, which may not hold in real applications.
We propose deep Online Replay with Discriminator Consistency (ORDisCo) to interdependently learn a classifier with a conditional generative adversarial network (GAN).
We show ORDisCo achieves significant performance improvement on various semi-supervised learning benchmark datasets for SSCL.
arXiv Detail & Related papers (2021-01-02T09:04:14Z)
- Uncertainty-aware Self-training for Text Classification with Few Labels [54.13279574908808]
We study self-training as one of the earliest semi-supervised learning approaches to reduce the annotation bottleneck.
We propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network.
We show that our methods, leveraging only 20-30 labeled samples per class per task for training and validation, perform within 3% of fully supervised pre-trained language models.
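One common way to obtain the uncertainty estimates the summary refers to is Monte Carlo dropout; the paper's exact estimator may differ:

```python
import torch

def mc_dropout_predict(model, x, passes=10):
    """Mean and variance of softmax outputs over stochastic forward
    passes; pseudo-labels with low variance are safer to self-train on."""
    model.train()  # keep dropout layers active at inference time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(passes)])
    return probs.mean(dim=0), probs.var(dim=0)
```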
arXiv Detail & Related papers (2020-06-27T08:13:58Z)
- A Pitfall of Learning from User-generated Data: In-depth Analysis of Subjective Class Problem [1.218340575383456]
We identify two types of classes in user-defined labels: subjective classes and objective classes.
We term this the subjective class issue and provide a framework for detecting subjective labels in a dataset without oracle querying.
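Without oracle queries, one observable signal of a subjective class is persistent annotator disagreement; the vote-entropy proxy below is an illustration, not the paper's exact framework:

```python
import numpy as np

def vote_entropy(votes, n_classes):
    """Shannon entropy of one item's crowd votes; items (and classes)
    whose entropy stays high across many annotators look subjective."""
    p = np.bincount(votes, minlength=n_classes) / len(votes)
    p = p[p > 0]  # drop empty classes before taking logs
    return float(-(p * np.log2(p)).sum())
```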
arXiv Detail & Related papers (2020-03-24T02:25:52Z)
- Rényi Entropy Bounds on the Active Learning Cost-Performance Tradeoff [27.436483977171328]
Semi-supervised classification studies how to combine the statistical knowledge of the often abundant unlabeled data with the often limited labeled data in order to maximize overall classification accuracy.
In this paper, we initiate the non-asymptotic analysis of the optimal policy for semi-supervised classification with actively obtained labeled data.
We provide the first characterization of the jointly optimal active learning and semi-supervised classification policy, in terms of the cost-performance tradeoff driven by the label query budget and overall classification accuracy.
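For reference, the Rényi entropy of order α that gives the bounds their name is

```latex
H_\alpha(X) \;=\; \frac{1}{1-\alpha}\,
  \log\!\Bigl(\sum_{i=1}^{n} p_i^{\alpha}\Bigr),
\qquad \alpha \ge 0,\ \alpha \ne 1,
```

which recovers the Shannon entropy in the limit α → 1.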
arXiv Detail & Related papers (2020-02-05T22:38:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.