Calpric: Inclusive and Fine-grain Labeling of Privacy Policies with
Crowdsourcing and Active Learning
- URL: http://arxiv.org/abs/2401.08038v1
- Date: Tue, 16 Jan 2024 01:27:26 GMT
- Title: Calpric: Inclusive and Fine-grain Labeling of Privacy Policies with
Crowdsourcing and Active Learning
- Authors: Wenjun Qiu, David Lie, and Lisa Austin
- Abstract summary: We present Calpric, which combines automatic text selection and segmentation, active learning and the use of crowdsourced annotators to generate a large, balanced training set for privacy policies at low cost.
Calpric's training process also generates a labeled data set of 16K privacy policy text segments across 9 Data categories with balanced positive and negative samples.
- Score: 5.279873919047532
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A significant challenge to training accurate deep learning models on privacy
policies is the cost and difficulty of obtaining a large and comprehensive set
of training data. To address these challenges, we present Calpric, which
combines automatic text selection and segmentation, active learning and the use
of crowdsourced annotators to generate a large, balanced training set for
privacy policies at low cost. Automated text selection and segmentation
simplifies the labeling task, enabling untrained annotators from crowdsourcing
platforms, like Amazon's Mechanical Turk, to be competitive with trained
annotators, such as law students, and also reduces inter-annotator disagreement,
which decreases labeling cost. Having reliable labels for training enables the
use of active learning, which uses fewer training samples to efficiently cover
the input space, further reducing cost and improving class and data category
balance in the data set. The combination of these techniques allows Calpric to
produce models that are accurate over a wider range of data categories, and
provide more detailed, fine-grain labels than previous work. Our crowdsourcing
process enables Calpric to attain reliable labeled data at a cost of roughly
$0.92-$1.71 per labeled text segment. Calpric's training process also
generates a labeled data set of 16K privacy policy text segments across 9 Data
categories with balanced positive and negative samples.
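
The abstract describes the pipeline in prose only; the following is a minimal, hypothetical sketch of how an uncertainty-driven labeling loop with crowdsourced majority voting could be wired together. The segmentation rule, the majority-vote aggregation, and all names (segment_policy, query_crowd, batch_size) are illustrative assumptions rather than Calpric's actual implementation.

```python
# Hypothetical sketch of a Calpric-style loop: segment policies, train on the
# current labels, and send only the most uncertain segments to crowd annotators.
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def segment_policy(policy_text, max_sentences=3):
    """Naive segmentation: group sentences into short, independently labelable segments."""
    sentences = [s.strip() for s in policy_text.split(".") if s.strip()]
    return [". ".join(sentences[i:i + max_sentences])
            for i in range(0, len(sentences), max_sentences)]

def majority_vote(votes):
    """Aggregate crowd votes (e.g., three Mechanical Turk annotators) into one label."""
    return Counter(votes).most_common(1)[0][0]

def active_learning_round(labeled, unlabeled, query_crowd, batch_size=20):
    """Train on current labels, then query the crowd on the most uncertain segments."""
    texts, labels = zip(*labeled)
    vectorizer = TfidfVectorizer(max_features=5000)
    clf = LogisticRegression(max_iter=1000).fit(vectorizer.fit_transform(texts), labels)

    probs = clf.predict_proba(vectorizer.transform(unlabeled))
    uncertainty = 1.0 - probs.max(axis=1)              # least-confidence sampling
    query_idx = np.argsort(-uncertainty)[:batch_size]

    newly_labeled = [(unlabeled[i], majority_vote(query_crowd(unlabeled[i])))
                     for i in query_idx]
    return labeled + newly_labeled, clf
```

Per the abstract, Calpric's querying also improves class and data-category balance; the sketch only shows the basic select-query-aggregate cycle.
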
Related papers
- Soft Curriculum for Learning Conditional GANs with Noisy-Labeled and
Uncurated Unlabeled Data [70.25049762295193]
We introduce a novel conditional image generation framework that accepts noisy-labeled and uncurated data during training.
We propose soft curriculum learning, which assigns instance-wise weights for adversarial training and new labels to unlabeled data.
Our experiments show that our approach outperforms existing semi-supervised and label-noise robust methods in terms of both quantitative and qualitative performance.
arXiv Detail & Related papers (2023-07-17T08:31:59Z)
- SoftMatch: Addressing the Quantity-Quality Trade-off in Semi-supervised Learning [101.86916775218403]
This paper revisits the popular pseudo-labeling methods via a unified sample weighting formulation.
We propose SoftMatch to overcome the trade-off by maintaining both high quantity and high quality of pseudo-labels during training.
In experiments, SoftMatch shows substantial improvements across a wide variety of benchmarks, including image, text, and imbalanced classification.
arXiv Detail & Related papers (2023-01-26T03:53:25Z)
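
The SoftMatch entry above hinges on a unified sample-weighting formulation that keeps both the quantity and quality of pseudo-labels high. The sketch below illustrates the general idea with a truncated-Gaussian weight over confidence and an exponential moving average of the confidence statistics; the exact weighting function, momentum, and variance handling in the paper differ, so treat the numbers here as placeholders.

```python
# Illustrative soft weighting of pseudo-labels: instead of a hard confidence
# threshold, each pseudo-label gets a weight from a truncated Gaussian over its
# confidence, so low-confidence samples still contribute with smaller weight.
import numpy as np

def soft_weights(confidences, mean_conf, var_conf):
    """Weight = 1 above the running mean confidence, Gaussian-decayed below it."""
    conf = np.asarray(confidences, dtype=float)
    gauss = np.exp(-((conf - mean_conf) ** 2) / (2.0 * var_conf))
    return np.where(conf >= mean_conf, 1.0, gauss)

def update_confidence_stats(confidences, mean_conf, var_conf, momentum=0.999):
    """Track the confidence distribution of unlabeled batches with an EMA."""
    conf = np.asarray(confidences, dtype=float)
    new_mean = momentum * mean_conf + (1 - momentum) * conf.mean()
    new_var = momentum * var_conf + (1 - momentum) * conf.var()
    return new_mean, new_var

# Low-confidence pseudo-labels still contribute, just with smaller weight.
print(soft_weights([0.55, 0.80, 0.95, 0.99], mean_conf=0.90, var_conf=0.01))
```
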
- Eliciting and Learning with Soft Labels from Every Annotator [31.10635260890126]
We focus on efficiently eliciting soft labels from individual annotators.
We demonstrate that learning with our labels achieves comparable model performance to prior approaches.
arXiv Detail & Related papers (2022-07-02T12:03:00Z)
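
The entry above concerns eliciting soft labels from individual annotators and training on them. The sketch below assumes a simple scheme: each annotator reports a class distribution, the reports are averaged into a soft target, and the model minimizes cross-entropy against that target. The paper's elicitation interface and aggregation are richer than this.

```python
# Simplified soft-label training target: average per-annotator distributions,
# then train against the averaged (non one-hot) target.
import numpy as np

def aggregate_soft_labels(annotator_dists):
    """Average per-annotator class distributions into one soft label."""
    dists = np.asarray(annotator_dists, dtype=float)
    dists = dists / dists.sum(axis=1, keepdims=True)   # renormalize each report
    return dists.mean(axis=0)

def soft_cross_entropy(logits, soft_target):
    """Cross-entropy of model logits against a soft target."""
    z = np.asarray(logits, dtype=float)
    log_probs = z - (np.log(np.sum(np.exp(z - z.max()))) + z.max())  # log-softmax
    return -float(np.dot(soft_target, log_probs))

# Three annotators, two classes: the averaged report becomes the training target.
reports = [[0.7, 0.3], [0.9, 0.1], [0.5, 0.5]]
target = aggregate_soft_labels(reports)                # -> [0.7, 0.3]
print(target, soft_cross_entropy([2.0, 0.0], target))
```
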
- Debiased Pseudo Labeling in Self-Training [77.83549261035277]
Deep neural networks achieve remarkable performances on a wide range of tasks with the aid of large-scale labeled datasets.
To mitigate the requirement for labeled data, self-training is widely used in both academia and industry by pseudo labeling on readily-available unlabeled data.
We propose Debiased, in which the generation and utilization of pseudo labels are decoupled by two independent heads.
arXiv Detail & Related papers (2022-02-15T02:14:33Z)
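
The entry above names the central mechanism: two independent heads that decouple pseudo-label generation from utilization. The PyTorch-style sketch below shows one plausible way to realize that decoupling, with an assumed shared encoder, head names, and confidence threshold; it is not the paper's exact objective.

```python
# Sketch of decoupled heads: the pseudo-labeling head is trained on labeled
# data only, while the main head (and the encoder) also learn from pseudo-labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadClassifier(nn.Module):
    def __init__(self, encoder, feat_dim, num_classes):
        super().__init__()
        self.encoder = encoder
        self.head_pseudo = nn.Linear(feat_dim, num_classes)  # generates pseudo labels
        self.head_main = nn.Linear(feat_dim, num_classes)    # consumes pseudo labels

    def forward(self, x):
        feats = self.encoder(x)
        return self.head_pseudo(feats), self.head_main(feats)

def debiased_step(model, x_lab, y_lab, x_unlab, threshold=0.95):
    logits_p_lab, logits_m_lab = model(x_lab)
    loss = F.cross_entropy(logits_p_lab, y_lab) + F.cross_entropy(logits_m_lab, y_lab)

    logits_p_u, logits_m_u = model(x_unlab)
    with torch.no_grad():                      # pseudo labels come from head_pseudo only
        conf, pseudo = F.softmax(logits_p_u, dim=-1).max(dim=-1)
        mask = conf.ge(threshold).float()
    # only head_main learns from pseudo-labeled data
    loss = loss + (F.cross_entropy(logits_m_u, pseudo, reduction="none") * mask).mean()
    return loss
```
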
- Pseudo-Labeled Auto-Curriculum Learning for Semi-Supervised Keypoint Localization [88.74813798138466]
Localizing keypoints of an object is a basic visual problem.
Supervised learning of a keypoint localization network often requires a large amount of data.
We propose to automatically select reliable pseudo-labeled samples with a series of dynamic thresholds.
arXiv Detail & Related papers (2022-01-21T09:51:58Z)
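
The keypoint entry above selects reliable pseudo-labeled samples with a series of dynamic thresholds. The sketch below shows the shape of such a scheme using an assumed linear schedule; how the paper actually derives its thresholds is not reproduced here.

```python
# Illustrative dynamic thresholding: the confidence cutoff changes per training
# round, admitting more pseudo-labeled samples as the model improves.
import numpy as np

def dynamic_threshold(round_idx, start=0.95, end=0.80, total_rounds=10):
    """Linearly relax the confidence threshold over training rounds (assumed schedule)."""
    frac = min(round_idx / max(total_rounds - 1, 1), 1.0)
    return start + frac * (end - start)

def select_reliable(confidences, pseudo_labels, round_idx):
    """Keep only pseudo-labeled samples above the current round's threshold."""
    tau = dynamic_threshold(round_idx)
    keep = np.asarray(confidences) >= tau
    return [(p, c) for p, c, k in zip(pseudo_labels, confidences, keep) if k]

print(dynamic_threshold(0), dynamic_threshold(9))   # 0.95 -> 0.80
```
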
- Labels, Information, and Computation: Efficient, Privacy-Preserving Learning Using Sufficient Labels [0.0]
We show that we do not always need full label information on every single training example; a coarser summary of the labels can suffice.
We call this statistic "sufficiently-labeled data" and prove its sufficiency and efficiency.
Sufficiently-labeled data naturally preserves user privacy by storing relative, instead of absolute, information.
arXiv Detail & Related papers (2021-04-19T02:15:25Z)
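
The entry above says sufficiently-labeled data stores relative rather than absolute information. Under the simplifying assumption that "relative" means pairwise same-class/different-class indicators, the sketch below shows how such data could be constructed; the paper's formal definition is more precise.

```python
# Hypothetical illustration of relative label information: store only whether
# pairs of examples share a class, never the class identities themselves.
import itertools
import random

def to_relative_labels(examples, labels, num_pairs=1000, seed=0):
    """Return (i, j, same_class) triples; absolute labels are discarded."""
    rng = random.Random(seed)
    pairs = list(itertools.combinations(range(len(examples)), 2))
    rng.shuffle(pairs)
    return [(i, j, labels[i] == labels[j]) for i, j in pairs[:num_pairs]]

# The stored triples reveal class structure but not which class is which.
data = ["seg-a", "seg-b", "seg-c", "seg-d"]
labels = ["contact", "location", "contact", "location"]
print(to_relative_labels(data, labels, num_pairs=4))
```
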
- Self-Tuning for Data-Efficient Deep Learning [75.34320911480008]
Self-Tuning is a novel approach to enable data-efficient deep learning.
It unifies the exploration of labeled and unlabeled data and the transfer of a pre-trained model.
It outperforms its SSL and TL counterparts on five tasks by sharp margins.
arXiv Detail & Related papers (2021-02-25T14:56:19Z)
- Active Learning for Noisy Data Streams Using Weak and Strong Labelers [3.9370369973510746]
We consider a novel weak and strong labeler problem inspired by humans' natural ability for labeling.
We propose an on-line active learning algorithm that consists of four steps: filtering, adding diversity, informative sample selection, and labeler selection.
We derive a decision function that measures the information gain by combining the informativeness of individual samples and model confidence.
arXiv Detail & Related papers (2020-10-27T09:18:35Z)
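
The entry above describes a decision function that combines sample informativeness with model confidence to drive labeler selection. The sketch below is a simplified stand-in built on normalized prediction entropy with fixed cutoffs; the paper's actual information-gain measure and its filtering and diversity steps are not reproduced.

```python
# Simplified labeler-selection rule: skip easy samples, send moderately
# informative ones to a cheap weak labeler, and pay a strong labeler for the rest.
import numpy as np

def entropy(probs):
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def choose_labeler(probs, skip_below=0.3, strong_above=0.8):
    """Informativeness score in [0, 1] decides: skip, weak labeler, or strong labeler."""
    score = entropy(probs) / np.log(len(probs))     # normalized prediction entropy
    if score < skip_below:
        return "skip"
    return "strong" if score > strong_above else "weak"

print(choose_labeler([0.97, 0.02, 0.01]))   # confident -> skip
print(choose_labeler([0.34, 0.33, 0.33]))   # maximally uncertain -> strong
```
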
- Deep Active Learning with Crowdsourcing Data for Privacy Policy Classification [6.5443502434659955]
Active learning and crowdsourcing techniques are used to develop an automated classification tool named Calpric.
Calpric is able to produce annotations equivalent to those done by skilled human annotators with high accuracy while minimizing the labeling cost.
Our model is able to achieve the same F1 score using only 62% of the original labeling effort.
arXiv Detail & Related papers (2020-08-07T02:13:31Z)
- Uncertainty-aware Self-training for Text Classification with Few Labels [54.13279574908808]
We study self-training as one of the earliest semi-supervised learning approaches to reduce the annotation bottleneck.
We propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network.
We show that our methods, leveraging only 20-30 labeled samples per class per task for training and validation, can perform within 3% of fully supervised pre-trained language models.
arXiv Detail & Related papers (2020-06-27T08:13:58Z)
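
The entry above incorporates uncertainty estimates of the underlying network into self-training. The sketch below assumes the estimates come from several stochastic forward passes (for example, with dropout left active) and keeps only pseudo-labels whose predicted class probability is stable across passes; the paper's acquisition and sample re-weighting are more involved.

```python
# Simplified uncertainty-aware pseudo-label selection from T stochastic passes:
# keep a pseudo-label only if its class probability has low variance across passes.
import numpy as np

def select_by_uncertainty(mc_probs, var_threshold=0.01):
    """mc_probs: array of shape (T, N, C) from T stochastic forward passes."""
    mc_probs = np.asarray(mc_probs, dtype=float)
    mean_probs = mc_probs.mean(axis=0)                  # (N, C) predictive mean
    pseudo = mean_probs.argmax(axis=1)                  # (N,) pseudo-labels
    # variance, across passes, of the probability assigned to the chosen class
    chosen = mc_probs[:, np.arange(mc_probs.shape[1]), pseudo]
    keep = chosen.var(axis=0) <= var_threshold
    return pseudo, keep

T_passes = np.array([
    [[0.90, 0.10], [0.60, 0.40]],
    [[0.88, 0.12], [0.30, 0.70]],
    [[0.92, 0.08], [0.55, 0.45]],
])
print(select_by_uncertainty(T_passes))   # sample 0 kept, sample 1 rejected
```
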
- Minimum Cost Active Labeling [2.0754848504005587]
Min-cost labeling uses a variant of active learning to learn a model to predict the optimal training set size.
In some cases, our approach has 6X lower overall cost relative to human labeling, and is always cheaper than the cheapest active learning strategy.
arXiv Detail & Related papers (2020-06-24T19:01:05Z)