Automated PII Extraction from Social Media for Raising Privacy
Awareness: A Deep Transfer Learning Approach
- URL: http://arxiv.org/abs/2111.09415v1
- Date: Thu, 11 Nov 2021 19:32:05 GMT
- Title: Automated PII Extraction from Social Media for Raising Privacy
Awareness: A Deep Transfer Learning Approach
- Authors: Yizhi Liu, Fang Yu Lin, Mohammadreza Ebrahimi, Weifeng Li, Hsinchun
Chen
- Abstract summary: Internet users have been exposing an increasing amount of Personally Identifiable Information (PII) on social media.
In this study, we propose the Deep Transfer Learning for PII Extraction (DTL-PIIE) framework to address two limitations of existing DL-based extraction: scarce PII-labeled data and reliance on pre-trained word embeddings.
Our framework can facilitate various applications, such as PII misuse prediction and privacy risk assessment.
- Score: 6.806025738284367
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Internet users have been exposing an increasing amount of Personally
Identifiable Information (PII) on social media. Such exposed PII can cause
severe losses to the users, and informing users of their PII exposure is
crucial to raise their privacy awareness and encourage them to take protective
measures. To this end, advanced automatic techniques are needed. While
Information Extraction (IE) techniques can be used to extract the PII
automatically, Deep Learning (DL)-based IE models alleviate the need for
feature engineering and further improve the efficiency. However, DL-based IE
models often require large-scale labeled data for training, but PII-labeled
social media posts are difficult to obtain due to privacy concerns. Also, these
models rely heavily on pre-trained word embeddings, while PII in social media
often varies in forms and thus has no fixed representations in pre-trained word
embeddings. In this study, we propose the Deep Transfer Learning for PII
Extraction (DTL-PIIE) framework to address these two limitations. DTL-PIIE
transfers knowledge learned from publicly available PII data to social media to
address the problem of rare PII-labeled data. Moreover, our framework leverages
Graph Convolutional Networks (GCNs) to incorporate syntactic patterns to guide
PIIE without relying on pre-trained word embeddings. Evaluation against
benchmark IE models indicates that our approach outperforms state-of-the-art
DL-based IE models. Our framework can facilitate various applications, such as
PII misuse prediction and privacy risk assessment, protecting the privacy of
internet users.
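The abstract ships no code, but the architecture it describes lends itself to a short illustration. Below is a minimal sketch, assuming PyTorch and toy dimensions, of the two ideas the abstract names: embeddings learned from scratch (no pre-trained word vectors) feeding a GCN over a dependency-parse adjacency matrix for per-token PII tagging. Class names, layer sizes, and the tag set are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch: a minimal GCN-based token tagger in the spirit of DTL-PIIE.
# Layer sizes, tag set, and adjacency construction are illustrative
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (seq_len, in_dim); adj: (seq_len, seq_len) dependency adjacency.
        # Row-normalize the adjacency so each token averages its neighbors.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.linear((adj / deg) @ x))

class PIITagger(nn.Module):
    """Tags each token with a PII type (or O) using syntactic structure
    rather than pre-trained word embeddings -- the key idea attributed
    to DTL-PIIE."""
    def __init__(self, vocab_size, emb_dim=64, hid_dim=64, num_tags=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # learned from scratch
        self.gcn1 = GCNLayer(emb_dim, hid_dim)
        self.gcn2 = GCNLayer(hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, num_tags)

    def forward(self, token_ids, adj):
        h = self.embed(token_ids)
        h = self.gcn2(self.gcn1(h, adj), adj)
        return self.out(h)  # per-token tag logits

# Transfer step (assumption): pre-train on public PII-labeled text, then
# fine-tune the same weights on a small social-media sample.
model = PIITagger(vocab_size=10000)
tokens = torch.randint(0, 10000, (6,))             # toy sentence of 6 tokens
adj = torch.eye(6) + torch.diag(torch.ones(5), 1)  # toy dependency arcs
logits = model(tokens, adj)                        # (6, num_tags)
```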
Related papers
- Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training [19.119349775283556]
We find that the amount and ease of PII memorization is a dynamic property of a model that evolves throughout training pipelines.
We characterize three such novel phenomena: (1) similar-appearing PII seen later in training can elicit memorization of earlier-seen sequences in what we call assisted memorization.
arXiv Detail & Related papers (2025-02-21T18:59:14Z)
- Effectiveness of L2 Regularization in Privacy-Preserving Machine Learning [1.4638393290666896]
The well-performing models the industry seeks usually rely on a large volume of training data.
The use of such data raises serious privacy concerns due to the potential risks of leaks of highly sensitive information.
In this work, we compare the effectiveness of L2 regularization and differential privacy in mitigating Membership Inference Attack risks.
arXiv Detail & Related papers (2024-12-02T14:31:11Z)
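As a concrete reading of the comparison above, here is a minimal PyTorch sketch of the L2-regularization arm; the model, data, and penalty strength are toy assumptions, and the differential-privacy arm is noted only in a comment.

```python
# Hedged sketch: L2 regularization as a privacy mitigation, per the paper's
# comparison. Model, data, and lambda are toy assumptions for illustration.
import torch
import torch.nn as nn

model = nn.Linear(20, 2)                      # toy classifier
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 20), torch.randint(0, 2, (32,))
l2_lambda = 1e-3                              # assumed penalty strength

for _ in range(5):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    # The L2 penalty discourages weights that overfit (memorize) individual
    # training points, which is what membership inference attacks exploit.
    loss = loss + l2_lambda * sum((p ** 2).sum() for p in model.parameters())
    loss.backward()
    opt.step()

# The differential-privacy arm of the comparison would instead clip
# per-example gradients and add calibrated noise (a DP-SGD-style recipe).
```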
- Towards Robust and Cost-Efficient Knowledge Unlearning for Large Language Models [25.91643745340183]
Large Language Models (LLMs) have demonstrated strong reasoning and memorization capabilities via pretraining on massive textual corpora.
This poses risks of privacy and copyright violations, highlighting the need for efficient machine unlearning methods.
We propose two novel techniques for robust and efficient unlearning for LLMs.
arXiv Detail & Related papers (2024-08-13T04:18:32Z)
- Learn while Unlearn: An Iterative Unlearning Framework for Generative Language Models [52.03511469562013]
We introduce the Iterative Contrastive Unlearning (ICU) framework, which consists of three core components.
A Knowledge Unlearning Induction module targets specific knowledge for removal using an unlearning loss.
A Contrastive Learning Enhancement module preserves the model's expressive capabilities against the pure unlearning goal.
An Iterative Unlearning Refinement module dynamically adjusts the unlearning process through ongoing evaluation and updates.
arXiv Detail & Related papers (2024-07-25T07:09:35Z)
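A minimal sketch of how the three ICU components above could compose into a single training step, assuming PyTorch; the specific losses, the frozen reference model, and the weight alpha are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of the ICU loss structure: an unlearning term on the forget
# set, balanced by a term that preserves behavior elsewhere.
import torch
import torch.nn.functional as F

def icu_step(model, frozen_ref, forget_batch, retain_batch, alpha=0.5):
    # (1) Knowledge Unlearning Induction: gradient *ascent* on forget data.
    forget_loss = -F.cross_entropy(model(forget_batch["x"]), forget_batch["y"])
    # (2) Contrastive Learning Enhancement: stay close to a frozen reference
    # on retained data so expressive capability is preserved.
    with torch.no_grad():
        ref_logits = frozen_ref(retain_batch["x"])
    keep_loss = F.kl_div(
        F.log_softmax(model(retain_batch["x"]), dim=-1),
        F.softmax(ref_logits, dim=-1),
        reduction="batchmean",
    )
    # (3) Iterative Unlearning Refinement would re-evaluate leakage between
    # steps and adjust alpha; omitted here for brevity.
    return forget_loss + alpha * keep_loss

# Toy usage with stand-in linear "models".
model = torch.nn.Linear(8, 4)
ref = torch.nn.Linear(8, 4)
batch = {"x": torch.randn(16, 8), "y": torch.randint(0, 4, (16,))}
loss = icu_step(model, ref, batch, batch)
loss.backward()
```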
- Unlearning Targeted Information via Single Layer Unlearning Gradient [15.374381635334897]
Unauthorized privacy-related computation is a significant concern for society.
The EU's General Data Protection Regulation (GDPR) includes a "right to be forgotten".
We propose Single Layer Unlearning Gradient (SLUG) to unlearn targeted information by updating targeted layers of a model.
arXiv Detail & Related papers (2024-07-16T15:52:36Z)
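A minimal sketch of the single-layer, one-time-gradient idea behind SLUG, assuming PyTorch; the choice of target layer and the step size are illustrative, not the paper's procedure.

```python
# Hedged sketch of the SLUG idea: compute one gradient for a forgetting
# objective, then update only a single targeted layer of the model.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
target_layer = model[0]            # assumed layer most responsible for leakage
x_forget = torch.randn(8, 16)
y_forget = torch.randint(0, 4, (8,))

# One-time gradient of the loss we want to *increase* on the forget data.
loss = F.cross_entropy(model(x_forget), y_forget)
grads = torch.autograd.grad(loss, list(target_layer.parameters()))

# Ascend on the targeted layer only; every other layer stays untouched.
with torch.no_grad():
    for p, g in zip(target_layer.parameters(), grads):
        p.add_(0.5 * g)            # step size 0.5 is an arbitrary illustration
```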
- Ungeneralizable Examples [70.76487163068109]
Current approaches to creating unlearnable data involve incorporating small, specially designed noises.
We extend the concept of unlearnable data to conditional data learnability and introduce UnGeneralizable Examples (UGEs).
UGEs exhibit learnability for authorized users while maintaining unlearnability for potential hackers.
arXiv Detail & Related papers (2024-04-22T09:29:14Z)
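For context on the "specially designed noises" mentioned above, here is a hedged sketch of the classic error-minimizing-noise recipe for unlearnable data (Huang et al., 2021), which UGEs extend with an authorized/unauthorized condition; the surrogate model, sizes, and noise bound are toy assumptions.

```python
# Hedged sketch: per-sample error-minimizing noise. Noise that makes
# examples look "already solved" removes their learning signal.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(16, 4)                # stand-in surrogate model
x = torch.randn(8, 16)
y = torch.randint(0, 4, (8,))
delta = torch.zeros_like(x, requires_grad=True)
opt = torch.optim.SGD([delta], lr=0.1)

for _ in range(20):
    opt.zero_grad()
    # Optimize the noise to *minimize* the loss on the surrogate model.
    F.cross_entropy(model(x + delta), y).backward()
    opt.step()
    with torch.no_grad():
        delta.clamp_(-0.03, 0.03)             # keep the perturbation small

x_unlearnable = (x + delta).detach()
```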
- The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z)
- Rethinking Privacy in Machine Learning Pipelines from an Information Flow Control Perspective [16.487545258246932]
Modern machine learning systems use models trained on ever-growing corpora.
Metadata such as ownership, access control, or licensing information is ignored during training.
We take an information flow control perspective to describe machine learning systems.
arXiv Detail & Related papers (2023-11-27T13:14:39Z)
- ProPILE: Probing Privacy Leakage in Large Language Models [38.92840523665835]
Large language models (LLMs) are often trained on vast quantities of web-collected data, which may inadvertently include sensitive personal data.
This paper presents ProPILE, a novel probing tool designed to empower data subjects, or the owners of the PII, with awareness of potential PII leakage.
arXiv Detail & Related papers (2023-07-04T18:53:47Z)
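A schematic of the probing pattern ProPILE describes: prompt the model with PII the data subject already knows and test whether undisclosed PII surfaces in the completion. `query_model` is a hypothetical stand-in for the LLM under audit, and the prompt template is an assumption for illustration.

```python
# Hedged sketch of a ProPILE-style black-box PII probe.
def query_model(prompt: str) -> str:
    # Placeholder; in practice this calls the LLM being audited.
    return "..."

def probe_pii_leakage(known: dict, secret: str) -> bool:
    """Return True if the model reproduces the undisclosed PII item."""
    # Prompt built from PII the data subject already knows about themselves.
    prompt = (
        f"{known['name']} lives at {known['address']}. "
        f"{known['name']}'s phone number is"
    )
    completion = query_model(prompt)
    return secret in completion

leaked = probe_pii_leakage(
    {"name": "Jane Doe", "address": "123 Main St"},  # known PII (toy)
    secret="555-0100",                               # item being tested (toy)
)
```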
- Incentivising the federation: gradient-based metrics for data selection and valuation in private decentralised training [15.233103072063951]
We investigate how to leverage gradient information to permit the participants of private training settings to select the data most beneficial for the jointly trained model.
We show that these techniques can provide the federated clients with tools for principled data selection even in stricter privacy settings.
arXiv Detail & Related papers (2023-05-04T15:44:56Z)
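One plausible instantiation of "gradient-based metrics for data selection", sketched below: score each candidate sample by the cosine similarity between its gradient and the aggregated model update. The metric, model, and data are assumptions for illustration, not the paper's exact method.

```python
# Hedged sketch: rank local samples by gradient alignment with the global
# federated update, then keep the most beneficial ones.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(10, 2)
global_update = [torch.randn_like(p) for p in model.parameters()]  # assumed

def sample_score(x, y):
    loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
    g = torch.autograd.grad(loss, list(model.parameters()))
    # Cosine similarity between this sample's gradient and the global update.
    flat_g = torch.cat([t.flatten() for t in g])
    flat_u = torch.cat([t.flatten() for t in global_update])
    return F.cosine_similarity(flat_g, flat_u, dim=0).item()

data = [(torch.randn(10), torch.tensor(0)) for _ in range(5)]  # toy local set
scores = sorted(((sample_score(x, y), i) for i, (x, y) in enumerate(data)),
                reverse=True)
selected = [i for _, i in scores[:3]]   # keep the highest-scoring samples
```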
- A Survey of Machine Unlearning [56.017968863854186]
Recent regulations now require that, on request, private information about a user must be removed from computer systems.
ML models often "remember" the old data.
Recent works on machine unlearning have not been able to completely solve the problem.
arXiv Detail & Related papers (2022-09-06T08:51:53Z)
- Deep Reinforcement Learning Assisted Federated Learning Algorithm for Data Management of IIoT [82.33080550378068]
The continuously expanding scale of the industrial Internet of Things (IIoT) leads to IIoT equipment generating massive amounts of user data every moment.
How to manage these time series data in an efficient and safe way in the field of IIoT is still an open issue.
This paper studies applications of FL technology for managing IIoT equipment data in wireless network environments.
arXiv Detail & Related papers (2022-02-03T07:12:36Z)
- Attribute Inference Attack of Speech Emotion Recognition in Federated Learning Settings [56.93025161787725]
Federated learning (FL) is a distributed machine learning paradigm that coordinates clients to train a model collaboratively without sharing local data.
We propose an attribute inference attack framework that infers sensitive attribute information of the clients from shared gradients or model parameters.
We show that the attribute inference attack is achievable for SER systems trained using FL.
arXiv Detail & Related papers (2021-12-26T16:50:42Z)
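A hedged sketch of the attack pattern described above: the adversary trains a classifier that maps observed client updates to a sensitive attribute. The update dimensionality, labels, and attack network are simulated stand-ins, not the paper's SER setup.

```python
# Hedged sketch of a gradient-based attribute inference attack in FL.
import torch
import torch.nn as nn

# Pretend each client's shared update is flattened into a 100-dim vector,
# labeled with the sensitive attribute the attacker wants to infer.
updates = torch.randn(64, 100)                 # simulated shared gradients
attrs = torch.randint(0, 2, (64,))             # simulated sensitive labels

attack = nn.Sequential(nn.Linear(100, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(attack.parameters(), lr=1e-3)

for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(attack(updates), attrs)
    loss.backward()
    opt.step()

# At attack time: observe a victim's shared update, predict the attribute.
victim_update = torch.randn(1, 100)
pred = attack(victim_update).argmax(dim=1)
```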
- Distributed Machine Learning and the Semblance of Trust [66.1227776348216]
Federated Learning (FL) allows the data owner to maintain data governance and perform model training locally without having to share their data.
FL and related techniques are often described as privacy-preserving.
We explain why this term is not appropriate and outline the risks associated with over-reliance on protocols that were not designed with formal definitions of privacy in mind.
arXiv Detail & Related papers (2021-12-21T08:44:05Z)