Current State in Privacy-Preserving Text Preprocessing for Domain-Agnostic NLP
- URL: http://arxiv.org/abs/2508.03204v1
- Date: Tue, 05 Aug 2025 08:26:45 GMT
- Title: Current State in Privacy-Preserving Text Preprocessing for Domain-Agnostic NLP
- Authors: Abhirup Sinha, Pritilata Saha, Tithi Saha
- Abstract summary: Modern large language models require a huge amount of data to learn linguistic variations. It is possible to extract private information from such language models. This report focuses on a few approaches for domain-agnostic NLP tasks.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Privacy is a fundamental human right. Data privacy is protected by different regulations, such as GDPR. However, modern large language models require a huge amount of data to learn linguistic variations, and the data often contains private information. Research has shown that it is possible to extract private information from such language models. Thus, anonymizing such private and sensitive information is of utmost importance. While complete anonymization may not be possible, a number of different pre-processing approaches exist for masking or pseudonymizing private information in textual data. This report focuses on a few of such approaches for domain-agnostic NLP tasks.
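As a concrete illustration of the masking and pseudonymization approaches the report surveys, the sketch below applies both to a toy string. The regex patterns and the placeholder scheme are illustrative assumptions, not taken from the report; production systems typically rely on trained NER models rather than regular expressions.

```python
import hashlib
import re

# Illustrative PII patterns only (assumption, not from the report).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

def mask(text: str) -> str:
    """Masking: replace each match with a fixed category placeholder."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

def pseudonymize(text: str) -> str:
    """Pseudonymization: replace each match with a stable surrogate token,
    so repeated mentions of the same value remain linkable."""
    def surrogate(m: re.Match) -> str:
        digest = hashlib.sha256(m.group().encode()).hexdigest()[:8]
        return f"<PII_{digest}>"
    return EMAIL.sub(surrogate, PHONE.sub(surrogate, text))

print(mask("Contact Jane at jane@example.com or +1 555 123 4567."))
print(pseudonymize("Contact Jane at jane@example.com or +1 555 123 4567."))
```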
Related papers
- Token-Level Privacy in Large Language Models [7.4143291213663955]
We introduce dchi-stencil, a novel token-level privacy-preserving mechanism that integrates contextual and semantic information. By incorporating both semantic and contextual nuances, dchi-stencil achieves a robust balance between privacy and utility. This work highlights the potential of dchi-stencil to set a new standard for privacy-preserving NLP in modern, high-risk applications.
arXiv Detail & Related papers (2025-03-05T16:27:25Z)
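The summary above does not spell out the dchi-stencil algorithm, so the following is only a generic sketch of a token-level mechanism in the same metric-DP family: perturb a token's embedding with noise and decode the nearest vocabulary item. The toy vocabulary, embedding table, and noise calibration are assumptions, and dchi-stencil additionally conditions on context, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embedding table; a real system would use pretrained word vectors.
vocab = ["alice", "bob", "doctor", "nurse", "paris", "london"]
emb = rng.normal(size=(len(vocab), 16))

def private_token(token: str, epsilon: float) -> str:
    """Perturb the token's embedding and decode to the nearest neighbour.
    The per-coordinate Laplace noise with scale 1/epsilon is a
    simplification; metric-DP mechanisms calibrate noise to the metric."""
    v = emb[vocab.index(token)]
    noisy = v + rng.laplace(scale=1.0 / epsilon, size=v.shape)
    dists = np.linalg.norm(emb - noisy, axis=1)
    return vocab[int(np.argmin(dists))]

print([private_token(t, epsilon=2.0) for t in ["alice", "paris"]])
```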
- PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z)
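To make "instruction tuning with both positive and negative examples" concrete, here is a minimal sketch of what one training record might look like. The JSONL schema and field names are hypothetical, not PrivacyMind's actual data format.

```python
import json

# Hypothetical schema: each record pairs an instruction with a desired
# (privacy-preserving) response and an undesired (leaking) response that
# serves as a negative example during tuning.
records = [
    {
        "instruction": "Summarize: Patient John Doe, 52, was admitted...",
        "positive": "A 52-year-old patient was admitted...",
        "negative": "John Doe, 52, was admitted...",
    }
]
with open("privacy_tuning.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")
```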
- PLUE: Language Understanding Evaluation Benchmark for Privacy Policies in English [77.79102359580702]
We introduce the Privacy Policy Language Understanding Evaluation benchmark, a multi-task benchmark for evaluating privacy policy language understanding.
We also collect a large corpus of privacy policies to enable privacy policy domain-specific language model pre-training.
We demonstrate that domain-specific continual pre-training offers performance improvements across all tasks.
arXiv Detail & Related papers (2022-12-20T05:58:32Z)
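A minimal sketch of domain-specific continual pre-training with the Hugging Face Trainer, assuming a masked-language-modeling objective over an in-domain corpus; the two-sentence corpus and the hyperparameters below stand in for the PLUE policy corpus and the paper's actual setup.

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Continue MLM pre-training of a general-domain checkpoint on in-domain
# text (privacy policies here).
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

corpus = ["We collect your email address to provide the service.",
          "Data may be shared with third-party advertisers."]
dataset = [tok(t, truncation=True, max_length=128) for t in corpus]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="policy-bert", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
)
trainer.train()
```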
- Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining [75.25943383604266]
We question whether the use of large Web-scraped datasets should be viewed as differential-privacy-preserving.
We caution that publicizing these models pretrained on Web data as "private" could lead to harm and erode the public's trust in differential privacy as a meaningful definition of privacy.
We conclude by discussing potential paths forward for the field of private learning, as public pretraining becomes more popular and powerful.
arXiv Detail & Related papers (2022-12-13T10:41:12Z)
- How Do Input Attributes Impact the Privacy Loss in Differential Privacy? [55.492422758737575]
We study the connection between the per-subject norm in DP neural networks and individual privacy loss.
We introduce a novel metric termed the Privacy Loss-Input Susceptibility (PLIS) which allows one to apportion the subject's privacy loss to their input attributes.
arXiv Detail & Related papers (2022-11-18T11:39:03Z)
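The summary does not give the PLIS formula, so the sketch below only illustrates the underlying connection it builds on: in DP-SGD, a subject's per-sample gradient norm relative to the clipping bound governs how strongly their data influences the noised update. The norm ratio and the attribute-level share computed here are crude illustrative proxies, not the PLIS metric itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy logistic model; per-sample gradient of the loss w.r.t. the weights.
X = rng.normal(size=(4, 3))          # 4 subjects, 3 input attributes
y = np.array([0, 1, 0, 1])
w = rng.normal(size=3)

p = 1.0 / (1.0 + np.exp(-X @ w))
per_sample_grads = (p - y)[:, None] * X      # shape (4, 3)

C = 1.0                                       # DP-SGD clipping bound
norms = np.linalg.norm(per_sample_grads, axis=1)
# Crude proxy: a subject whose clipped gradient sits at the bound C
# contributes at full strength and so incurs the maximal privacy loss.
relative_loss = np.minimum(norms, C) / C
# Attribute-level attribution (illustrative): share of the squared
# gradient norm contributed by each input attribute.
attr_share = per_sample_grads**2 / (norms[:, None]**2)
print(relative_loss, attr_share.round(2), sep="\n")
```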
- Algorithms with More Granular Differential Privacy Guarantees [65.3684804101664]
We consider partial differential privacy (DP), which allows quantifying the privacy guarantee on a per-attribute basis.
In this work, we study several basic data analysis and learning tasks, and design algorithms whose per-attribute privacy parameter is smaller than the best possible privacy parameter for the entire record of a person.
arXiv Detail & Related papers (2022-09-08T22:43:50Z)
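As a simple picture of per-attribute budgeting, the sketch below releases each attribute of a record through the standard Laplace mechanism with its own epsilon. The paper's partial-DP definition is more general; the budgets and sensitivities here are made-up numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_release(value, sensitivity, epsilon):
    """Standard Laplace mechanism: epsilon-DP release of one value."""
    return value + rng.laplace(scale=sensitivity / epsilon)

# Per-attribute budgets: the more sensitive attribute gets a tighter
# (smaller) epsilon than a single whole-record budget would imply.
record = {"age": 37.0, "salary": 52_000.0}
budgets = {"age": 1.0, "salary": 0.2}
sensitivities = {"age": 100.0, "salary": 200_000.0}

noisy = {k: laplace_release(record[k], sensitivities[k], budgets[k])
         for k in record}
print(noisy)
```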
- Differential Privacy in Natural Language Processing: The Story So Far [21.844047604993687]
This paper aims to summarize the vulnerabilities addressed by Differential Privacy.
This topic has sparked novel research, which is unified in one basic goal: how can one adapt Differential Privacy to NLP methods?
arXiv Detail & Related papers (2022-08-17T08:15:44Z)
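For reference, the standard definition that the surveyed NLP adaptations start from: a randomized mechanism M is (ε, δ)-differentially private if, for all neighbouring datasets D, D' and all measurable output sets S,

```latex
\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta .
```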
- Just Fine-tune Twice: Selective Differential Privacy for Large Language Models [69.66654761324702]
We propose a simple yet effective just-fine-tune-twice privacy mechanism to achieve selective differential privacy (SDP) for large Transformer-based language models.
Experiments show that our models achieve strong performance while staying robust to the canary insertion attack.
arXiv Detail & Related papers (2022-04-15T22:36:55Z)
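Following the abstract's "just-fine-tune-twice" description, here is a schematic two-phase loop on a toy regression model: a non-private pass over redacted data, then DP-SGD (per-example clipping plus Gaussian noise) over the original data. Zeroing one feature stands in for text redaction, and the noise multiplier is arbitrary; this sketches the shape of the method, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_step(w, grads, lr=0.1):
    return w - lr * grads.mean(axis=0)

def dp_sgd_step(w, grads, lr=0.1, clip=1.0, sigma=1.0):
    """DP-SGD: clip each per-example gradient, average, add Gaussian noise."""
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    clipped = grads / np.maximum(1.0, norms / clip)
    noisy = clipped.mean(axis=0) + rng.normal(
        scale=sigma * clip / len(grads), size=w.shape)
    return w - lr * noisy

def grads(w, X, y):                   # squared-loss gradients, one row each
    return 2 * (X @ w - y)[:, None] * X

X, y = rng.normal(size=(32, 4)), rng.normal(size=32)
X_red = X.copy()
X_red[:, 0] = 0.0                     # stand-in for redacting sensitive text

w = np.zeros(4)
for _ in range(50):                   # phase 1: non-private, redacted data
    w = sgd_step(w, grads(w, X_red, y))
for _ in range(50):                   # phase 2: DP-SGD on the full data
    w = dp_sgd_step(w, grads(w, X, y))
print(w)
```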
- Selective Differential Privacy for Language Modeling [36.64464956102432]
Previous work has attempted to tackle this challenge by training RNN-based language models with differential privacy guarantees.
We propose a new privacy notion, selective differential privacy, to provide rigorous privacy guarantees on the sensitive portion of the data.
Experiments on both language modeling and dialog system building show that the proposed privacy-preserving mechanism achieves better utility.
arXiv Detail & Related papers (2021-08-30T01:11:10Z)
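Selective DP hinges on a policy function that marks which tokens are sensitive; a toy version is sketched below (the digit/@ rule is an illustrative assumption, not the paper's policy). During training, updates from positions flagged as sensitive would then be privatized, for example clipped and noised as in DP-SGD, while the remaining positions train normally.

```python
import re

# Illustrative policy function F: maps each token to 1 (sensitive) or 0.
# Selective DP then requires DP guarantees only on the sensitive positions.
SENSITIVE = re.compile(r"\d|@")      # toy rule: digits and email markers

def policy(tokens):
    return [1 if SENSITIVE.search(t) else 0 for t in tokens]

tokens = "my number is 555-1234 and I like tea".split()
print(list(zip(tokens, policy(tokens))))
```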
- KART: Privacy Leakage Framework of Language Models Pre-trained with Clinical Records [0.0]
We empirically evaluated the privacy risk of language models, using several BERT models pre-trained on the MIMIC-III corpus.
BERT models were probably low-risk because the Top-100 accuracy of each attack was far below what would be expected by chance.
We formalized various privacy leakage scenarios under a universal novel framework named Knowledge, Anonymization, Resource, and Target (KART) framework.
arXiv Detail & Related papers (2020-12-31T19:06:18Z)
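The abstract names the four KART dimensions, which suggests a simple way to encode a leakage scenario; the field contents below are hypothetical examples, not drawn from the paper.

```python
from dataclasses import dataclass

@dataclass
class KartScenario:
    """One privacy-leakage scenario along the four KART dimensions.
    Field contents are illustrative; the exact taxonomy is defined
    in the paper."""
    knowledge: str      # what the attacker already knows
    anonymization: str  # how the training records were de-identified
    resource: str       # attacker's access, e.g. model weights or an API
    target: str         # what the attacker tries to recover

scenario = KartScenario(
    knowledge="patient full names",
    anonymization="none",
    resource="pre-trained BERT weights",
    target="diseases linked to each name",
)
print(scenario)
```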