PIIvot: A Lightweight NLP Anonymization Framework for Question-Anchored Tutoring Dialogues
- URL: http://arxiv.org/abs/2505.16931v1
- Date: Thu, 22 May 2025 17:22:28 GMT
- Title: PIIvot: A Lightweight NLP Anonymization Framework for Question-Anchored Tutoring Dialogues
- Authors: Matthew Zent, Digory Smith, Simon Woodhead
- Abstract summary: PIIvot is a framework for PII anonymization that leverages knowledge of the data context to simplify the PII detection problem. We also contribute QATD-2k, the largest open-source real-world tutoring dataset of its kind, to support the demand for quality educational dialogue data.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Personally identifiable information (PII) anonymization is a high-stakes task that poses a barrier to many open-science data sharing initiatives. While PII identification has made large strides in recent years, in practice, error thresholds and the recall/precision trade-off still limit the uptake of these anonymization pipelines. We present PIIvot, a lighter-weight framework for PII anonymization that leverages knowledge of the data context to simplify the PII detection problem. To demonstrate its effectiveness, we also contribute QATD-2k, the largest open-source real-world tutoring dataset of its kind, to support the demand for quality educational dialogue data.
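PIIvot's implementation is not reproduced in this listing, but the abstract's central idea, using knowledge of the data context to narrow the PII detection problem, can be illustrated with a minimal sketch. Everything below (the function name, the label set, and the use of session metadata as the "context") is an illustrative assumption, not PIIvot's actual API: in tutoring data, the platform typically already knows who the participants are, so detection can anchor on that metadata plus a few high-precision patterns rather than open-ended NER.

```python
import re
from typing import Dict, List

# High-precision patterns for structured PII.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
PHONE = re.compile(r"\b(?:\+?\d{1,2}[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}\b")

def anonymize_turn(text: str, known_pii: Dict[str, List[str]]) -> str:
    """Replace context-known PII values and pattern-matched PII with placeholders.

    known_pii maps a label (e.g. "STUDENT_NAME") to the literal values the
    platform already knows from session metadata -- the "data context".
    """
    # 1. Pattern matches first.
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    # 2. Exact, word-bounded matches against context-known values.
    for label, values in known_pii.items():
        for value in values:
            text = re.sub(rf"\b{re.escape(value)}\b", f"[{label}]",
                          text, flags=re.IGNORECASE)
    return text

print(anonymize_turn(
    "Hi Maria, email me at maria.g@example.com about question 4.",
    {"STUDENT_NAME": ["Maria Gonzalez", "Maria"]},
))
# -> Hi [STUDENT_NAME], email me at [EMAIL] about question 4.
```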
Related papers
- Following the Clues: Experiments on Person Re-ID using Cross-Modal Intelligence (arXiv, 2025-07-02)
  CRID is a cross-modal framework combining Large Vision-Language Models, Graph Attention Networks, and representation learning. Our approach focuses on identifying and leveraging interpretable features, enabling the detection of semantically meaningful PII beyond low-level appearance cues. Our experiments show improved performance in practical cross-dataset Re-ID scenarios.
- Self-Refining Language Model Anonymizers via Adversarial Distillation (arXiv, 2025-06-02)
  Large language models (LLMs) are increasingly used in sensitive domains, where their ability to infer personal data poses emerging privacy risks. We introduce SElf-refining Anonymization with Language model (SEAL), a novel distillation framework for training small language models (SLMs) to perform effective anonymization.
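  The summary above only names the training recipe; the following is a hedged sketch of what an adversarial-distillation data-collection loop could look like, with placeholder callables standing in for the LLM teacher, the inference adversary, and the downstream fine-tuning data. It is not the SEAL authors' code.

```python
from typing import Callable, List, Tuple

def build_distillation_set(
    texts: List[str],
    teacher_anonymize: Callable[[str], str],      # LLM rewriter (placeholder)
    adversary_infers_pii: Callable[[str], bool],  # inference attacker (placeholder)
    max_rounds: int = 3,
) -> List[Tuple[str, str]]:
    """Collect (original, rewrite) pairs the adversary can no longer attack."""
    pairs: List[Tuple[str, str]] = []
    for text in texts:
        candidate = text
        for _ in range(max_rounds):
            candidate = teacher_anonymize(candidate)  # teacher refines the text
            if not adversary_infers_pii(candidate):   # adversarial check passed
                pairs.append((text, candidate))       # keep as SLM training data
                break
    return pairs

# Toy usage with stand-in callables:
demo = build_distillation_set(
    ["I am Sam, a nurse in Leeds."],
    teacher_anonymize=lambda t: t.replace("Sam", "[NAME]").replace("Leeds", "[CITY]"),
    adversary_infers_pii=lambda t: "Sam" in t or "Leeds" in t,
)
print(demo)  # [('I am Sam, a nurse in Leeds.', 'I am [NAME], a nurse in [CITY].')]
```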
- Augmenting Anonymized Data with AI: Exploring the Feasibility and Limitations of Large Language Models in Data Enrichment (arXiv, 2025-04-03)
  Large Language Models (LLMs) have demonstrated advanced capabilities in both text generation and comprehension. Their application to data archives might facilitate the privatization of sensitive information about the data subjects. This data, if not safeguarded, may bring privacy risks in terms of both disclosure and identification.
- P2NIA: Privacy-Preserving Non-Iterative Auditing (arXiv, 2025-04-01)
  The emergence of AI legislation has increased the need to assess the ethical compliance of high-risk AI systems. Traditional auditing methods rely on platforms' application programming interfaces (APIs). We present P2NIA, a novel auditing scheme that proposes a mutually beneficial collaboration for both the auditor and the platform.
- PII-Bench: Evaluating Query-Aware Privacy Protection Systems (arXiv, 2025-02-25)
  We propose a query-unrelated PII masking strategy and introduce PII-Bench, the first comprehensive evaluation framework for assessing privacy protection systems. PII-Bench comprises 2,842 test samples across 55 fine-grained PII categories, featuring diverse scenarios from single-subject descriptions to complex multi-party interactions.
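  Query-aware masking, as summarized above, keeps only the PII categories a query legitimately needs and masks the rest. A minimal sketch, assuming PII spans have already been detected (span detection itself is the hard part the benchmark evaluates); the names here are illustrative, not PII-Bench's API:

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, pii_category)

def mask_query_unrelated(text: str, spans: List[Span], needed: Set[str]) -> str:
    """Mask every detected PII span whose category the query does not need."""
    out, cursor = [], 0
    for start, end, category in sorted(spans):
        out.append(text[cursor:start])
        # Keep the span only if the query legitimately requires this category.
        out.append(text[start:end] if category in needed else f"[{category}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

text = "Call Dana at 555-0100 about invoice 42."
spans = [(5, 9, "NAME"), (13, 21, "PHONE")]
# A query like "what number should I call?" needs the phone, not the name.
print(mask_query_unrelated(text, spans, needed={"PHONE"}))
# -> Call [NAME] at 555-0100 about invoice 42.
```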
- Robust Utility-Preserving Text Anonymization Based on Large Language Models (arXiv, 2024-07-16)
  Text anonymization is crucial for sharing sensitive data while maintaining privacy. Existing techniques face the emerging challenge of Large Language Models' re-identification attack capability. This paper proposes a framework composed of three LLM-based components: a privacy evaluator, a utility evaluator, and an optimization component.
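  The three-component design above suggests a simple feedback loop: an optimizer rewrites the text until both evaluators are satisfied. A hedged skeleton with all models abstracted as callables; none of this is the paper's actual prompting or code, and the thresholds are arbitrary:

```python
from typing import Callable

def anonymize_with_feedback(
    text: str,
    rewrite: Callable[[str, str], str],     # optimizer: (text, feedback) -> new text
    privacy_score: Callable[[str], float],  # higher = harder to re-identify
    utility_score: Callable[[str], float],  # higher = more meaning preserved
    privacy_floor: float = 0.9,
    utility_floor: float = 0.8,
    max_iters: int = 5,
) -> str:
    """Iteratively rewrite until both evaluators pass, or the budget runs out."""
    candidate, feedback = text, "initial pass"
    for _ in range(max_iters):
        candidate = rewrite(candidate, feedback)
        p, u = privacy_score(candidate), utility_score(candidate)
        if p >= privacy_floor and u >= utility_floor:
            return candidate  # both evaluators satisfied
        feedback = f"privacy={p:.2f}, utility={u:.2f}; revise accordingly"
    return candidate  # best effort after max_iters
```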
- Disentangle Before Anonymize: A Two-stage Framework for Attribute-preserved and Occlusion-robust De-identification (arXiv, 2023-11-15)
  "Disentangle Before Anonymize" is a novel two-stage framework (DBAF) that includes a Contrastive Identity Disentanglement (CID) module and a Key-authorized Reversible Identity Anonymization (KRIA) module. Extensive experiments demonstrate that our method outperforms state-of-the-art de-identification approaches.
- ProPILE: Probing Privacy Leakage in Large Language Models (arXiv, 2023-07-04)
  Large language models (LLMs) are often trained on vast quantities of web-collected data, which may inadvertently include sensitive personal data. This paper presents ProPILE, a novel probing tool designed to empower data subjects, or the owners of the PII, with awareness of potential PII leakage.
- Dual Semantic Knowledge Composed Multimodal Dialog Systems (arXiv, 2023-05-17)
  We propose a novel multimodal task-oriented dialog system named MDS-S2. It acquires context-related attribute and relation knowledge from the knowledge base. We also devise a set of latent query variables to distill the semantic information from the composed response representation.
- DP2-Pub: Differentially Private High-Dimensional Data Publication with Invariant Post Randomization (arXiv, 2022-08-24)
  We propose a differentially private high-dimensional data publication mechanism (DP2-Pub) that runs in two phases. Splitting attributes into several low-dimensional clusters with high intra-cluster cohesion and low inter-cluster coupling helps obtain a reasonable privacy budget. We also extend our DP2-Pub mechanism to the scenario with a semi-honest server which satisfies local differential privacy.
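  The budget intuition in the summary above can be made concrete with a toy example: under sequential composition, an overall epsilon is split across the attribute clusters, and each cluster's marginal is released via the Laplace mechanism. This is a generic differential-privacy sketch, not the DP2-Pub mechanism itself; cluster names and counts are invented:

```python
import numpy as np

def noisy_marginal(counts: np.ndarray, epsilon: float,
                   sensitivity: float = 1.0) -> np.ndarray:
    """Laplace mechanism on a histogram: add Laplace(sensitivity/epsilon) noise."""
    noise = np.random.laplace(0.0, sensitivity / epsilon, size=counts.shape)
    return np.clip(counts + noise, 0, None)  # counts cannot be negative

total_epsilon = 1.0
clusters = {
    "demographics": np.array([120.0, 80.0, 40.0]),
    "usage":        np.array([60.0, 90.0, 90.0]),
}
# Equal split of the budget across clusters (sequential composition).
eps_each = total_epsilon / len(clusters)
released = {name: noisy_marginal(c, eps_each) for name, c in clusters.items()}
print(released)
```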
- Automated PII Extraction from Social Media for Raising Privacy Awareness: A Deep Transfer Learning Approach (arXiv, 2021-11-11)
  Internet users have been exposing an increasing amount of Personally Identifiable Information (PII) on social media. In this study, we propose the Deep Transfer Learning for PII Extraction (DTL-PIIE) framework to address the limitations of existing PII extraction approaches. Our framework can facilitate various applications, such as PII misuse prediction and privacy risk assessment.
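  The transfer-learning idea, as distinct from the paper's specific architecture (which the summary does not detail), can be sketched as fine-tuning a pretrained encoder with a fresh token-classification head over PII tags. This is a plain fine-tuning baseline; the model name and label set below are illustrative assumptions:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Illustrative BIO label scheme for PII spans in social-media text.
labels = ["O", "B-NAME", "I-NAME", "B-LOCATION", "I-LOCATION", "B-PHONE", "I-PHONE"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(labels),  # new head; pretrained encoder weights are transferred
)
# Fine-tuning would proceed with a standard token-classification objective
# over PII-annotated posts, e.g. via transformers.Trainer.
```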
- An Empirical Survey of Data Augmentation for Limited Data Learning in NLP (arXiv, 2021-06-14)
  Dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks. Data augmentation methods have been explored as a means of improving data efficiency in NLP. We provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting.
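  As a concrete instance of one augmentation family such a survey covers, token-level synonym replacement can be sketched in a few lines; the synonym table here is a toy stand-in for a real thesaurus such as WordNet:

```python
import random

SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad", "joyful"]}

def synonym_replace(sentence: str, p: float = 0.3, seed: int = 0) -> str:
    """Replace each known token with a random synonym with probability p."""
    rng = random.Random(seed)  # seeded for reproducible augmentations
    out = []
    for tok in sentence.split():
        if tok.lower() in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[tok.lower()]))
        else:
            out.append(tok)
    return " ".join(out)

print(synonym_replace("the quick student was happy", p=1.0))
# -> e.g. "the fast student was glad"
```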
- Improving Limited Labeled Dialogue State Tracking with Self-Supervision (arXiv, 2020-10-26)
  Existing dialogue state tracking (DST) models require plenty of labeled data. We present and investigate two self-supervised objectives: preserving latent consistency and modeling conversational behavior. Our proposed self-supervised signals can improve joint goal accuracy by 8.95% when only 1% of the labeled data is used.
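  The "preserving latent consistency" objective named above is commonly implemented as a consistency loss between predictions on an utterance and a lightly perturbed copy (e.g. word dropout); a generic sketch of that idea, not the authors' code:

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_clean: torch.Tensor,
                     logits_perturbed: torch.Tensor) -> torch.Tensor:
    """KL divergence pushing the perturbed view toward the clean view."""
    p = F.log_softmax(logits_perturbed, dim=-1)
    q = F.softmax(logits_clean, dim=-1).detach()  # clean view as soft target
    return F.kl_div(p, q, reduction="batchmean")

# Toy usage with random logits standing in for two forward passes:
loss = consistency_loss(torch.randn(4, 10), torch.randn(4, 10))
print(loss.item())
```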