PIIvot: A Lightweight NLP Anonymization Framework for Question-Anchored Tutoring Dialogues
- URL: http://arxiv.org/abs/2505.16931v1
- Date: Thu, 22 May 2025 17:22:28 GMT
- Title: PIIvot: A Lightweight NLP Anonymization Framework for Question-Anchored Tutoring Dialogues
- Authors: Matthew Zent, Digory Smith, Simon Woodhead
- Abstract summary: PIIvot is a framework for PII anonymization that leverages knowledge of the data context to simplify the PII detection problem. We also contribute QATD-2k, the largest open-source real-world tutoring dataset of its kind, to support the demand for quality educational dialogue data.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Personally identifiable information (PII) anonymization is a high-stakes task that poses a barrier to many open-science data sharing initiatives. While PII identification has made large strides in recent years, in practice, error thresholds and the recall/precision trade-off still limit the uptake of these anonymization pipelines. We present PIIvot, a lighter-weight framework for PII anonymization that leverages knowledge of the data context to simplify the PII detection problem. To demonstrate its effectiveness, we also contribute QATD-2k, the largest open-source real-world tutoring dataset of its kind, to support the demand for quality educational dialogue data.
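PIIvot's implementation is not reproduced in this listing, but the abstract's central idea, using knowledge of the data context to narrow the PII detection problem, can be illustrated with a minimal sketch. Everything below (the function name, the label set, and the use of session metadata as the "context") is an illustrative assumption, not PIIvot's actual API: in tutoring data, the platform typically already knows who the participants are, so detection can anchor on that metadata plus a few high-precision patterns rather than open-ended NER.

```python
import re
from typing import Dict, List

# High-precision patterns for structured PII.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
PHONE = re.compile(r"\b(?:\+?\d{1,2}[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}\b")

def anonymize_turn(text: str, known_pii: Dict[str, List[str]]) -> str:
    """Replace context-known PII values and pattern-matched PII with placeholders.

    known_pii maps a label (e.g. "STUDENT_NAME") to the literal values the
    platform already knows from session metadata -- the "data context".
    """
    # 1. Pattern matches first.
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    # 2. Exact, word-bounded matches against context-known values.
    for label, values in known_pii.items():
        for value in values:
            text = re.sub(rf"\b{re.escape(value)}\b", f"[{label}]",
                          text, flags=re.IGNORECASE)
    return text

print(anonymize_turn(
    "Hi Maria, email me at maria.g@example.com about question 4.",
    {"STUDENT_NAME": ["Maria Gonzalez", "Maria"]},
))
# -> Hi [STUDENT_NAME], email me at [EMAIL] about question 4.
```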
Related papers
- Following the Clues: Experiments on Person Re-ID using Cross-Modal Intelligence (arXiv, 2025-07-02)
  CRID is a cross-modal framework combining Large Vision-Language Models, Graph Attention Networks, and representation learning. Our approach focuses on identifying and leveraging interpretable features, enabling the detection of semantically meaningful PII beyond low-level appearance cues. Our experiments show improved performance in practical cross-dataset Re-ID scenarios.
- Self-Refining Language Model Anonymizers via Adversarial Distillation (arXiv, 2025-06-02)
  Large language models (LLMs) are increasingly used in sensitive domains, where their ability to infer personal data poses emerging privacy risks. We introduce SElf-refining Anonymization with Language model (SEAL), a novel distillation framework for training small language models (SLMs) to perform effective anonymization.
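  The summary above only names the training recipe; the following is a hedged sketch of what an adversarial-distillation data-collection loop could look like, with placeholder callables standing in for the LLM teacher, the inference adversary, and the downstream fine-tuning data. It is not the SEAL authors' code.

```python
from typing import Callable, List, Tuple

def build_distillation_set(
    texts: List[str],
    teacher_anonymize: Callable[[str], str],      # LLM rewriter (placeholder)
    adversary_infers_pii: Callable[[str], bool],  # inference attacker (placeholder)
    max_rounds: int = 3,
) -> List[Tuple[str, str]]:
    """Collect (original, rewrite) pairs the adversary can no longer attack."""
    pairs: List[Tuple[str, str]] = []
    for text in texts:
        candidate = text
        for _ in range(max_rounds):
            candidate = teacher_anonymize(candidate)  # teacher refines the text
            if not adversary_infers_pii(candidate):   # adversarial check passed
                pairs.append((text, candidate))       # keep as SLM training data
                break
    return pairs

# Toy usage with stand-in callables:
demo = build_distillation_set(
    ["I am Sam, a nurse in Leeds."],
    teacher_anonymize=lambda t: t.replace("Sam", "[NAME]").replace("Leeds", "[CITY]"),
    adversary_infers_pii=lambda t: "Sam" in t or "Leeds" in t,
)
print(demo)  # [('I am Sam, a nurse in Leeds.', 'I am [NAME], a nurse in [CITY].')]
```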
- Augmenting Anonymized Data with AI: Exploring the Feasibility and Limitations of Large Language Models in Data Enrichment (arXiv, 2025-04-03)
  Large Language Models (LLMs) have demonstrated advanced capabilities in both text generation and comprehension. Their application to data archives might facilitate the privatization of sensitive information about the data subjects. This data, if not safeguarded, may bring privacy risks in terms of both disclosure and identification.
- P2NIA: Privacy-Preserving Non-Iterative Auditing (arXiv, 2025-04-01)
  The emergence of AI legislation has increased the need to assess the ethical compliance of high-risk AI systems. Traditional auditing methods rely on platforms' application programming interfaces (APIs). We present P2NIA, a novel auditing scheme that proposes a mutually beneficial collaboration for both the auditor and the platform.
- PII-Bench: Evaluating Query-Aware Privacy Protection Systems (arXiv, 2025-02-25)
  We propose a query-unrelated PII masking strategy and introduce PII-Bench, the first comprehensive evaluation framework for assessing privacy protection systems. PII-Bench comprises 2,842 test samples across 55 fine-grained PII categories, featuring diverse scenarios from single-subject descriptions to complex multi-party interactions.
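  Query-aware masking, as summarized above, keeps only the PII categories a query legitimately needs and masks the rest. A minimal sketch, assuming PII spans have already been detected (span detection itself is the hard part the benchmark evaluates); the names here are illustrative, not PII-Bench's API:

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, pii_category)

def mask_query_unrelated(text: str, spans: List[Span], needed: Set[str]) -> str:
    """Mask every detected PII span whose category the query does not need."""
    out, cursor = [], 0
    for start, end, category in sorted(spans):
        out.append(text[cursor:start])
        # Keep the span only if the query legitimately requires this category.
        out.append(text[start:end] if category in needed else f"[{category}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

text = "Call Dana at 555-0100 about invoice 42."
spans = [(5, 9, "NAME"), (13, 21, "PHONE")]
# A query like "what number should I call?" needs the phone, not the name.
print(mask_query_unrelated(text, spans, needed={"PHONE"}))
# -> Call [NAME] at 555-0100 about invoice 42.
```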
- Robust Utility-Preserving Text Anonymization Based on Large Language Models (arXiv, 2024-07-16)
  Text anonymization is crucial for sharing sensitive data while maintaining privacy. Existing techniques face the emerging challenge of Large Language Models' re-identification attack capability. This paper proposes a framework composed of three LLM-based components: a privacy evaluator, a utility evaluator, and an optimization component.
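  The three-component design above suggests a simple feedback loop: an optimizer rewrites the text until both evaluators are satisfied. A hedged skeleton with all models abstracted as callables; none of this is the paper's actual prompting or code, and the thresholds are arbitrary:

```python
from typing import Callable

def anonymize_with_feedback(
    text: str,
    rewrite: Callable[[str, str], str],     # optimizer: (text, feedback) -> new text
    privacy_score: Callable[[str], float],  # higher = harder to re-identify
    utility_score: Callable[[str], float],  # higher = more meaning preserved
    privacy_floor: float = 0.9,
    utility_floor: float = 0.8,
    max_iters: int = 5,
) -> str:
    """Iteratively rewrite until both evaluators pass, or the budget runs out."""
    candidate, feedback = text, "initial pass"
    for _ in range(max_iters):
        candidate = rewrite(candidate, feedback)
        p, u = privacy_score(candidate), utility_score(candidate)
        if p >= privacy_floor and u >= utility_floor:
            return candidate  # both evaluators satisfied
        feedback = f"privacy={p:.2f}, utility={u:.2f}; revise accordingly"
    return candidate  # best effort after max_iters
```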
- Disentangle Before Anonymize: A Two-stage Framework for Attribute-preserved and Occlusion-robust De-identification (arXiv, 2023-11-15)
  "Disentangle Before Anonymize" is a novel two-stage framework (DBAF) that includes a Contrastive Identity Disentanglement (CID) module and a Key-authorized Reversible Identity Anonymization (KRIA) module. Extensive experiments demonstrate that our method outperforms state-of-the-art de-identification approaches.
- ProPILE: Probing Privacy Leakage in Large Language Models (arXiv, 2023-07-04)
  Large language models (LLMs) are often trained on vast quantities of web-collected data, which may inadvertently include sensitive personal data. This paper presents ProPILE, a novel probing tool designed to empower data subjects, or the owners of the PII, with awareness of potential PII leakage.
- Dual Semantic Knowledge Composed Multimodal Dialog Systems (arXiv, 2023-05-17)
  We propose a novel multimodal task-oriented dialog system named MDS-S2. It acquires context-related attribute and relation knowledge from the knowledge base. We also devise a set of latent query variables to distill the semantic information from the composed response representation.
- DP2-Pub: Differentially Private High-Dimensional Data Publication with Invariant Post Randomization (arXiv, 2022-08-24)
  We propose a differentially private high-dimensional data publication mechanism (DP2-Pub) that runs in two phases. Splitting attributes into several low-dimensional clusters with high intra-cluster cohesion and low inter-cluster coupling helps obtain a reasonable privacy budget. We also extend our DP2-Pub mechanism to the scenario with a semi-honest server which satisfies local differential privacy.
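  The budget intuition in the summary above can be made concrete with a toy example: under sequential composition, an overall epsilon is split across the attribute clusters, and each cluster's marginal is released via the Laplace mechanism. This is a generic differential-privacy sketch, not the DP2-Pub mechanism itself; cluster names and counts are invented:

```python
import numpy as np

def noisy_marginal(counts: np.ndarray, epsilon: float,
                   sensitivity: float = 1.0) -> np.ndarray:
    """Laplace mechanism on a histogram: add Laplace(sensitivity/epsilon) noise."""
    noise = np.random.laplace(0.0, sensitivity / epsilon, size=counts.shape)
    return np.clip(counts + noise, 0, None)  # counts cannot be negative

total_epsilon = 1.0
clusters = {
    "demographics": np.array([120.0, 80.0, 40.0]),
    "usage":        np.array([60.0, 90.0, 90.0]),
}
# Equal split of the budget across clusters (sequential composition).
eps_each = total_epsilon / len(clusters)
released = {name: noisy_marginal(c, eps_each) for name, c in clusters.items()}
print(released)
```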
- Automated PII Extraction from Social Media for Raising Privacy Awareness: A Deep Transfer Learning Approach (arXiv, 2021-11-11)
  Internet users have been exposing an increasing amount of Personally Identifiable Information (PII) on social media. In this study, we propose the Deep Transfer Learning for PII Extraction (DTL-PIIE) framework to address the limitations of existing PII extraction approaches. Our framework can facilitate various applications, such as PII misuse prediction and privacy risk assessment.
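  The transfer-learning idea, as distinct from the paper's specific architecture (which the summary does not detail), can be sketched as fine-tuning a pretrained encoder with a fresh token-classification head over PII tags. This is a plain fine-tuning baseline; the model name and label set below are illustrative assumptions:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Illustrative BIO label scheme for PII spans in social-media text.
labels = ["O", "B-NAME", "I-NAME", "B-LOCATION", "I-LOCATION", "B-PHONE", "I-PHONE"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(labels),  # new head; pretrained encoder weights are transferred
)
# Fine-tuning would proceed with a standard token-classification objective
# over PII-annotated posts, e.g. via transformers.Trainer.
```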
- An Empirical Survey of Data Augmentation for Limited Data Learning in NLP (arXiv, 2021-06-14)
  Dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks. Data augmentation methods have been explored as a means of improving data efficiency in NLP. We provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting.
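  As a concrete instance of one augmentation family such a survey covers, token-level synonym replacement can be sketched in a few lines; the synonym table here is a toy stand-in for a real thesaurus such as WordNet:

```python
import random

SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad", "joyful"]}

def synonym_replace(sentence: str, p: float = 0.3, seed: int = 0) -> str:
    """Replace each known token with a random synonym with probability p."""
    rng = random.Random(seed)  # seeded for reproducible augmentations
    out = []
    for tok in sentence.split():
        if tok.lower() in SYNONYMS and rng.random() < p:
            out.append(rng.choice(SYNONYMS[tok.lower()]))
        else:
            out.append(tok)
    return " ".join(out)

print(synonym_replace("the quick student was happy", p=1.0))
# -> e.g. "the fast student was glad"
```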
- Improving Limited Labeled Dialogue State Tracking with Self-Supervision (arXiv, 2020-10-26)
  Existing dialogue state tracking (DST) models require plenty of labeled data. We present and investigate two self-supervised objectives: preserving latent consistency and modeling conversational behavior. Our proposed self-supervised signals can improve joint goal accuracy by 8.95% when only 1% of the labeled data is used.
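  The "preserving latent consistency" objective named above is commonly implemented as a consistency loss between predictions on an utterance and a lightly perturbed copy (e.g. word dropout); a generic sketch of that idea, not the authors' code:

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_clean: torch.Tensor,
                     logits_perturbed: torch.Tensor) -> torch.Tensor:
    """KL divergence pushing the perturbed view toward the clean view."""
    p = F.log_softmax(logits_perturbed, dim=-1)
    q = F.softmax(logits_clean, dim=-1).detach()  # clean view as soft target
    return F.kl_div(p, q, reduction="batchmean")

# Toy usage with random logits standing in for two forward passes:
loss = consistency_loss(torch.randn(4, 10), torch.randn(4, 10))
print(loss.item())
```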