Enhancing the De-identification of Personally Identifiable Information in Educational Data
- URL: http://arxiv.org/abs/2501.09765v1
- Date: Tue, 14 Jan 2025 20:53:38 GMT
- Title: Enhancing the De-identification of Personally Identifiable Information in Educational Data
- Authors: Y. Shen, Z. Ji, J. Lin, K. R. Koedginer,
- Abstract summary: PII, such as names, is a critical requirement to safeguard student and teacher privacy and maintain trust.<n>Our study investigates the GPT-4o-mini model as a cost-effective and efficient solution for PII detection tasks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Protecting Personally Identifiable Information (PII), such as names, is a critical requirement in learning technologies to safeguard student and teacher privacy and maintain trust. Accurate PII detection is an essential step toward anonymizing sensitive information while preserving the utility of educational data. Motivated by recent advancements in artificial intelligence, our study investigates the GPT-4o-mini model as a cost-effective and efficient solution for PII detection tasks. We explore both prompting and fine-tuning approaches and compare GPT-4o-mini's performance against established frameworks, including Microsoft Presidio and Azure AI Language. Our evaluation on two public datasets, CRAPII and TSCC, demonstrates that the fine-tuned GPT-4o-mini model achieves superior performance, with a recall of 0.9589 on CRAPII. Additionally, fine-tuned GPT-4o-mini significantly improves precision scores (a threefold increase) while reducing computational costs to nearly one-tenth of those associated with Azure AI Language. Furthermore, our bias analysis reveals that the fine-tuned GPT-4o-mini model consistently delivers accurate results across diverse cultural backgrounds and genders. The generalizability analysis using the TSCC dataset further highlights its robustness, achieving a recall of 0.9895 with minimal additional training data from TSCC. These results emphasize the potential of fine-tuned GPT-4o-mini as an accurate and cost-effective tool for PII detection in educational data. It offers robust privacy protection while preserving the data's utility for research and pedagogical analysis. Our code is available on GitHub: https://github.com/AnonJD/PrivacyAI
Related papers
- Privacy-Preserved Automated Scoring using Federated Learning for Educational Research [1.2556373621040728]
This study proposes a federated learning framework for automatic scoring in educational assessments.
Student responses are processed locally on edge devices, and only optimized model parameters are shared with a central aggregation server.
We evaluate our framework using assessment data from nine middle schools, comparing the accuracy of federated learning-based scoring models with traditionally trained centralized models.
arXiv Detail & Related papers (2025-03-12T19:06:25Z) - Exploring ChatGPT for Face Presentation Attack Detection in Zero and Few-Shot in-Context Learning [6.537257913467247]
This study highlights the potential of ChatGPT (specifically GPT-4o) as a competitive alternative for Face Presentation Attack Detection (PAD)
Our results show that GPT-4o demonstrates high consistency, particularly in few-shot in-context learning.
Remarkably, the model exhibits emergent reasoning capabilities, correctly predicting the attack type (print or replay) with high accuracy in few-shot scenarios.
arXiv Detail & Related papers (2025-01-15T13:46:33Z) - Pseudo-Probability Unlearning: Towards Efficient and Privacy-Preserving Machine Unlearning [59.29849532966454]
We propose PseudoProbability Unlearning (PPU), a novel method that enables models to forget data to adhere to privacy-preserving manner.
Our method achieves over 20% improvements in forgetting error compared to the state-of-the-art.
arXiv Detail & Related papers (2024-11-04T21:27:06Z) - ExACT: Teaching AI Agents to Explore with Reflective-MCTS and Exploratory Learning [78.42927884000673]
ExACT is an approach to combine test-time search and self-learning to build o1-like models for agentic applications.<n>We first introduce Reflective Monte Carlo Tree Search (R-MCTS), a novel test time algorithm designed to enhance AI agents' ability to explore decision space on the fly.<n>Next, we introduce Exploratory Learning, a novel learning strategy to teach agents to search at inference time without relying on any external search algorithms.
arXiv Detail & Related papers (2024-10-02T21:42:35Z) - Differential Privacy Mechanisms in Neural Tangent Kernel Regression [29.187250620950927]
We study differential privacy (DP) in the Neural Tangent Kernel (NTK) regression setting.
We show provable guarantees for both differential privacy and test accuracy of our NTK regression.
To our knowledge, this is the first work to provide a DP guarantee for NTK regression.
arXiv Detail & Related papers (2024-07-18T15:57:55Z) - Initial Exploration of Zero-Shot Privacy Utility Tradeoffs in Tabular Data Using GPT-4 [2.54365580380609]
We investigate the application of large language models (LLMs) to scenarios involving the tradeoff between privacy and utility in tabular data.
Our approach entails prompting GPT-4 by transforming data points into textual format, followed by the inclusion of precise sanitization instructions in a zero-shot manner.
We find that this relatively simple approach yields performance comparable to more complex adversarial optimization methods used for managing privacy-utility tradeoffs.
arXiv Detail & Related papers (2024-04-07T19:02:50Z) - Automated Root Causing of Cloud Incidents using In-Context Learning with
GPT-4 [23.856839017006386]
Root Cause Analysis (RCA) plays a pivotal role in the incident diagnosis process for cloud services.
GPT-4 model's immense size presents challenges when trying to fine-tune it on user data.
We propose an in-context learning approach for automated root causing, which eliminates the need for fine-tuning.
arXiv Detail & Related papers (2024-01-24T21:02:07Z) - Using GPT-4 to Augment Unbalanced Data for Automatic Scoring [0.5586073503694489]
We introduce a novel text data augmentation framework leveraging GPT-4, a generative large language model.
We crafted prompts for GPT-4 to generate responses, especially for minority scoring classes.
We finetuned DistillBERT for automatic scoring based on the augmented and original datasets.
arXiv Detail & Related papers (2023-10-25T01:07:50Z) - On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z) - PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind)
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z) - An Experimental Study on Private Aggregation of Teacher Ensemble
Learning for End-to-End Speech Recognition [51.232523987916636]
Differential privacy (DP) is one data protection avenue to safeguard user information used for training deep models by imposing noisy distortion on privacy data.
In this work, we extend PATE learning to work with dynamic patterns, namely speech, and perform one very first experimental study on ASR to avoid acoustic data leakage.
arXiv Detail & Related papers (2022-10-11T16:55:54Z) - Exploring Inconsistent Knowledge Distillation for Object Detection with
Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z) - Deep Reinforcement Learning Assisted Federated Learning Algorithm for
Data Management of IIoT [82.33080550378068]
The continuous expanded scale of the industrial Internet of Things (IIoT) leads to IIoT equipments generating massive amounts of user data every moment.
How to manage these time series data in an efficient and safe way in the field of IIoT is still an open issue.
This paper studies the FL technology applications to manage IIoT equipment data in wireless network environments.
arXiv Detail & Related papers (2022-02-03T07:12:36Z) - DAGA: Data Augmentation with a Generation Approach for Low-resource
Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.