Related papers: Detection of Personal Data in Structured Datasets Using a Large Language Model

URL: http://arxiv.org/abs/2506.22305v1
Date: Fri, 27 Jun 2025 15:16:43 GMT
Title: Detection of Personal Data in Structured Datasets Using a Large Language Model
Authors: Albert Agisha Ntwali, Luca Rück, Martin Heckmann
Abstract summary: We propose a novel approach for detecting personal data in structured datasets, leveraging GPT-4o. We compare our approach to alternative methods, including Microsoft Presidio and CASSED, evaluating them on multiple datasets.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose a novel approach for detecting personal data in structured datasets, leveraging GPT-4o, a state-of-the-art Large Language Model. A key innovation of our method is the incorporation of contextual information: in addition to a feature's name and values, we utilize information from other feature names within the dataset as well as the dataset description. We compare our approach to alternative methods, including Microsoft Presidio and CASSED, evaluating them on multiple datasets: DeSSI, a large synthetic dataset, datasets we collected from Kaggle and OpenML as well as MIMIC-Demo-Ext, a real-world dataset containing patient information from critical care units. Our findings reveal that detection performance varies significantly depending on the dataset used for evaluation. CASSED excels on DeSSI, the dataset on which it was trained. Performance on the medical dataset MIMIC-Demo-Ext is comparable across all models, with our GPT-4o-based approach clearly outperforming the others. Notably, personal data detection in the Kaggle and OpenML datasets appears to benefit from contextual information. This is evidenced by the poor performance of CASSED and Presidio (both of which do not utilize the context of the dataset) compared to the strong results of our GPT-4o-based approach. We conclude that further progress in this field would greatly benefit from the availability of more real-world datasets containing personal information.
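
The contextual input described in the abstract (a feature's name and values, the other feature names, and the dataset description) can be illustrated with a minimal sketch. It only assembles a prompt string and makes no model call. The function name `build_context_prompt`, the prompt wording, the `max_values` cutoff, and the example column are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def build_context_prompt(
    feature_name: str,
    sample_values: Sequence[object],
    other_feature_names: Sequence[str],
    dataset_description: str,
    max_values: int = 5,
) -> str:
    """Assemble a prompt asking whether one column holds personal data.

    Hypothetical helper: the wording below is an illustration, not the
    prompt used by the authors.
    """
    # Show only the first few values to keep the prompt short.
    shown = ", ".join(repr(v) for v in list(sample_values)[:max_values])
    # Context signal 1: names of the other columns in the same dataset.
    others = ", ".join(other_feature_names) if other_feature_names else "(none)"
    return (
        "Decide whether the following dataset feature contains personal data.\n"
        # Context signal 2: the free-text dataset description.
        f"Dataset description: {dataset_description}\n"
        f"Other features in the dataset: {others}\n"
        f"Feature name: {feature_name}\n"
        f"Example values: {shown}\n"
        "Answer with exactly one word: personal or non-personal."
    )


# Made-up example column; the values are illustrative only.
prompt = build_context_prompt(
    feature_name="pid",
    sample_values=[101, 102, 103, 104, 105, 106],
    other_feature_names=["admission_date", "diagnosis", "age"],
    dataset_description="Records of patients admitted to a critical care unit.",
)
print(prompt)
```

An ambiguous name such as `pid` shows why context might help: read alone it could be a product or process identifier, while the neighbouring column names and the description point towards a patient identifier.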

Related papers

DataMIL: Selecting Data for Robot Imitation Learning with Datamodels [77.48472034791213]
We introduce DataMIL, a policy-driven data selection framework built on the datamodels paradigm. Unlike standard practices that filter data using human notions of quality, DataMIL directly optimizes data selection for task success. We validate our approach on a suite of more than 60 simulation and real-world manipulation tasks.
arXiv Detail & Related papers (2025-05-14T17:55:10Z)
Self-Comparison for Dataset-Level Membership Inference in Large (Vision-)Language Models [73.94175015918059]
We propose a dataset-level membership inference method based on Self-Comparison. Our method does not require access to ground-truth member data or non-member data in identical distribution.
arXiv Detail & Related papers (2024-10-16T23:05:59Z)
Metadata-based Data Exploration with Retrieval-Augmented Generation for Large Language Models [3.7685718201378746]
This research introduces a new architecture for data exploration which employs a form of Retrieval-Augmented Generation (RAG) to enhance metadata-based data discovery. The proposed framework offers a new method for evaluating semantic similarity among heterogeneous data sources.
arXiv Detail & Related papers (2024-10-05T17:11:37Z)
Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining. We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure. This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
arXiv Detail & Related papers (2024-07-20T17:14:31Z)
Proper Dataset Valuation by Pointwise Mutual Information [26.693741797887643]
We propose an information-theoretic framework for evaluating data curation methods. We define dataset quality in terms of its informativeness about the true model parameters. We show that the Blackwell order can be determined by the Shannon mutual information between the curated data and the test data.
arXiv Detail & Related papers (2024-05-28T15:04:17Z)
infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization. infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information. In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions [100.52917027038369]
We operationalize the task of recommending datasets given a short natural language description. To facilitate this task, we build the DataFinder dataset which consists of a larger automatically-constructed training set and a smaller expert-annotated evaluation set. This system, trained on the DataFinder dataset, finds more relevant search results than existing third-party dataset search engines.
arXiv Detail & Related papers (2023-05-26T05:22:36Z)
Revisiting Table Detection Datasets for Visually Rich Documents [17.846536373106268]
This study revisits some open datasets with high-quality annotations, identifies and cleans the noise, and aligns the annotation definitions of these datasets to merge a larger dataset, termed Open-Tables. To enrich the data sources, we propose a new ICT-TD dataset using the PDF files of Information and Communication Technologies (ICT) commodities, a different domain containing unique samples that hardly appear in open datasets. Our experimental results show that the domain differences among existing open datasets are minor despite having different data sources.
arXiv Detail & Related papers (2023-05-04T01:08:15Z)
Combining datasets to increase the number of samples and improve model fitting [7.4771091238795595]
We propose a novel framework called Combine datasets based on Imputation (ComImp). In addition, we propose a variant of ComImp that uses Principal Component Analysis (PCA), PCA-ComImp, in order to reduce dimension before combining datasets. Our results indicate that the proposed methods are somewhat similar to transfer learning in that the merge can significantly improve the accuracy of a prediction model on smaller datasets.
arXiv Detail & Related papers (2022-10-11T06:06:37Z)
Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding [137.3719377780593]
A new design (named Detection Hub) is dataset-aware and category-aligned. It mitigates the dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets. The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embedding.
arXiv Detail & Related papers (2022-06-07T17:59:44Z)
Cross-Dataset Collaborative Learning for Semantic Segmentation [17.55660581677053]
We present a simple, flexible, and general method for semantic segmentation, termed Cross-Dataset Collaborative Learning (CDCL). Given multiple labeled datasets, we aim to improve the generalization and discrimination of feature representations on each dataset. We conduct extensive evaluations on four diverse datasets, i.e., Cityscapes, BDD100K, CamVid, and COCO Stuff, with single-dataset and cross-dataset settings.
arXiv Detail & Related papers (2021-03-21T09:59:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.