Re-identification of De-identified Documents with Autoregressive Infilling
- URL: http://arxiv.org/abs/2505.12859v1
- Date: Mon, 19 May 2025 08:43:54 GMT
- Title: Re-identification of De-identified Documents with Autoregressive Infilling
- Authors: Lucas Georges Gabriel Charpentier, Pierre Lison,
- Abstract summary: We present a novel, RAG-inspired approach that attempts the reverse process of re-identification based on a database of documents representing background knowledge.<n>As many as 80% of de-identified text spans can be successfully recovered and the re-identification accuracy increases along with the level of background knowledge.
- Score: 2.2406151150434894
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Documents revealing sensitive information about individuals must typically be de-identified. This de-identification is often done by masking all mentions of personally identifiable information (PII), thereby making it more difficult to uncover the identity of the person(s) in question. To investigate the robustness of de-identification methods, we present a novel, RAG-inspired approach that attempts the reverse process of re-identification based on a database of documents representing background knowledge. Given a text in which personal identifiers have been masked, the re-identification proceeds in two steps. A retriever first selects from the background knowledge passages deemed relevant for the re-identification. Those passages are then provided to an infilling model which seeks to infer the original content of each text span. This process is repeated until all masked spans are replaced. We evaluate the re-identification on three datasets (Wikipedia biographies, court rulings and clinical notes). Results show that (1) as many as 80% of de-identified text spans can be successfully recovered and (2) the re-identification accuracy increases along with the level of background knowledge.
Related papers
- Keypoint Promptable Re-Identification [76.31113049256375]
Occluded Person Re-Identification (ReID) is a metric learning task that involves matching occluded individuals based on their appearance.
We introduce Keypoint Promptable ReID (KPR), a novel formulation of the ReID problem that explicitly complements the input bounding box with a set of semantic keypoints.
We release custom keypoint labels for four popular ReID benchmarks. Experiments on person retrieval, but also on pose tracking, demonstrate that our method systematically surpasses previous state-of-the-art approaches.
arXiv Detail & Related papers (2024-07-25T15:20:58Z) - StableIdentity: Inserting Anybody into Anywhere at First Sight [57.99693188913382]
We propose StableIdentity, which allows identity-consistent recontextualization with just one face image.
We are the first to directly inject the identity learned from a single image into video/3D generation without finetuning.
arXiv Detail & Related papers (2024-01-29T09:06:15Z) - Disentangle Before Anonymize: A Two-stage Framework for Attribute-preserved and Occlusion-robust De-identification [55.741525129613535]
"Disentangle Before Anonymize" is a novel two-stage Framework(DBAF)<n>This framework includes a Contrastive Identity Disentanglement (CID) module and a Key-authorized Reversible Identity Anonymization (KRIA) module.<n>Extensive experiments demonstrate that our method outperforms state-of-the-art de-identification approaches.
arXiv Detail & Related papers (2023-11-15T08:59:02Z) - Neural Text Sanitization with Privacy Risk Indicators: An Empirical
Analysis [2.9311414545087366]
We consider a two-step approach to text sanitization and provide a detailed analysis of its empirical performance.
The text sanitization process starts with a privacy-oriented entity recognizer that seeks to determine the text spans expressing identifiable personal information.
We present five distinct indicators of the re-identification risk, respectively based on language model probabilities, text span classification, sequence labelling, perturbations, and web search.
arXiv Detail & Related papers (2023-10-22T14:17:27Z) - Multiview Identifiers Enhanced Generative Retrieval [78.38443356800848]
generative retrieval generates identifier strings of passages as the retrieval target.
We propose a new type of identifier, synthetic identifiers, that are generated based on the content of a passage.
Our proposed approach performs the best in generative retrieval, demonstrating its effectiveness and robustness.
arXiv Detail & Related papers (2023-05-26T06:50:21Z) - Unsupervised Text Deidentification [101.2219634341714]
We propose an unsupervised deidentification method that masks words that leak personally-identifying information.
Motivated by K-anonymity based privacy, we generate redactions that ensure a minimum reidentification rank.
arXiv Detail & Related papers (2022-10-20T18:54:39Z) - Towards Privacy-Preserving Person Re-identification via Person Identify
Shift [19.212691296927165]
Person re-identification (ReID) requires preserving the privacy of pedestrian images used by ReID methods.
We propose a novel de-identification method designed explicitly for person ReID, named Person Identify Shift (PIS)
PIS shifts each pedestrian image from the current identity to another with a new identity, resulting in images still preserving the relative identities.
arXiv Detail & Related papers (2022-07-15T06:58:41Z) - Fairness for Text Classification Tasks with Identity Information Data
Augmentation Methods [2.5199066832791535]
Methods are entirely based on generating counterfactuals for the given training and test set instances.
We empirically show that the two-stage augmentation process leads to diverse identity pairs and an enhanced training set.
arXiv Detail & Related papers (2022-02-04T07:08:30Z) - Taking Modality-free Human Identification as Zero-shot Learning [46.51413603352702]
We develop a novel Modality-Free Human Identification (named MFHI) task as a generic zero-shot learning model in a scalable way.
It is capable of bridging the visual and semantic modalities by learning a discriminative prototype of each identity.
In addition, the semantics-guided spatial attention is enforced on visual modality to obtain representations with both high global category-level and local attribute-level discrimination.
arXiv Detail & Related papers (2020-10-02T13:08:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.