Anonymity at Risk? Assessing Re-Identification Capabilities of Large Language Models
- URL: http://arxiv.org/abs/2308.11103v2
- Date: Sun, 19 May 2024 09:25:45 GMT
- Title: Anonymity at Risk? Assessing Re-Identification Capabilities of Large Language Models
- Authors: Alex Nyffenegger, Matthias Stürmer, Joel Niklaus
- Abstract summary: In accordance with the Federal Supreme Court of Switzerland, we explore the potential of LLMs to re-identify individuals in court rulings.
We construct a proof-of-concept using actual legal data from the Swiss Federal Supreme Court.
We analyze the factors that influence successful re-identifications, identifying model size, input length, and instruction tuning among the most critical determinants.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Anonymity of both natural and legal persons in court rulings is a critical aspect of privacy protection in the European Union and Switzerland. With the advent of LLMs, concerns about large-scale re-identification of anonymized persons are growing. In accordance with the Federal Supreme Court of Switzerland, we explore the potential of LLMs to re-identify individuals in court rulings by constructing a proof-of-concept using actual legal data from the Swiss Federal Supreme Court. Following the initial experiment, we constructed an anonymized Wikipedia dataset as a more rigorous testing ground to further investigate the findings. With the introduction and application of the new task of re-identifying people in texts, we also introduce new metrics to measure performance. We systematically analyze the factors that influence successful re-identification, identifying model size, input length, and instruction tuning among the most critical determinants. Despite high re-identification rates on Wikipedia, even the best LLMs struggled with court decisions. We attribute this difficulty to the lack of test datasets, the substantial training resources required, and the sparsity of the information used for re-identification. In conclusion, this study demonstrates that re-identification using LLMs may not be feasible for now, but, as the proof-of-concept on Wikipedia showed, it might become possible in the future. We hope that our system can help enhance confidence in the security of anonymized decisions, making courts more willing to publish them.
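The abstract mentions new metrics for the re-identification task without defining them. As an illustrative sketch only (this is a hypothetical metric, not the authors' actual definition or code), a rank-based evaluation of whether an LLM's candidate guesses contain the true anonymized person could look like this:

```python
# Hypothetical sketch: top-k re-identification rate.
# Assumes each anonymized text comes with a ranked list of candidate
# names produced by an LLM; names and data below are made up.

def top_k_reidentification_rate(predictions, true_names, k=5):
    """Fraction of cases where the true name appears among the top-k guesses."""
    hits = sum(
        1 for ranked, truth in zip(predictions, true_names)
        if truth in ranked[:k]
    )
    return hits / len(true_names)

# Toy example: the model re-identifies the first person but not the second.
preds = [["A. Muster", "B. Beispiel"], ["C. Test", "A. Muster"]]
truth = ["A. Muster", "D. Unknown"]
rate = top_k_reidentification_rate(preds, truth, k=2)  # 1 of 2 hits -> 0.5
```

A rank-based score of this kind rewards near misses as well as exact top-1 hits, which is one plausible way to grade a re-identification attack; the paper's actual metrics may differ.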
Related papers
- Towards Privacy-Preserving Revocation of Verifiable Credentials with Time-Flexibility
Self-Sovereign Identity (SSI) is an emerging paradigm for authentication and credential presentation.
The EUDI Digital Identity wallet is about to become a concrete implementation of this paradigm.
We propose the basis of a novel method that customizes the use of anonymous hierarchical identity-based encryption.
arXiv Detail & Related papers (2025-03-27T21:58:32Z)
- AIDBench: A benchmark for evaluating the authorship identification capability of large language models
We focus on a specific privacy risk where large language models (LLMs) may help identify the authorship of anonymous texts.
We present AIDBench, a new benchmark that incorporates several author identification datasets, including emails, blogs, reviews, articles, and research papers.
Our experiments with AIDBench demonstrate that LLMs can correctly guess authorship at rates well above random chance, revealing new privacy risks posed by these powerful models.
arXiv Detail & Related papers (2024-11-20T11:41:08Z)
- Robust Utility-Preserving Text Anonymization Based on Large Language Models
Text anonymization is crucial for sharing sensitive data while maintaining privacy.
Existing techniques face the emerging challenges of re-identification attack ability of Large Language Models.
This paper proposes a framework composed of three LLM-based components -- a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z)
- CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models
We introduce CLAMBER, a benchmark for evaluating how large language models (LLMs) handle ambiguous information needs.
Building upon the taxonomy, we construct 12K high-quality data to assess the strengths, weaknesses, and potential risks of various off-the-shelf LLMs.
Our findings indicate the limited practical utility of current LLMs in identifying and clarifying ambiguous user queries.
arXiv Detail & Related papers (2024-05-20T14:34:01Z)
- Low-Resource Court Judgment Summarization for Common Law Systems
We present CLSum, the first dataset for summarizing multi-jurisdictional common law court judgment documents.
This is the first court judgment summarization work adopting large language models (LLMs) in data augmentation, summary generation, and evaluation.
arXiv Detail & Related papers (2024-03-07T12:47:42Z)
- Large Language Models are Advanced Anonymizers
We show how adversarial anonymization outperforms current industry-grade anonymizers in terms of the resulting utility and privacy.
We first present a new setting for evaluating anonymizations in the face of adversarial LLMs inferences.
arXiv Detail & Related papers (2024-02-21T14:44:00Z)
- Disentangle Before Anonymize: A Two-stage Framework for Attribute-preserved and Occlusion-robust De-identification
"Disentangle Before Anonymize" is a novel two-stage framework (DBAF).
This framework includes a Contrastive Identity Disentanglement (CID) module and a Key-authorized Reversible Identity Anonymization (KRIA) module.
Extensive experiments demonstrate that our method outperforms state-of-the-art de-identification approaches.
arXiv Detail & Related papers (2023-11-15T08:59:02Z)
- Interpretable Long-Form Legal Question Answering with Retrieval-Augmented Large Language Models
The Long-form Legal Question Answering (LLeQA) dataset comprises 1,868 expert-annotated legal questions in the French language.
Our experimental results demonstrate promising performance on automatic evaluation metrics.
As one of the only comprehensive, expert-annotated long-form LQA datasets, LLeQA has the potential not only to accelerate research towards resolving a significant real-world issue, but also to act as a rigorous benchmark for evaluating NLP models in specialized domains.
arXiv Detail & Related papers (2023-09-29T08:23:19Z)
- FedSOV: Federated Model Secure Ownership Verification with Unforgeable Signature
Federated learning allows multiple parties to collaborate in learning a global model without revealing private data.
We propose a cryptographic signature-based federated learning model ownership verification scheme named FedSOV.
arXiv Detail & Related papers (2023-05-10T12:10:02Z)
- Unsupervised Text Deidentification
We propose an unsupervised deidentification method that masks words that leak personally-identifying information.
Motivated by K-anonymity based privacy, we generate redactions that ensure a minimum reidentification rank.
arXiv Detail & Related papers (2022-10-20T18:54:39Z)
- Reinforcement Learning on Encrypted Data
We present a preliminary, experimental study of how a DQN agent trained on encrypted states performs in environments with discrete and continuous state spaces.
Our results highlight that the agent is still capable of learning in small state spaces even in presence of non-deterministic encryption, but performance collapses in more complex environments.
arXiv Detail & Related papers (2021-09-16T21:59:37Z)
- Unsupervised Person Re-Identification: A Systematic Survey of Challenges and Solutions
Unsupervised person Re-ID has drawn increasing attention for its potential to address the scalability issue in person Re-ID.
Unsupervised person Re-ID is challenging primarily due to lacking identity labels to supervise person feature learning.
This survey reviews recent works on unsupervised person Re-ID from the perspective of challenges and solutions.
arXiv Detail & Related papers (2021-09-01T00:01:35Z)
- How important are faces for person re-identification?
We apply a face detection and blurring algorithm to create anonymized versions of several popular person re-identification datasets.
We evaluate the effect of this anonymization on re-identification performance using standard metrics.
arXiv Detail & Related papers (2020-10-13T11:47:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.