Related papers: PII-Bench: Evaluating Query-Aware Privacy Protection Systems

PII-Bench: Evaluating Query-Aware Privacy Protection Systems

URL: http://arxiv.org/abs/2502.18545v1
Date: Tue, 25 Feb 2025 14:49:08 GMT
Title: PII-Bench: Evaluating Query-Aware Privacy Protection Systems
Authors: Hao Shen, Zhouhong Gu, Haokai Hong, Weili Han,
Abstract summary: We propose a query-unrelated PII masking strategy and introduce PII-Bench, the first comprehensive evaluation framework for assessing privacy protection systems.<n>PII-Bench comprises 2,842 test samples across 55 fine-grained PII categories, featuring diverse scenarios from single-subject descriptions to complex multi-party interactions.
Score: 10.52362814808073
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The widespread adoption of Large Language Models (LLMs) has raised significant privacy concerns regarding the exposure of personally identifiable information (PII) in user prompts. To address this challenge, we propose a query-unrelated PII masking strategy and introduce PII-Bench, the first comprehensive evaluation framework for assessing privacy protection systems. PII-Bench comprises 2,842 test samples across 55 fine-grained PII categories, featuring diverse scenarios from single-subject descriptions to complex multi-party interactions. Each sample is carefully crafted with a user query, context description, and standard answer indicating query-relevant PII. Our empirical evaluation reveals that while current models perform adequately in basic PII detection, they show significant limitations in determining PII query relevance. Even state-of-the-art LLMs struggle with this task, particularly in handling complex multi-subject scenarios, indicating substantial room for improvement in achieving intelligent PII masking.

Related papers

MAGPIE: A dataset for Multi-AGent contextual PrIvacy Evaluation [54.410825977390274]
Existing benchmarks to evaluate contextual privacy in LLM-agents primarily assess single-turn, low-complexity tasks.<n>We first present a benchmark - MAGPIE comprising 158 real-life high-stakes scenarios across 15 domains.<n>We then evaluate the current state-of-the-art LLMs on their understanding of contextually private data and their ability to collaborate without violating user privacy.
arXiv Detail & Related papers (2025-06-25T18:04:25Z)
PIIvot: A Lightweight NLP Anonymization Framework for Question-Anchored Tutoring Dialogues [5.264430938065097]
PIIvot is a framework for PII anonymization that leverages knowledge of the data context to simplify the PII detection problem.<n>We also contribute QATD-2k, the largest open-source real-world tutoring dataset of its kind, to support the demand for quality educational dialogue data.
arXiv Detail & Related papers (2025-05-22T17:22:28Z)
P2NIA: Privacy-Preserving Non-Iterative Auditing [5.619344845505019]
The emergence of AI legislation has increased the need to assess the ethical compliance of high-risk AI systems. Traditional auditing methods rely on platforms' application programming interfaces (APIs) We present P2NIA, a novel auditing scheme that proposes a mutually beneficial collaboration for both the auditor and the platform.
arXiv Detail & Related papers (2025-04-01T15:04:58Z)
ACEBench: Who Wins the Match Point in Tool Usage? [68.54159348899891]
ACEBench is a comprehensive benchmark for assessing tool usage in Large Language Models (LLMs)<n>It categorizes data into three primary types based on evaluation methodology: Normal, Special, and Agent.<n>It provides a more granular examination of error causes across different data types.
arXiv Detail & Related papers (2025-01-22T12:59:08Z)
A Tale of Two Imperatives: Privacy and Explainability [0.0]
Deep learning's preponderance across scientific domains has reshaped high-stakes decision-making.<n>This paper examines the complexities of combining Right-to-Privacy (RTP) and Right-to-Explanation (RTE)
arXiv Detail & Related papers (2024-12-30T08:43:28Z)
A Review of Bayesian Uncertainty Quantification in Deep Probabilistic Image Segmentation [0.0]
Advancements in image segmentation play an integral role within the broad scope of Deep Learning-based Computer Vision.<n>Uncertainty quantification has been extensively studied within this context, enabling the expression of model ignorance (epistemic uncertainty) or data ambiguity (aleatoric uncertainty) to prevent uninformed decision-making.
arXiv Detail & Related papers (2024-11-25T13:26:09Z)
PII-Scope: A Benchmark for Training Data PII Leakage Assessment in LLMs [8.98944128441731]
We introduce PII-Scope, a comprehensive benchmark designed to evaluate state-of-the-art methodologies for PII extraction attacks targeting LLMs. We extend our study to more realistic attack scenarios, exploring PII attacks that employ advanced adversarial strategies. We show that with sophisticated adversarial capabilities and a limited query budget, PII extraction rates can increase by up to fivefold when targeting the pretrained model.
arXiv Detail & Related papers (2024-10-09T09:16:25Z)
Comparing Feature-based and Context-aware Approaches to PII Generalization Level Prediction [0.6138671548064356]
PII in text data is crucial for privacy, but current generalization methods face challenges such as uneven data distributions and limited context awareness. We propose two approaches: a feature-based method using machine learning to improve performance on structured inputs, and a novel context-aware framework that considers the broader context and semantic relationships between the original text and generalized candidates. Experiments on the WikiReplace dataset demonstrate the effectiveness of both methods, with the context-aware approach outperforming the feature-based one across different scales.
arXiv Detail & Related papers (2024-07-03T06:32:03Z)
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
Query Performance Prediction using Relevance Judgments Generated by Large Language Models [53.97064615557883]
We propose a QPP framework using automatically generated relevance judgments (QPP-GenRE) QPP-GenRE decomposes QPP into independent subtasks of predicting relevance of each item in a ranked list to a given query. This allows us to predict any IR evaluation measure using the generated relevance judgments as pseudo-labels.
arXiv Detail & Related papers (2024-04-01T09:33:05Z)
Improving Query-Focused Meeting Summarization with Query-Relevant Knowledge [71.14873115781366]
We propose a knowledge-enhanced two-stage framework called Knowledge-Aware Summarizer (KAS) to tackle the challenges. In the first stage, we introduce knowledge-aware scores to improve the query-relevant segment extraction. In the second stage, we incorporate query-relevant knowledge in the summary generation.
arXiv Detail & Related papers (2023-09-05T10:26:02Z)
ProPILE: Probing Privacy Leakage in Large Language Models [38.92840523665835]
Large language models (LLMs) are often trained on vast quantities of web-collected data, which may inadvertently include sensitive personal data. This paper presents ProPILE, a novel probing tool designed to empower data subjects, or the owners of the PII, with awareness of potential PII leakage.
arXiv Detail & Related papers (2023-07-04T18:53:47Z)
Dual Semantic Knowledge Composed Multimodal Dialog Systems [114.52730430047589]
We propose a novel multimodal task-oriented dialog system named MDS-S2. It acquires the context related attribute and relation knowledge from the knowledge base. We also devise a set of latent query variables to distill the semantic information from the composed response representation.
arXiv Detail & Related papers (2023-05-17T06:33:26Z)
Wild Face Anti-Spoofing Challenge 2023: Benchmark and Results [73.98594459933008]
Face anti-spoofing (FAS) is an essential mechanism for safeguarding the integrity of automated face recognition systems. This limitation can be attributed to the scarcity and lack of diversity in publicly available FAS datasets. We introduce the Wild Face Anti-Spoofing dataset, a large-scale, diverse FAS dataset collected in unconstrained settings.
arXiv Detail & Related papers (2023-04-12T10:29:42Z)
Quantifying & Modeling Multimodal Interactions: An Information Decomposition Framework [89.8609061423685]
We propose an information-theoretic approach to quantify the degree of redundancy, uniqueness, and synergy relating input modalities with an output task. To validate PID estimation, we conduct extensive experiments on both synthetic datasets where the PID is known and on large-scale multimodal benchmarks. We demonstrate their usefulness in (1) quantifying interactions within multimodal datasets, (2) quantifying interactions captured by multimodal models, (3) principled approaches for model selection, and (4) three real-world case studies.
arXiv Detail & Related papers (2023-02-23T18:59:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.