Confidential Guardian: Cryptographically Prohibiting the Abuse of Model Abstention
- URL: http://arxiv.org/abs/2505.23968v1
- Date: Thu, 29 May 2025 19:47:50 GMT
- Title: Confidential Guardian: Cryptographically Prohibiting the Abuse of Model Abstention
- Authors: Stephan Rabanser, Ali Shahin Shamsabadi, Olive Franzese, Xiao Wang, Adrian Weller, Nicolas Papernot
- Abstract summary: A dishonest institution can exploit abstention mechanisms to discriminate or unjustly deny services under the guise of uncertainty. We demonstrate the practicality of this threat by introducing an uncertainty-inducing attack called Mirage. We propose Confidential Guardian, a framework that analyzes calibration metrics on a reference dataset to detect artificially suppressed confidence.
- Score: 65.47632669243657
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cautious predictions -- where a machine learning model abstains when uncertain -- are crucial for limiting harmful errors in safety-critical applications. In this work, we identify a novel threat: a dishonest institution can exploit these mechanisms to discriminate or unjustly deny services under the guise of uncertainty. We demonstrate the practicality of this threat by introducing an uncertainty-inducing attack called Mirage, which deliberately reduces confidence in targeted input regions, thereby covertly disadvantaging specific individuals. At the same time, Mirage maintains high predictive performance across all data points. To counter this threat, we propose Confidential Guardian, a framework that analyzes calibration metrics on a reference dataset to detect artificially suppressed confidence. Additionally, it employs zero-knowledge proofs of verified inference to ensure that reported confidence scores genuinely originate from the deployed model. This prevents the provider from fabricating arbitrary model confidence values while protecting the model's proprietary details. Our results confirm that Confidential Guardian effectively prevents the misuse of cautious predictions, providing verifiable assurances that abstention reflects genuine model uncertainty rather than malicious intent.
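The calibration check described in the abstract can be made concrete with a short sketch. The code below is an illustrative reading of the idea, not the paper's implementation: it bins reference-set predictions by reported confidence, computes expected calibration error (ECE), and flags bins where empirical accuracy far exceeds the reported confidence, which is the signature of artificially suppressed scores. The function name, the `gap_threshold`, the binning scheme, and the `ref_*` variables in the usage note are assumptions made for illustration.

```python
import numpy as np

def flag_suppressed_confidence(confidences, correct, n_bins=10, gap_threshold=0.15):
    """Illustrative calibration check on a reference dataset: bin predictions by
    reported confidence, accumulate expected calibration error (ECE), and flag
    bins whose empirical accuracy exceeds the mean reported confidence by more
    than `gap_threshold`. Large underconfidence gaps are consistent with
    artificially suppressed scores; calibrated models keep the gap near zero."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece, suspicious = 0.0, []
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        mean_conf = confidences[mask].mean()
        accuracy = correct[mask].mean()
        ece += mask.mean() * abs(accuracy - mean_conf)
        if accuracy - mean_conf > gap_threshold:   # accuracy far above confidence
            suspicious.append((b / n_bins, (b + 1) / n_bins, mean_conf, accuracy))
    return ece, suspicious

# Usage on a hypothetical reference set:
#   ece, flags = flag_suppressed_confidence(ref_conf, ref_pred == ref_label)
#   A non-empty `flags` list marks confidence regions well below observed accuracy.
```

A plain script like this cannot prove that the audited confidences came from the deployed model; that is the role of the zero-knowledge proofs of verified inference mentioned in the abstract.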
Related papers
- Confidence Aware Learning for Reliable Face Anti-spoofing [52.23271636362843]
We propose a Confidence Aware Face Anti-spoofing (CA-FAS) model, which is aware of its capability boundary and estimates its confidence when predicting each sample. Experiments show that the proposed CA-FAS can effectively recognize samples with low prediction confidence.
arXiv Detail & Related papers (2024-11-02T14:29:02Z)
- On the Robustness of Adversarial Training Against Uncertainty Attacks [9.180552487186485]
In learning problems, the noise inherent to the task at hand makes it impossible to infer without some degree of uncertainty. In this work, we reveal both empirically and theoretically that defending against adversarial examples, i.e., carefully perturbed samples that cause misclassification, also guarantees a more secure, trustworthy uncertainty estimate. To support our claims, we evaluate multiple adversarially robust models from the publicly available benchmark RobustBench on the CIFAR-10 and ImageNet datasets.
arXiv Detail & Related papers (2024-10-29T11:12:44Z)
- Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users. We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions. We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
arXiv Detail & Related papers (2024-09-26T21:00:45Z)
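The definition of true criticality in the entry above lends itself to a direct Monte Carlo estimate. The sketch below is an illustrative reading of that definition, not the paper's implementation; the environment interface (`n_actions`, `step(action) -> (obs, reward, done)`), the `policy` callable, and the rollout counts are all assumptions.

```python
import copy
import random

def true_criticality(env, policy, obs0, n_random=5, n_rollouts=100, horizon=200):
    """Monte Carlo estimate of 'true criticality' at the current state:
    expected return when following `policy`, minus expected return when the
    agent first takes `n_random` uniformly random actions and then resumes
    the policy. The environment is assumed to be deepcopy-able, to expose a
    discrete action count `env.n_actions`, and to implement
    `step(action) -> (obs, reward, done)` (a hypothetical interface)."""
    def rollout(deviate):
        e, obs, total = copy.deepcopy(env), obs0, 0.0   # same start state each time
        for t in range(horizon):
            if deviate and t < n_random:
                action = random.randrange(e.n_actions)  # forced random deviation
            else:
                action = policy(obs)                    # follow the learned policy
            obs, reward, done = e.step(action)
            total += reward
            if done:
                break
        return total

    on_policy = sum(rollout(False) for _ in range(n_rollouts)) / n_rollouts
    deviated = sum(rollout(True) for _ in range(n_rollouts)) / n_rollouts
    return on_policy - deviated   # expected reward drop = true criticality
```

The proxy criticality mentioned in the summary would replace these expensive rollouts with a cheap statistic that is monotonically related to this quantity.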
- Jailbreaking as a Reward Misspecification Problem [80.52431374743998]
We propose a novel perspective that attributes the jailbreak vulnerability of aligned language models to reward misspecification during the alignment process. We introduce a metric, ReGap, to quantify the extent of reward misspecification and demonstrate its effectiveness. We present ReMiss, a system for automated red teaming that generates adversarial prompts in a reward-misspecified space.
arXiv Detail & Related papers (2024-06-20T15:12:27Z)
- On the Impact of Uncertainty and Calibration on Likelihood-Ratio Membership Inference Attacks [42.18575921329484]
We analyze the performance of the likelihood ratio attack (LiRA) within an information-theoretical framework. We derive bounds on the advantage of an MIA adversary with the aim of offering insights into the impact of uncertainty and calibration on the effectiveness of MIAs.
arXiv Detail & Related papers (2024-02-16T13:41:18Z)
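For readers unfamiliar with LiRA, the score it assigns to a candidate example can be sketched as a likelihood ratio between per-example loss distributions estimated from shadow models trained with and without that example. The snippet below is a simplified illustration under a Gaussian assumption, not the exact attack analyzed in the paper above; the function name and loss arrays are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def lira_score(target_loss, losses_in, losses_out, eps=1e-6):
    """Simplified LiRA-style membership score: log-likelihood ratio of the
    target example's loss under Gaussians fit to shadow-model losses computed
    WITH the example in training (losses_in) vs WITHOUT it (losses_out).
    Higher scores suggest membership."""
    mu_in, sigma_in = np.mean(losses_in), np.std(losses_in) + eps
    mu_out, sigma_out = np.mean(losses_out), np.std(losses_out) + eps
    return norm.logpdf(target_loss, mu_in, sigma_in) - norm.logpdf(target_loss, mu_out, sigma_out)

# Example: a score above a calibrated threshold predicts "member".
# print(lira_score(0.12, losses_in=[0.10, 0.15, 0.08], losses_out=[0.9, 1.2, 0.7]))
```

Model uncertainty and calibration determine how separable the two loss distributions are, which is the kind of adversary advantage the bounds in that paper characterize.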
- SureFED: Robust Federated Learning via Uncertainty-Aware Inward and Outward Inspection [29.491675102478798]
We introduce SureFED, a novel framework for robust federated learning.
SureFED establishes trust using the local information of benign clients.
We theoretically prove the robustness of our algorithm against data and model poisoning attacks.
arXiv Detail & Related papers (2023-08-04T23:51:05Z)
- Overconfidence is a Dangerous Thing: Mitigating Membership Inference Attacks by Enforcing Less Confident Prediction [2.2336243882030025]
Machine learning models are vulnerable to membership inference attacks (MIAs).
This work proposes a defense technique, HAMP, that can achieve both strong membership privacy and high accuracy, without requiring extra data.
arXiv Detail & Related papers (2023-07-04T09:50:33Z)
- Confidence-Calibrated Face and Kinship Verification [8.570969129199467]
We introduce an effective confidence measure that allows verification models to convert a similarity score into a confidence score for any given face pair.
We also propose a confidence-calibrated approach, termed Angular Scaling (ASC), which is easy to implement and can be readily applied to existing verification models.
To the best of our knowledge, our work presents the first comprehensive confidence-calibrated solution for modern face and kinship verification tasks.
arXiv Detail & Related papers (2022-10-25T10:43:46Z)
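The conversion of a raw similarity score into a calibrated confidence, as described in the entry above, can be sketched generically. The code below is not the paper's Angular Scaling method; it is an assumed Platt-style calibrator fit on the angle between two face embeddings, included only to make the "similarity score to confidence score" conversion concrete.

```python
import numpy as np
from scipy.optimize import minimize

def fit_angle_calibrator(angles, labels):
    """Fit a logistic calibrator p(match) = sigmoid(a * angle + b) on a held-out
    set of embedding angles (in radians) and match labels (1 = same identity).
    A generic Platt-style stand-in, not the paper's ASC formulation."""
    angles, labels = np.asarray(angles, dtype=float), np.asarray(labels, dtype=float)

    def nll(params):
        a, b = params
        p = 1.0 / (1.0 + np.exp(-(a * angles + b)))
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

    a, b = minimize(nll, x0=[-1.0, 0.0]).x
    return lambda angle: 1.0 / (1.0 + np.exp(-(a * angle + b)))

# Usage sketch: angle = arccos(cosine_similarity(emb1, emb2)); conf = calibrator(angle)
```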
- Learning Uncertainty For Safety-Oriented Semantic Segmentation In Autonomous Driving [77.39239190539871]
We show how uncertainty estimation can be leveraged to enable safety-critical image segmentation in autonomous driving.
We introduce a new uncertainty measure based on disagreeing predictions as measured by a dissimilarity function.
We show experimentally that our proposed approach is much less computationally intensive at inference time than competing methods.
arXiv Detail & Related papers (2021-05-28T09:23:05Z)
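The disagreement-based uncertainty measure in the entry above can be pictured with a small sketch. The code below is a generic illustration, not the paper's specific dissimilarity function: it scores each pixel by how much two predicted class-probability maps disagree, using total-variation distance as the (assumed) dissimilarity.

```python
import numpy as np

def disagreement_uncertainty(probs_a, probs_b):
    """Per-pixel uncertainty from two segmentation predictions.
    probs_a, probs_b: arrays of shape (H, W, C) with class probabilities.
    Returns an (H, W) map; larger values mean stronger disagreement, here
    measured by total-variation distance (an assumed choice of dissimilarity)."""
    return 0.5 * np.abs(np.asarray(probs_a) - np.asarray(probs_b)).sum(axis=-1)

# Pixels whose disagreement exceeds a threshold could be flagged as unreliable,
# e.g. handed to a fallback system in an autonomous-driving stack.
```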
- Trust but Verify: Assigning Prediction Credibility by Counterfactual Constrained Learning [123.3472310767721]
Prediction credibility measures are fundamental in statistics and machine learning.
These measures should account for the wide variety of models used in practice.
The framework developed in this work expresses the credibility as a risk-fit trade-off.
arXiv Detail & Related papers (2020-11-24T19:52:38Z)