Aurora: Are Android Malware Classifiers Reliable and Stable under Distribution Shift?
- URL: http://arxiv.org/abs/2505.22843v2
- Date: Wed, 25 Jun 2025 09:30:26 GMT
- Title: Aurora: Are Android Malware Classifiers Reliable and Stable under Distribution Shift?
- Authors: Alexander Herzog, Aliai Eusebi, Lorenzo Cavallaro
- Abstract summary: AURORA is a framework to evaluate malware classifiers based on their confidence quality and operational resilience. AURORA is complemented by a set of metrics designed to go beyond point-in-time performance. The fragility in SOTA frameworks across datasets of varying drift suggests the need for a return to the whiteboard.
- Score: 51.12297424766236
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The performance figures of modern drift-adaptive malware classifiers appear promising, but does this translate to genuine operational reliability? The standard evaluation paradigm primarily focuses on baseline performance metrics, neglecting confidence-error alignment and operational stability. While TESSERACT established the importance of temporal evaluation, we take a complementary direction by investigating whether malware classifiers maintain reliable and stable confidence estimates under distribution shifts and exploring the tensions between scientific advancement and practical impacts when they do not. We propose AURORA, a framework to evaluate malware classifiers based on their confidence quality and operational resilience. AURORA subjects the confidence profile of a given model to verification to assess the reliability of its estimates. Unreliable confidence estimates erode operational trust, waste valuable annotation budget on non-informative samples for active learning, and leave error-prone instances undetected in selective classification. AURORA is complemented by a set of metrics designed to go beyond point-in-time performance, striving towards a more holistic assessment of operational stability throughout temporal evaluation periods. The fragility in SOTA frameworks across datasets of varying drift suggests the need for a return to the whiteboard.
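The abstract names the failure modes AURORA probes (confidence-error misalignment, wasted annotation budget in active learning, undetected errors in selective classification) but not its exact verification procedure. As a rough, hypothetical sketch of the kind of confidence-quality checks involved, the Python snippet below computes expected calibration error and a selective-classification risk-coverage curve; all function names and the toy data are illustrative assumptions, not AURORA's API.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Average |accuracy - confidence| gap over equal-width confidence bins,
    weighted by the fraction of samples falling in each bin."""
    ece = 0.0
    for lo in np.linspace(0.0, 1.0, n_bins, endpoint=False):
        mask = (conf > lo) & (conf <= lo + 1.0 / n_bins)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

def risk_coverage_curve(conf, correct):
    """Selective classification: error rate among retained samples as the
    rejection threshold sweeps from 'keep everything' down to 'keep only
    the most confident prediction'."""
    order = np.argsort(-conf)                    # most confident first
    errors = 1.0 - correct[order]
    n = len(conf)
    coverage = np.arange(1, n + 1) / n
    risk = np.cumsum(errors) / np.arange(1, n + 1)
    return coverage, risk

# Toy check: an overconfident classifier (confidence ~ U[0.5, 1.0],
# true accuracy 0.8, confidence independent of correctness).
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)
correct = (rng.uniform(size=1000) < 0.8).astype(float)
print(f"ECE: {expected_calibration_error(conf, correct):.3f}")
coverage, risk = risk_coverage_curve(conf, correct)
print(f"risk at ~50% coverage: {risk[499]:.3f}")  # no drop: confidence is uninformative
```

On a well-calibrated model the ECE stays near zero and the risk falls as coverage shrinks; a model that fails these checks in the sense described above keeps its riskiest predictions even at low coverage.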
Related papers
- Evaluating the Evaluators: Trust in Adversarial Robustness Tests [17.06660302788049]
AttackBench is an evaluation tool that ranks existing attack implementations based on a novel optimality metric. The framework enforces consistent testing conditions and enables continuous updates, making it a reliable foundation for robustness verification.
arXiv Detail & Related papers (2025-07-04T10:07:26Z) - Temporalizing Confidence: Evaluation of Chain-of-Thought Reasoning with Signal Temporal Logic [0.12499537119440243]
We propose a structured framework that models stepwise confidence as a temporal signal and evaluates it using Signal Temporal Logic (STL). In particular, we define formal STL-based constraints to capture desirable temporal properties and compute scores that serve as structured, interpretable confidence estimates. Our approach consistently improves calibration metrics and provides more reliable uncertainty estimates than conventional confidence aggregation and post-hoc calibration.
arXiv Detail & Related papers (2025-06-09T21:21:12Z) - Active Test-time Vision-Language Navigation [60.69722522420299]
ATENA is a test-time active learning framework that enables practical human-robot interaction via episodic feedback on uncertain navigation outcomes. In particular, ATENA learns to increase certainty in successful episodes and decrease it in failed ones, improving uncertainty calibration. In addition, we propose a self-active learning strategy that enables an agent to evaluate its navigation outcomes based on confident predictions.
arXiv Detail & Related papers (2025-06-07T02:24:44Z) - Rethinking Semi-supervised Segmentation Beyond Accuracy: Reliability and Robustness [10.220692937750295]
Reliable Score (RSS) is a novel metric that combines predictive accuracy, calibration, and uncertainty quality measures via a harmonic mean (a minimal sketch appears after this list). We advocate for a shift in evaluation protocols toward more holistic metrics like RSS to better align semi-supervised learning research with real-world deployment needs.
arXiv Detail & Related papers (2025-06-06T09:37:45Z) - Confidential Guardian: Cryptographically Prohibiting the Abuse of Model Abstention [65.47632669243657]
A dishonest institution can exploit abstention mechanisms to discriminate or unjustly deny services under the guise of uncertainty. We demonstrate the practicality of this threat by introducing an uncertainty-inducing attack called Mirage. We propose Confidential Guardian, a framework that analyzes calibration metrics on a reference dataset to detect artificially suppressed confidence.
arXiv Detail & Related papers (2025-05-29T19:47:50Z) - MCQA-Eval: Efficient Confidence Evaluation in NLG with Gold-Standard Correctness Labels [16.300463494913593]
Large Language Models (LLMs) require robust confidence estimation. MCQA-Eval is an evaluation framework for assessing confidence measures in Natural Language Generation.
arXiv Detail & Related papers (2025-02-20T05:09:29Z) - ReliOcc: Towards Reliable Semantic Occupancy Prediction via Uncertainty Learning [26.369237406972577]
Vision-centric semantic occupancy prediction plays a crucial role in autonomous driving.
There is still little research effort exploring the reliability of predicting semantic occupancy from cameras.
We propose ReliOcc, a method designed to enhance the reliability of camera-based occupancy networks.
arXiv Detail & Related papers (2024-09-26T16:33:16Z) - Trustworthiness for an Ultra-Wideband Localization Service [2.4979362117484714]
This paper proposes a holistic trustworthiness assessment framework for ultra-wideband self-localization.
Our goal is to provide guidance for evaluating a system's trustworthiness based on objective evidence.
Our approach guarantees that the resulting trustworthiness indicators correspond to chosen real-world threats.
arXiv Detail & Related papers (2024-08-10T11:57:10Z) - Revisiting Confidence Estimation: Towards Reliable Failure Prediction [53.79160907725975]
We identify a general, widespread, but largely neglected phenomenon: most confidence estimation methods are harmful for detecting misclassification errors.
We propose to enlarge the confidence gap by finding flat minima, which yields state-of-the-art failure prediction performance.
arXiv Detail & Related papers (2024-03-05T11:44:14Z) - SureFED: Robust Federated Learning via Uncertainty-Aware Inward and
Outward Inspection [29.491675102478798]
We introduce SureFED, a novel framework for robust federated learning.
SureFED establishes trust using the local information of benign clients.
We theoretically prove the robustness of our algorithm against data and model poisoning attacks.
arXiv Detail & Related papers (2023-08-04T23:51:05Z) - TrustGuard: GNN-based Robust and Explainable Trust Evaluation with
Dynamicity Support [59.41529066449414]
We propose TrustGuard, a GNN-based accurate trust evaluation model that supports trust dynamicity.
TrustGuard is designed with a layered architecture that contains a snapshot input layer, a spatial aggregation layer, a temporal aggregation layer, and a prediction layer.
Experiments show that TrustGuard outperforms state-of-the-art GNN-based trust evaluation models with respect to trust prediction in both single-timeslot and multi-timeslot settings.
arXiv Detail & Related papers (2023-06-23T07:39:12Z) - Trust, but Verify: Using Self-Supervised Probing to Improve
Trustworthiness [29.320691367586004]
We introduce a new approach of self-supervised probing, which enables us to check and mitigate the overconfidence issue for a trained model.
We provide a simple yet effective framework, which can be flexibly applied to existing trustworthiness-related methods in a plug-and-play manner.
arXiv Detail & Related papers (2023-02-06T08:57:20Z) - RobustBench: a standardized adversarial robustness benchmark [84.50044645539305]
A key challenge in benchmarking robustness is that its evaluation is often error-prone, leading to robustness overestimation.
We evaluate adversarial robustness with AutoAttack, an ensemble of white- and black-box attacks.
We analyze the impact of robustness on the performance on distribution shifts, calibration, out-of-distribution detection, fairness, privacy leakage, smoothness, and transferability.
arXiv Detail & Related papers (2020-10-19T17:06:18Z) - Adversarial Robustness on In- and Out-Distribution Improves
Explainability [109.68938066821246]
RATIO is a training procedure for robustness via Adversarial Training on In- and Out-distribution.
RATIO achieves state-of-the-art $l_2$-adversarial robustness on CIFAR10 and maintains better clean accuracy.
arXiv Detail & Related papers (2020-03-20T18:57:52Z)
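As promised in the Reliable Score (RSS) entry above, here is a minimal sketch of a harmonic-mean aggregate in that spirit. The choice of components (accuracy, 1 - ECE as a calibration term, and a generic uncertainty-quality score) is an assumption made for illustration, not the paper's exact formulation.

```python
from statistics import harmonic_mean

def reliable_score(accuracy: float, calibration: float, uncertainty_quality: float) -> float:
    """Harmonic mean of three reliability components: a weak link in any
    one dimension drags the aggregate down, unlike an arithmetic mean."""
    return harmonic_mean([accuracy, calibration, uncertainty_quality])

# A model that is accurate but poorly calibrated still scores low overall:
print(round(reliable_score(0.92, 0.40, 0.75), 2))  # 0.61
```

An arithmetic mean of the same numbers would be 0.69, masking the calibration failure; the harmonic mean is what makes RSS-style metrics sensitive to the weakest dimension.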