A Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attributions
- URL: http://arxiv.org/abs/2405.02344v2
- Date: Wed, 01 Oct 2025 00:51:33 GMT
- Title: A Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attributions
- Authors: Peiyu Yang, Naveed Akhtar, Jiantong Jiang, Ajmal Mian
- Abstract summary: We first identify a set of fidelity criteria that reliable benchmarks for attribution methods are expected to fulfill. We then introduce a Backdoor-based eXplainable AI benchmark (BackX) that adheres to the desired fidelity criteria. Our analysis also offers insights into defending against neural Trojans by utilizing the attributions.
- Score: 60.06461883533697
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attribution methods compute importance scores for input features to explain model predictions. However, assessing the faithfulness of these methods remains challenging due to the absence of attribution ground truth to model predictions. In this work, we first identify a set of fidelity criteria that reliable benchmarks for attribution methods are expected to fulfill, thereby facilitating a systematic assessment of attribution benchmarks. Next, we introduce a Backdoor-based eXplainable AI benchmark (BackX) that adheres to the desired fidelity criteria. We theoretically establish the superiority of our approach over the existing benchmarks for well-founded attribution evaluation. With extensive analysis, we further establish a standardized evaluation setup that mitigates confounding factors such as post-processing techniques and explained predictions, thereby ensuring a fair and consistent benchmarking. This setup is ultimately employed for a comprehensive comparison of existing methods using BackX. Finally, our analysis also offers insights into defending against neural Trojans by utilizing the attributions.
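The core idea of a backdoor-based benchmark can be sketched in a few lines: once a trigger patch is implanted and the model provably relies on it, the trigger region acts as a proxy ground truth, and an attribution map can be scored by how much of its top-k mass falls inside that region. The function and metric below are illustrative assumptions, not the paper's actual protocol:

```python
import numpy as np

def trigger_overlap_score(attribution, trigger_mask, k=None):
    """Fraction of the top-k attributed pixels that fall inside the
    backdoor trigger region (the proxy ground truth)."""
    attribution = np.abs(attribution).ravel()
    trigger_mask = trigger_mask.ravel().astype(bool)
    if k is None:
        k = int(trigger_mask.sum())          # default k: size of the trigger region
    top_k = np.argsort(attribution)[-k:]     # indices of the k largest scores
    return trigger_mask[top_k].mean()

# Toy example: a 4x4 image with a 2x2 trigger patch in the corner.
mask = np.zeros((4, 4)); mask[:2, :2] = 1
attr = np.random.default_rng(0).random((4, 4))
attr[:2, :2] += 10                           # attribution concentrated on the trigger
print(trigger_overlap_score(attr, mask))     # 1.0 here: perfect localisation
```

A faithful attribution method should score near 1.0 on backdoored inputs, since the trigger is, by construction, the feature driving the prediction.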
Related papers
- IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation [85.56193980646981]
We propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses. Experiments on IF-RewardBench reveal significant deficiencies in current judge models.
arXiv Detail & Related papers (2026-03-05T02:21:17Z) - STAR: Bridging Statistical and Agentic Reasoning for Large Model Performance Prediction [78.0692157478247]
We propose STAR, a framework that bridges data-driven STatistical expectations with knowledge-driven Agentic Reasoning. We show that STAR consistently outperforms all baselines on both score-based and rank-based metrics.
arXiv Detail & Related papers (2026-02-12T16:30:07Z) - Uncovering Competency Gaps in Large Language Models and Their Benchmarks [11.572508874955659]
We propose a new method that uses sparse autoencoders (SAEs) to automatically uncover both types of gaps. We found that models consistently underperformed on concepts that stand in contrast to sycophantic behaviors. Our method offers a representation-grounded approach to evaluation, enabling concept-level decomposition of benchmark scores.
arXiv Detail & Related papers (2025-12-06T17:39:47Z) - How NOT to benchmark your SITE metric: Beyond Static Leaderboards and Towards Realistic Evaluation [11.33816414982401]
Transferability estimation metrics are used to find a high-performing pre-trained model for a given target task. Despite the growing interest in developing such metrics, the benchmarks used to measure their progress have gone largely unexamined. We argue that the benchmarks on which these metrics are evaluated are fundamentally flawed.
arXiv Detail & Related papers (2025-10-07T20:38:12Z) - Aligning the Evaluation of Probabilistic Predictions with Downstream Value [2.6636053598505307]
Metrics based solely on predictive performance often diverge from measures of real-world downstream impact. We propose a data-driven method to learn a proxy evaluation function aligned with the downstream evaluation. Our approach leverages weighted scoring rules parametrized by a neural network, where the weighting is learned to align with performance in the downstream task.
arXiv Detail & Related papers (2025-08-25T17:41:27Z) - Are Bias Evaluation Methods Biased? [3.9748528039819977]
The creation of benchmarks to evaluate the safety of Large Language Models is one of the key activities within the trusted AI community. We investigate how robust such benchmarks are by using different approaches to rank a set of representative models for bias and comparing how similar the overall rankings are.
arXiv Detail & Related papers (2025-06-20T16:11:25Z) - RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning [64.46921169261852]
RAG-Zeval is a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task. Our approach trains evaluators with reinforcement learning, facilitating compact models to generate comprehensive and sound assessments. Experiments demonstrate RAG-Zeval's superior performance, achieving the strongest correlation with human judgments.
arXiv Detail & Related papers (2025-05-28T14:55:33Z) - From Rankings to Insights: Evaluation Should Shift Focus from Leaderboard to Feedback [36.68929551237421]
We introduce Feedbacker, an evaluation framework that provides comprehensive and fine-grained results. Our project homepage and dataset are available at https://liudan193.io/Feedbacker.
arXiv Detail & Related papers (2025-05-10T16:52:40Z) - Confidence in Large Language Model Evaluation: A Bayesian Approach to Limited-Sample Challenges [13.526258635654882]
This study introduces a Bayesian approach for large language models (LLMs) capability assessment.
We treat model capabilities as latent variables and leverage a curated query set to induce discriminative responses.
Experimental evaluations with GPT-series models demonstrate that the proposed method achieves superior discrimination compared to conventional evaluation methods.
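In the limited-sample regime such a Bayesian treatment is often sketched with a conjugate Beta-Binomial update: treat the model's per-query success probability as a latent variable, place a Beta prior on it, and update from the small observed sample. This is a minimal illustrative sketch, not the paper's actual model:

```python
# Hypothetical sketch: a model's per-query success probability p is a latent
# variable with a Beta(a, b) prior, updated from a small evaluation sample.
def beta_posterior(successes, trials, a=1.0, b=1.0):
    """Return posterior (alpha, beta) parameters and the posterior mean."""
    a_post = a + successes
    b_post = b + trials - successes
    mean = a_post / (a_post + b_post)
    return a_post, b_post, mean

# 7 correct answers out of 10 queries: the posterior mean (8/12) shrinks
# toward the prior, which is what makes small-sample estimates usable.
print(beta_posterior(7, 10))
```

The shrinkage toward the prior is exactly what distinguishes this from the raw accuracy 7/10 that a conventional point-estimate evaluation would report.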
arXiv Detail & Related papers (2025-04-30T04:24:50Z) - Where is this coming from? Making groundedness count in the evaluation of Document VQA models [12.951716701565019]
We argue that common evaluation metrics do not account for the semantic and multimodal groundedness of a model's outputs.
We propose a new evaluation methodology that accounts for the groundedness of predictions.
Our proposed methodology is parameterized in such a way that users can configure the score according to their preferences.
arXiv Detail & Related papers (2025-03-24T20:14:46Z) - Rethinking Robustness in Machine Learning: A Posterior Agreement Approach [45.284633306624634]
Posterior Agreement (PA) theory of model validation provides a principled framework for robustness evaluation.
We show that the PA metric provides a sensible and consistent analysis of the vulnerabilities in learning algorithms, even in the presence of few observations.
arXiv Detail & Related papers (2025-03-20T16:03:39Z) - BEExAI: Benchmark to Evaluate Explainable AI [0.9176056742068812]
We propose BEExAI, a benchmark tool that allows large-scale comparison of different post-hoc XAI methods.
We argue that the need for a reliable way of measuring the quality and correctness of explanations is becoming critical.
arXiv Detail & Related papers (2024-07-29T11:21:17Z) - On the Evaluation Consistency of Attribution-based Explanations [42.1421504321572]
We introduce Meta-Rank, an open platform for benchmarking attribution methods in the image domain.
Our benchmark reveals three insights in attribution evaluation endeavors: 1) evaluating attribution methods under disparate settings can yield divergent performance rankings; 2) although inconsistent across numerous cases, the performance rankings exhibit remarkable consistency across distinct checkpoints along the same training trajectory; and 3) prior attempts at consistent evaluation fare no better than baselines when extended to more heterogeneous models and datasets.
arXiv Detail & Related papers (2024-07-28T11:49:06Z) - Trustworthy Classification through Rank-Based Conformal Prediction Sets [9.559062601251464]
We propose a novel conformal prediction method that employs a rank-based score function suitable for classification models.
Our approach constructs prediction sets that achieve the desired coverage rate while managing their size.
Our contributions include a novel conformal prediction method, theoretical analysis, and empirical evaluation.
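A rank-based nonconformity score admits a compact sketch: score each calibration example by the rank of its true label in the model's sorted class probabilities, take a finite-sample quantile of those ranks, and include that many top-ranked classes in each prediction set. All names and the exact calibration rule below are illustrative assumptions, not the paper's method:

```python
import numpy as np

def label_rank(probs, label):
    """Rank of the true label when classes are sorted by probability (1 = top)."""
    order = np.argsort(-probs)
    return int(np.where(order == label)[0][0]) + 1

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Finite-sample (n+1)(1-alpha)/n quantile of calibration ranks."""
    ranks = np.array([label_rank(p, y) for p, y in zip(cal_probs, cal_labels)])
    n = len(ranks)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(ranks, min(q, 1.0), method="higher")

def prediction_set(probs, r_hat):
    """Keep the r_hat top-ranked classes."""
    order = np.argsort(-probs)
    return set(order[:int(r_hat)])

# Toy calibration set where the model is always right: threshold rank is 1,
# so every prediction set contains exactly one class.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(5), size=200)
cal_labels = np.array([np.argmax(p) for p in cal_probs])
r_hat = conformal_threshold(cal_probs, cal_labels)
print(prediction_set(cal_probs[0], r_hat))
```

Because the score depends only on ranks, the resulting sets are invariant to monotone recalibration of the probabilities, which is one motivation for rank-based scores over probability-mass-based ones.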
arXiv Detail & Related papers (2024-07-05T10:43:41Z) - Conformal Approach To Gaussian Process Surrogate Evaluation With Coverage Guarantees [47.22930583160043]
We propose a method for building adaptive cross-conformal prediction intervals.
The resulting conformal prediction intervals exhibit a level of adaptivity akin to Bayesian credibility sets.
The potential applicability of the method is demonstrated in the context of surrogate modeling of an expensive-to-evaluate simulator of the clogging phenomenon in steam generators of nuclear reactors.
arXiv Detail & Related papers (2024-01-15T14:45:18Z) - A Bayesian Approach to Robust Inverse Reinforcement Learning [54.24816623644148]
We consider a Bayesian approach to offline model-based inverse reinforcement learning (IRL).
The proposed framework differs from existing offline model-based IRL approaches by performing simultaneous estimation of the expert's reward function and subjective model of environment dynamics.
Our analysis reveals a novel insight that the estimated policy exhibits robust performance when the expert is believed to have a highly accurate model of the environment.
arXiv Detail & Related papers (2023-09-15T17:37:09Z) - TopP&R: Robust Support Estimation Approach for Evaluating Fidelity and Diversity in Generative Models [9.048102020202817]
Topological Precision and Recall (TopP&R) provides a systematic approach to estimating supports.
We show that TopP&R is robust to outliers and non-independent and identically distributed (Non-IID) perturbations.
This is the first evaluation metric focused on the robust estimation of the support and provides its statistical consistency under noise.
arXiv Detail & Related papers (2023-06-13T11:46:00Z) - Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation [53.83642844626703]
We provide a unifying framework for estimating higher-order derivatives of value functions, based on off-policy evaluation.
Our framework interprets a number of prior approaches as special cases and elucidates the bias and variance trade-off of Hessian estimates.
arXiv Detail & Related papers (2021-06-24T15:58:01Z) - REAM♯: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation [63.46331073232526]
We present an enhancement approach to Reference-based EvAluation Metrics for open-domain dialogue systems.
A prediction model is designed to estimate the reliability of the given reference set.
We show how its predicted results can be helpful to augment the reference set, and thus improve the reliability of the metric.
arXiv Detail & Related papers (2021-05-30T10:04:13Z) - A Unified Taylor Framework for Revisiting Attribution Methods [49.03783992773811]
We propose a Taylor attribution framework and reformulate seven mainstream attribution methods into the framework.
We establish three principles for a good attribution in the Taylor attribution framework.
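The simplest member of such a Taylor attribution family is the first-order expansion around a baseline point, which assigns feature i the contribution grad_i(x0) * (x_i - x0_i). The sketch below is illustrative only; the function names and the linear example are assumptions, not the paper's formulation:

```python
import numpy as np

def first_order_taylor_attribution(grad_fn, x, baseline):
    """First-order Taylor attribution: gradient at the baseline times the
    displacement of each feature from that baseline."""
    g = grad_fn(baseline)
    return g * (x - baseline)

# Example with an analytically known gradient: f(x) = sum(w * x).
w = np.array([2.0, -1.0, 0.5])
grad_fn = lambda x: w                        # gradient of a linear model is w
x = np.array([1.0, 1.0, 1.0])
baseline = np.zeros(3)
attr = first_order_taylor_attribution(grad_fn, x, baseline)
print(attr)                                  # attributions: [2., -1., 0.5]
print(attr.sum() == (w * x).sum())           # True: completeness holds for linear f
```

For a linear model the attributions sum exactly to the prediction's deviation from the baseline output, which is the kind of property such frameworks use to compare mainstream attribution methods.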
arXiv Detail & Related papers (2020-08-21T22:07:06Z) - Evaluations and Methods for Explanation through Robustness Analysis [117.7235152610957]
We establish a novel set of evaluation criteria for such feature-based explanations via robustness analysis.
We obtain new explanations that are loosely necessary and sufficient for a prediction.
We extend the explanation to extract the set of features that would move the current prediction to a target class.
arXiv Detail & Related papers (2020-05-31T05:52:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.