Quanda: An Interpretability Toolkit for Training Data Attribution Evaluation and Beyond
- URL: http://arxiv.org/abs/2410.07158v2
- Date: Thu, 10 Oct 2024 16:36:19 GMT
- Title: Quanda: An Interpretability Toolkit for Training Data Attribution Evaluation and Beyond
- Authors: Dilyara Bareeva, Galip Ümit Yolcu, Anna Hedström, Niklas Schmolenski, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
- Abstract summary: Training data attribution (TDA) methods have emerged as a promising direction for the interpretability of neural networks.
We introduce Quanda, a Python toolkit designed to facilitate the evaluation of TDA methods.
- Score: 14.062323566963972
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In recent years, training data attribution (TDA) methods have emerged as a promising direction for the interpretability of neural networks. While research around TDA is thriving, limited effort has been dedicated to the evaluation of attributions. Similar to the development of evaluation metrics for traditional feature attribution approaches, several standalone metrics have been proposed to evaluate the quality of TDA methods across various contexts. However, the lack of a unified framework that allows for systematic comparison limits trust in TDA methods and stunts their widespread adoption. To address this research gap, we introduce Quanda, a Python toolkit designed to facilitate the evaluation of TDA methods. Beyond offering a comprehensive set of evaluation metrics, Quanda provides a uniform interface for seamless integration with existing TDA implementations across different repositories, thus enabling systematic benchmarking. The toolkit is user-friendly, thoroughly tested, well-documented, and available as an open-source library on PyPi and under https://github.com/dilyabareeva/quanda.
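To make the abstract's two core ideas concrete, a uniform interface for TDA methods and standalone metrics for judging their output, the sketch below shows what such a setup can look like in PyTorch. All names (TDAExplainer, RandomExplainer, top1_label_agreement) are hypothetical stand-ins for this illustration rather than Quanda's actual API; the heuristic shown, checking whether the top-attributed training example shares the test sample's label, is one common sanity check from the TDA evaluation literature.

```python
# Minimal, self-contained sketch (NOT Quanda's actual API): a hypothetical
# uniform TDA interface plus one simple evaluation heuristic.
from abc import ABC, abstractmethod

import torch


class TDAExplainer(ABC):
    """Hypothetical common wrapper: a TDA method maps a batch of test inputs
    to one influence score per training sample."""

    @abstractmethod
    def attribute(self, test_inputs: torch.Tensor) -> torch.Tensor:
        """Return a (n_test, n_train) tensor of influence scores."""


class RandomExplainer(TDAExplainer):
    """Trivial baseline that assigns random influence scores; useful as a
    chance-level reference point in a benchmark."""

    def __init__(self, n_train: int):
        self.n_train = n_train

    def attribute(self, test_inputs: torch.Tensor) -> torch.Tensor:
        return torch.randn(test_inputs.shape[0], self.n_train)


def top1_label_agreement(
    attributions: torch.Tensor,  # (n_test, n_train) influence scores
    train_labels: torch.Tensor,  # (n_train,) integer class labels
    test_labels: torch.Tensor,   # (n_test,) integer class labels
) -> float:
    """Fraction of test samples whose most influential training example
    carries the same class label (a common TDA sanity check)."""
    top1 = attributions.argmax(dim=1)
    return (train_labels[top1] == test_labels).float().mean().item()


if __name__ == "__main__":
    torch.manual_seed(0)
    y_train = torch.tensor([0, 1, 0, 2, 1, 2])
    y_test = torch.tensor([0, 1, 2, 1])
    explainer = RandomExplainer(n_train=len(y_train))
    scores = explainer.attribute(torch.zeros(len(y_test), 3))  # dummy test inputs
    print(f"top-1 label agreement: {top1_label_agreement(scores, y_train, y_test):.2f}")
```

A random baseline like the one above should land near chance level on such a check, while a faithful TDA method is expected to score noticeably higher; automating that kind of comparison across methods and metrics is what a benchmarking toolkit is for.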
Related papers
- TAB: Unified Benchmarking of Time Series Anomaly Detection Methods [30.95339040034454]
Time series anomaly detection (TSAD) plays an important role in many domains such as finance, transportation, and healthcare.
We propose a new time series anomaly detection benchmark, called TAB.
arXiv Detail & Related papers (2025-06-22T14:19:36Z)
- Beyond Models! Explainable Data Valuation and Metric Adaption for Recommendation [10.964035199849125]
Current methods employ data valuation to discern high-quality data from low-quality data.
We propose DVR, an explainable and versatile framework that enhances the efficiency of data utilization and can be tailored to various requirements.
Our framework achieves up to 34.7% improvement over existing methods on the representative NDCG metric.
arXiv Detail & Related papers (2025-02-12T12:01:08Z)
- BoostAdapter: Improving Vision-Language Test-Time Adaptation via Regional Bootstrapping [64.8477128397529]
We propose a test-time adaptation framework that bridges training-required and training-free approaches.
We maintain a light-weight key-value memory for feature retrieval from instance-agnostic historical samples and instance-aware boosting samples.
We theoretically justify the rationality behind our method and empirically verify its effectiveness on both the out-of-distribution and the cross-domain datasets.
arXiv Detail & Related papers (2024-10-20T15:58:43Z) - UDA-Bench: Revisiting Common Assumptions in Unsupervised Domain Adaptation Using a Standardized Framework [59.428668614618914]
We take a deeper look into the diverse factors that influence the efficacy of modern unsupervised domain adaptation (UDA) methods.
To facilitate our analysis, we first develop UDA-Bench, a novel PyTorch framework that standardizes training and evaluation for domain adaptation.
arXiv Detail & Related papers (2024-09-23T17:57:07Z)
- UniTTA: Unified Benchmark and Versatile Framework Towards Realistic Test-Time Adaptation [66.05528698010697]
Test-Time Adaptation aims to adapt pre-trained models to the target domain during testing.
Researchers have identified various challenging scenarios and developed diverse methods to address these challenges.
We propose a Unified Test-Time Adaptation benchmark, which is comprehensive and widely applicable.
arXiv Detail & Related papers (2024-07-29T15:04:53Z)
- BEExAI: Benchmark to Evaluate Explainable AI [0.9176056742068812]
We propose BEExAI, a benchmark tool that allows large-scale comparison of different post-hoc XAI methods.
We argue that the need for a reliable way of measuring the quality and correctness of explanations is becoming critical.
arXiv Detail & Related papers (2024-07-29T11:21:17Z)
- POGEMA: A Benchmark Platform for Cooperative Multi-Agent Navigation [76.67608003501479]
We introduce and specify an evaluation protocol defining a range of domain-related metrics computed on the basis of the primary evaluation indicators.
The results of such a comparison, which involves a variety of state-of-the-art MARL, search-based, and hybrid methods, are presented.
arXiv Detail & Related papers (2024-07-20T16:37:21Z)
- Benchmarking Test-Time Adaptation against Distribution Shifts in Image Classification [77.0114672086012]
Test-time adaptation (TTA) is a technique aimed at enhancing the generalization performance of models by leveraging unlabeled samples solely during prediction.
We present a benchmark that systematically evaluates 13 prominent TTA methods and their variants on five widely used image classification datasets.
arXiv Detail & Related papers (2023-07-06T16:59:53Z)
- On Pitfalls of Test-Time Adaptation [82.8392232222119]
Test-Time Adaptation (TTA) has emerged as a promising approach for tackling the robustness challenge under distribution shifts.
We present TTAB, a test-time adaptation benchmark that encompasses ten state-of-the-art algorithms, a diverse array of distribution shifts, and two evaluation protocols.
arXiv Detail & Related papers (2023-06-06T09:35:29Z)
- Can We Evaluate Domain Adaptation Models Without Target-Domain Labels? [36.05871459064825]
Unsupervised domain adaptation (UDA) involves adapting a model trained on a label-rich source domain to an unlabeled target domain.
In real-world scenarios, the absence of target-domain labels makes it challenging to evaluate the performance of UDA models.
We propose a novel metric called the Transfer Score to address these issues.
arXiv Detail & Related papers (2023-05-30T03:36:40Z)
- Revisiting Long-tailed Image Classification: Survey and Benchmarks with New Evaluation Metrics [88.39382177059747]
A corpus of metrics is designed for measuring the accuracy, robustness, and bounds of algorithms for learning with long-tailed distributions.
Based on our benchmarks, we re-evaluate the performance of existing methods on CIFAR10 and CIFAR100 datasets.
arXiv Detail & Related papers (2023-02-03T02:40:54Z)