On the Importance of Application-Grounded Experimental Design for
Evaluating Explainable ML Methods
- URL: http://arxiv.org/abs/2206.13503v2
- Date: Tue, 28 Jun 2022 13:40:29 GMT
- Title: On the Importance of Application-Grounded Experimental Design for
Evaluating Explainable ML Methods
- Authors: Kasun Amarasinghe, Kit T. Rodolfa, Sérgio Jesus, Valerie Chen,
Vladimir Balayan, Pedro Saleiro, Pedro Bizarro, Ameet Talwalkar, Rayid Ghani
- Abstract summary: We present an experimental study extending a prior explainable ML evaluation experiment and bringing the setup closer to the deployment setting.
Our empirical study draws dramatically different conclusions than the prior work, highlighting how seemingly trivial experimental design choices can yield misleading results.
We believe this work holds lessons about the necessity of situating the evaluation of any ML method and choosing appropriate tasks, data, users, and metrics to match the intended deployment contexts.
- Score: 20.2027063607352
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine Learning (ML) models now inform a wide range of human decisions, but
using "black box" models carries risks such as relying on spurious
correlations or errant data. To address this, researchers have proposed methods
for supplementing models with explanations of their predictions. However,
robust evaluations of these methods' usefulness in real-world contexts have
remained elusive, with experiments tending to rely on simplified settings or
proxy tasks. We present an experimental study extending a prior explainable ML
evaluation experiment and bringing the setup closer to the deployment setting
by relaxing its simplifying assumptions. Our empirical study draws dramatically
different conclusions than the prior work, highlighting how seemingly trivial
experimental design choices can yield misleading results. Beyond the present
experiment, we believe this work holds lessons about the necessity of situating
the evaluation of any ML method and choosing appropriate tasks, data, users,
and metrics to match the intended deployment contexts.
Related papers
- Estimating Causal Effects with Double Machine Learning -- A Method Evaluation [5.904095466127043]
We review one of the most prominent methods: "double/debiased machine learning" (DML).
Our findings indicate that the application of a suitably flexible machine learning algorithm within DML improves the adjustment for various nonlinear confounding relationships.
When estimating the effects of air pollution on housing prices, we find that DML estimates are consistently larger than estimates of less flexible methods.
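For readers unfamiliar with the method, below is a minimal sketch of DML's cross-fitted partialling-out form for a partially linear model; the random-forest nuisance models and two-fold split are illustrative assumptions, not this paper's exact setup.

```python
# Minimal sketch of double/debiased ML (partialling-out with cross-fitting)
# for a partially linear model Y = theta*D + g(X) + noise. Illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_effect(X, D, Y, n_folds=2, seed=0):
    """Cross-fitted estimate of the effect of treatment D on outcome Y given X."""
    res_D = np.zeros(len(D))
    res_Y = np.zeros(len(Y))
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        # Flexible nuisance models for E[Y|X] and E[D|X]
        m_hat = RandomForestRegressor(random_state=seed).fit(X[train], Y[train])
        e_hat = RandomForestRegressor(random_state=seed).fit(X[train], D[train])
        # Residualize on held-out folds (cross-fitting avoids overfitting bias)
        res_Y[test] = Y[test] - m_hat.predict(X[test])
        res_D[test] = D[test] - e_hat.predict(X[test])
    # Final stage: regress Y-residuals on D-residuals (Neyman-orthogonal score)
    return float(res_D @ res_Y / (res_D @ res_D))
```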
arXiv Detail & Related papers (2024-03-21T13:21:33Z)
- Querying Easily Flip-flopped Samples for Deep Active Learning [63.62397322172216]
Active learning is a machine learning paradigm that aims to improve the performance of a model by strategically selecting and querying unlabeled data.
One effective selection strategy is to base it on the model's predictive uncertainty, which can be interpreted as a measure of how informative a sample is.
This paper proposes the least disagree metric (LDM), defined as the smallest probability of disagreement of the predicted label.
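The LDM itself depends on the paper's specific construction, but the uncertainty-based selection it refines can be illustrated with plain margin sampling; the sketch below is that simpler stand-in, with all names invented for this example.

```python
# Illustrative margin-based uncertainty sampling, a simpler stand-in for
# the paper's least disagree metric (LDM); names are invented for this sketch.
import numpy as np

def query_most_uncertain(probs, k=10):
    """Pick the k pool samples whose top-2 class probabilities are closest.

    probs: (n_samples, n_classes) predicted probabilities on the unlabeled pool.
    A small margin means the predicted label flips easily, i.e. the sample
    is highly informative to label next.
    """
    top2 = np.sort(probs, axis=1)[:, -2:]   # two largest probabilities per row
    margin = top2[:, 1] - top2[:, 0]        # small margin = high uncertainty
    return np.argsort(margin)[:k]           # indices to send for labeling
```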
arXiv Detail & Related papers (2024-01-18T08:12:23Z)
- Adaptive Instrument Design for Indirect Experiments [48.815194906471405]
Unlike randomized controlled trials (RCTs), indirect experiments estimate treatment effects by leveraging conditional instrumental variables.
In this paper we take the initial steps towards enhancing sample efficiency for indirect experiments by adaptively designing a data collection policy.
Our main contribution is a practical computational procedure that utilizes influence functions to search for an optimal data collection policy.
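As background on the instrumental-variable estimation this setting builds on (not the paper's adaptive design procedure), here is a minimal sketch of the classic scalar two-stage least squares / Wald estimator:

```python
# Background sketch: the Wald / two-stage least squares (2SLS) estimator,
# the classic way an instrument Z identifies the effect of treatment D on Y.
import numpy as np

def iv_effect(Z, D, Y):
    """Scalar-instrument IV estimate: theta = Cov(Z, Y) / Cov(Z, D)."""
    Zc = Z - Z.mean()
    return float(Zc @ (Y - Y.mean()) / (Zc @ (D - D.mean())))
```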
arXiv Detail & Related papers (2023-12-05T02:38:04Z)
- A Double Machine Learning Approach to Combining Experimental and Observational Data [59.29868677652324]
We propose a double machine learning approach to combine experimental and observational studies.
Our framework tests for violations of external validity and ignorability under milder assumptions.
arXiv Detail & Related papers (2023-07-04T02:53:11Z)
- Intervention Generalization: A View from Factor Graph Models [7.117681268784223]
We take a close look at how to warrant a leap from past experiments to novel conditions based on minimal assumptions about the factorization of the distribution of the manipulated system.
A postulated interventional factor model (IFM) may not always be informative, but it conveniently abstracts away a need for explicitly modeling unmeasured confounding and feedback mechanisms.
arXiv Detail & Related papers (2023-06-06T21:44:23Z)
- Leaving the Nest: Going Beyond Local Loss Functions for Predict-Then-Optimize [57.22851616806617]
We show that our method achieves state-of-the-art results in four domains from the literature.
Our approach outperforms the best existing method by nearly 200% when the localness assumption is broken.
arXiv Detail & Related papers (2023-05-26T11:17:45Z)
- Testing Occupational Gender Bias in Language Models: Towards Robust Measurement and Zero-Shot Debiasing [98.07536837448293]
Large language models (LLMs) have been shown to exhibit a variety of harmful, human-like biases against various demographics.
We introduce a list of desiderata for robustly measuring biases in generative language models.
We then use this benchmark to test several state-of-the-art open-source LLMs, including Llama, Mistral, and their instruction-tuned versions.
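As a hedged illustration of how template-based occupational bias probing can work in general (not this paper's benchmark or desiderata), the sketch below assumes a hypothetical score_continuation(prompt, continuation) that returns a model's log-probability for the continuation.

```python
# Minimal sketch of template-based occupational gender bias probing.
# score_continuation is a hypothetical hook, NOT a real library API.
OCCUPATIONS = ["nurse", "engineer", "teacher", "plumber"]
TEMPLATE = "The {occupation} said that"

def pronoun_gap(score_continuation):
    """Mean log-prob gap between 'he' and 'she' continuations per occupation."""
    gaps = {}
    for occ in OCCUPATIONS:
        prompt = TEMPLATE.format(occupation=occ)
        gaps[occ] = (score_continuation(prompt, " he")
                     - score_continuation(prompt, " she"))
    return gaps  # positive value: the model prefers 'he' after this occupation
```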
arXiv Detail & Related papers (2022-12-20T22:41:24Z)
- Efficient Real-world Testing of Causal Decision Making via Bayesian Experimental Design for Contextual Optimisation [12.37745209793872]
We introduce a model-agnostic framework for gathering data to evaluate and improve contextual decision making.
Our method is used for the data-efficient evaluation of the regret of past treatment assignments.
arXiv Detail & Related papers (2022-07-12T01:20:11Z)
- Bayesian Optimal Experimental Design for Simulator Models of Cognition [14.059933880568908]
We combine recent advances in BOED and approximate inference for intractable models to find optimal experimental designs.
Our simulation experiments on multi-armed bandit tasks show that our method results in improved model discrimination and parameter estimation.
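A minimal sketch of the usual BOED objective, expected information gain (EIG), estimated by nested Monte Carlo; sample_prior, simulate, and log_lik are assumed stand-ins for a task-specific simulator model, not this paper's implementation.

```python
# Nested Monte Carlo estimate of expected information gain (EIG), the
# standard BOED objective. sample_prior/simulate/log_lik are assumed hooks.
import numpy as np

def eig_nested_mc(design, sample_prior, simulate, log_lik,
                  n_outer=200, n_inner=200):
    """EIG(d) ~ E[log p(y|theta,d)] - E[log (1/M) sum_m p(y|theta_m,d)]."""
    total = 0.0
    for _ in range(n_outer):
        theta = sample_prior()
        y = simulate(theta, design)                     # outer draw
        inner = [log_lik(y, sample_prior(), design) for _ in range(n_inner)]
        log_marginal = np.logaddexp.reduce(inner) - np.log(n_inner)
        total += log_lik(y, theta, design) - log_marginal
    return total / n_outer
```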
arXiv Detail & Related papers (2021-10-29T09:04:01Z)
- Predicting Performance for Natural Language Processing Tasks [128.34208911925424]
We build regression models to predict the evaluation score of an NLP experiment given the experimental settings as input.
Experimenting on 9 different NLP tasks, we find that our predictors can produce meaningful predictions over unseen languages and different modeling architectures.
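A minimal sketch of such a performance predictor; the features and the tiny toy log of past runs below are invented assumptions, not the paper's featurization.

```python
# Sketch: predict an NLP experiment's evaluation score from its settings.
# Feature names and the toy log of past runs are invented for illustration.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

runs = pd.DataFrame({
    "train_size_log10": [3.0, 4.0, 5.0, 4.5],   # log10 of training examples
    "is_transformer":   [0, 1, 1, 1],           # model architecture flag
    "lang_sim_to_en":   [0.9, 0.4, 0.7, 0.2],   # language similarity feature
    "score":            [61.2, 70.5, 78.1, 66.3],
})
model = GradientBoostingRegressor().fit(runs.drop(columns="score"), runs["score"])

# Estimate the score of an unseen configuration before running it
new_run = pd.DataFrame({"train_size_log10": [4.2],
                        "is_transformer": [1],
                        "lang_sim_to_en": [0.5]})
print(model.predict(new_run))
```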
arXiv Detail & Related papers (2020-05-02T16:02:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.