On the Importance of Application-Grounded Experimental Design for
Evaluating Explainable ML Methods
- URL: http://arxiv.org/abs/2206.13503v2
- Date: Tue, 28 Jun 2022 13:40:29 GMT
- Title: On the Importance of Application-Grounded Experimental Design for
Evaluating Explainable ML Methods
- Authors: Kasun Amarasinghe, Kit T. Rodolfa, Sérgio Jesus, Valerie Chen,
Vladimir Balayan, Pedro Saleiro, Pedro Bizarro, Ameet Talwalkar, Rayid Ghani
- Abstract summary: We present an experimental study extending a prior explainable ML evaluation experiment and bringing the setup closer to the deployment setting.
Our empirical study draws dramatically different conclusions than the prior work, highlighting how seemingly trivial experimental design choices can yield misleading results.
We believe this work holds lessons about the necessity of situating the evaluation of any ML method and choosing appropriate tasks, data, users, and metrics to match the intended deployment contexts.
- Score: 20.2027063607352
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine Learning (ML) models now inform a wide range of human decisions, but
using "black box" models carries risks such as relying on spurious
correlations or errant data. To address this, researchers have proposed methods
for supplementing models with explanations of their predictions. However,
robust evaluations of these methods' usefulness in real-world contexts have
remained elusive, with experiments tending to rely on simplified settings or
proxy tasks. We present an experimental study extending a prior explainable ML
evaluation experiment and bringing the setup closer to the deployment setting
by relaxing its simplifying assumptions. Our empirical study draws dramatically
different conclusions than the prior work, highlighting how seemingly trivial
experimental design choices can yield misleading results. Beyond the present
experiment, we believe this work holds lessons about the necessity of situating
the evaluation of any ML method and choosing appropriate tasks, data, users,
and metrics to match the intended deployment contexts.
Related papers
- Using LLMs for Explaining Sets of Counterfactual Examples to Final Users [0.0]
In automated decision-making scenarios, causal inference methods can analyze the underlying data-generation process.
Counterfactual examples explore hypothetical scenarios where a minimal number of factors are altered.
We propose a novel multi-step pipeline that uses counterfactuals to generate natural language explanations of actions that will lead to a change in outcome (a toy sketch of such a pipeline follows this entry).
arXiv Detail & Related papers (2024-08-27T15:13:06Z)
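As a loose illustration of this kind of pipeline (not the authors' implementation), the sketch below searches for naive single-feature counterfactuals around a toy scikit-learn classifier and formats them into a prompt that could be handed to a language model for a natural-language explanation; the feature names, the search procedure, and the prompt wording are all hypothetical.

```python
# Toy sketch: naive counterfactual search + LLM prompt construction.
# Feature names, thresholds, and the prompt wording are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
feature_names = ["income", "debt", "tenure"]          # hypothetical features
X = rng.normal(size=(500, 3))
y = (X[:, 0] - X[:, 1] + 0.5 * X[:, 2] > 0).astype(int)
model = LogisticRegression().fit(X, y)

def single_feature_counterfactuals(x, step=0.5, max_steps=8):
    """Perturb one feature at a time until the predicted class flips."""
    base = model.predict(x.reshape(1, -1))[0]
    found = []
    for j, name in enumerate(feature_names):
        for direction in (+1, -1):
            cf = x.copy()
            for _ in range(max_steps):
                cf[j] += direction * step
                if model.predict(cf.reshape(1, -1))[0] != base:
                    found.append((name, x[j], cf[j]))
                    break
            else:
                continue   # no flip in this direction, try the other one
            break          # keep the first flip found per feature
    return base, found

x = X[0]
base, cfs = single_feature_counterfactuals(x)
changes = [f"- change {n} from {a:.2f} to {b:.2f}" for n, a, b in cfs]
prompt = (
    f"The model predicted class {base} for this case. The following minimal "
    "changes would flip the prediction:\n" + "\n".join(changes) +
    "\nExplain, in plain language, which actions would change the outcome."
)
print(prompt)  # this prompt would then be sent to an LLM of choice
```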
- Simulating Field Experiments with Large Language Models [0.6144680854063939]
This paper pioneers the use of large language models (LLMs) for simulating field experiments.
By introducing two novel prompting strategies, observer and participant modes, we demonstrate the ability of LLMs to both predict outcomes and replicate participant responses within complex field settings.
Our findings indicate a promising alignment with actual experimental results in certain scenarios, achieving a simulation accuracy of 66% in observer mode.
arXiv Detail & Related papers (2024-08-19T03:41:43Z)
- C-ICL: Contrastive In-context Learning for Information Extraction [54.39470114243744]
c-ICL is a novel few-shot technique that leverages both correct and incorrect sample constructions to create in-context learning demonstrations (a generic prompt-construction sketch follows this entry).
Our experiments on various datasets indicate that c-ICL outperforms previous few-shot in-context learning methods.
arXiv Detail & Related papers (2024-02-17T11:28:08Z)
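As a generic illustration of contrastive in-context learning (not the paper's exact c-ICL construction), the sketch below assembles a prompt that pairs correct demonstrations with deliberately incorrect ones for a made-up entity-extraction task; every demonstration and the prompt wording are invented.

```python
# Generic contrastive in-context prompt for a toy entity-extraction task.
# The demonstrations (both correct and deliberately incorrect) are invented
# purely to illustrate the prompt structure.
correct_demos = [
    ("Barack Obama visited Berlin.", "PERSON: Barack Obama; LOCATION: Berlin"),
    ("Apple opened a store in Tokyo.", "ORG: Apple; LOCATION: Tokyo"),
]
incorrect_demos = [
    # A plausible-looking mistake, shown so the model learns what to avoid.
    ("Barack Obama visited Berlin.", "ORG: Berlin  <- WRONG: Berlin is a location"),
]

def build_contrastive_prompt(query: str) -> str:
    parts = ["Extract the entities from each sentence.", "", "Correct examples:"]
    for text, label in correct_demos:
        parts.append(f"Sentence: {text}\nEntities: {label}")
    parts += ["", "Incorrect examples (do NOT answer like this):"]
    for text, label in incorrect_demos:
        parts.append(f"Sentence: {text}\nEntities: {label}")
    parts += ["", f"Sentence: {query}\nEntities:"]
    return "\n".join(parts)

print(build_contrastive_prompt("Marie Curie worked in Paris."))
```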
- Querying Easily Flip-flopped Samples for Deep Active Learning [63.62397322172216]
Active learning is a machine learning paradigm that aims to improve a model's performance by strategically selecting and querying unlabeled data.
One effective strategy is to select the samples about which the model is most uncertain, since predictive uncertainty can be interpreted as a measure of how informative a sample is (a generic sketch of this strategy follows this entry).
This paper proposes the least disagree metric (LDM), defined as the smallest probability of disagreement of the predicted label.
arXiv Detail & Related papers (2024-01-18T08:12:23Z)
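To make the uncertainty-based selection strategy concrete, here is a minimal sketch of the generic approach using predictive entropy on a toy scikit-learn model; it illustrates the baseline strategy mentioned above, not the LDM itself, and the data is synthetic.

```python
# Minimal sketch of uncertainty-based query selection (predictive entropy),
# illustrating the generic strategy rather than the paper's LDM.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(50, 5))
y_labeled = (X_labeled[:, 0] > 0).astype(int)
X_pool = rng.normal(size=(1000, 5))          # unlabeled pool

model = LogisticRegression().fit(X_labeled, y_labeled)

def query_most_uncertain(model, X_pool, batch_size=10):
    """Return indices of pool samples with the highest predictive entropy."""
    proba = model.predict_proba(X_pool)
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)
    return np.argsort(entropy)[-batch_size:]

to_label = query_most_uncertain(model, X_pool)
print("Query these pool indices for labels:", to_label)
```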
- Adaptive Instrument Design for Indirect Experiments [48.815194906471405]
Unlike RCTs, indirect experiments estimate treatment effects by leveraging conditional instrumental variables.
In this paper we take the initial steps towards enhancing sample efficiency for indirect experiments by adaptively designing a data collection policy.
Our main contribution is a practical computational procedure that utilizes influence functions to search for an optimal data collection policy.
arXiv Detail & Related papers (2023-12-05T02:38:04Z)
- A Double Machine Learning Approach to Combining Experimental and Observational Data [59.29868677652324]
We propose a double machine learning approach to combine experimental and observational studies.
Our framework tests for violations of external validity and ignorability under milder assumptions.
arXiv Detail & Related papers (2023-07-04T02:53:11Z)
- Intervention Generalization: A View from Factor Graph Models [7.117681268784223]
We take a close look at how to warrant a leap from past experiments to novel conditions based on minimal assumptions about the factorization of the distribution of the manipulated system.
A postulated interventional factor model (IFM) may not always be informative, but it conveniently abstracts away a need for explicitly modeling unmeasured confounding and feedback mechanisms.
arXiv Detail & Related papers (2023-06-06T21:44:23Z)
- Leaving the Nest: Going Beyond Local Loss Functions for Predict-Then-Optimize [57.22851616806617]
We show that our method achieves state-of-the-art results in four domains from the literature.
Our approach outperforms the best existing method by nearly 200% when the localness assumption is broken.
arXiv Detail & Related papers (2023-05-26T11:17:45Z)
- Bayesian Optimal Experimental Design for Simulator Models of Cognition [14.059933880568908]
We combine recent advances in BOED and approximate inference for intractable models to find optimal experimental designs.
Our simulation experiments on multi-armed bandit tasks show that our method results in improved model discrimination and parameter estimation.
arXiv Detail & Related papers (2021-10-29T09:04:01Z)
- Predicting Performance for Natural Language Processing Tasks [128.34208911925424]
We build regression models to predict the evaluation score of an NLP experiment given the experimental settings as input (a toy sketch of this setup follows this entry).
Experimenting on 9 different NLP tasks, we find that our predictors can produce meaningful predictions over unseen languages and different modeling architectures.
arXiv Detail & Related papers (2020-05-02T16:02:18Z)
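As a toy illustration of regressing evaluation scores on experimental settings (the settings, languages, and scores below are fabricated, and this is not the authors' predictor or data):

```python
# Toy sketch: predict an experiment's evaluation score from its settings.
# The settings, languages, and scores here are fabricated for illustration.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

experiments = pd.DataFrame({
    "language":   ["de", "fr", "ja", "de", "fr", "ja", "sw", "sw"],
    "model":      ["bilstm", "bilstm", "transformer", "transformer",
                   "transformer", "bilstm", "transformer", "bilstm"],
    "train_size": [10_000, 50_000, 50_000, 10_000, 100_000, 100_000, 5_000, 5_000],
    "score":      [0.71, 0.78, 0.83, 0.75, 0.86, 0.74, 0.62, 0.55],
})

features = experiments[["language", "model", "train_size"]]
target = experiments["score"]

predictor = Pipeline([
    ("encode", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["language", "model"])],
        remainder="passthrough")),
    ("regress", GradientBoostingRegressor(random_state=0)),
])
predictor.fit(features, target)

# Score an unseen setting, including a language absent from the training data.
new_setting = pd.DataFrame(
    [{"language": "fi", "model": "transformer", "train_size": 30_000}])
print("Predicted score:", float(predictor.predict(new_setting)[0]))
```

Because the one-hot encoder ignores unseen categories, the fitted predictor can still score a setting in a language it never saw, which is in the spirit of the extrapolation to unseen languages described in the summary.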
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented here and is not responsible for any consequences of its use.