Use-Case-Grounded Simulations for Explanation Evaluation
- URL: http://arxiv.org/abs/2206.02256v1
- Date: Sun, 5 Jun 2022 20:12:19 GMT
- Title: Use-Case-Grounded Simulations for Explanation Evaluation
- Authors: Valerie Chen, Nari Johnson, Nicholay Topin, Gregory Plumb, Ameet
Talwalkar
- Abstract summary: We introduce Use-Case-Grounded Simulated Evaluations (SimEvals).
SimEvals involve training algorithmic agents that take as input the information content that would be presented to each participant in a human subject study.
We run a comprehensive evaluation on three real-world use cases to demonstrate that SimEvals can effectively identify which explanation methods will help humans for each use case.
- Score: 23.584251632331046
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A growing body of research runs human subject evaluations to study whether
providing users with explanations of machine learning models can help them with
practical real-world use cases. However, running user studies is challenging
and costly, and consequently each study typically only evaluates a limited
number of different settings, e.g., studies often only evaluate a few
arbitrarily selected explanation methods. To address these challenges and aid
user study design, we introduce Use-Case-Grounded Simulated Evaluations
(SimEvals). SimEvals involve training algorithmic agents that take as input the
information content (such as model explanations) that would be presented to
each participant in a human subject study, to predict answers to the use case
of interest. The algorithmic agent's test set accuracy provides a measure of
the predictiveness of the information content for the downstream use case. We
run a comprehensive evaluation on three real-world use cases (forward
simulation, model debugging, and counterfactual reasoning) to demonstrate that
SimEvals can effectively identify which explanation methods will help humans
for each use case. These results provide evidence that SimEvals can be used to
efficiently screen an important set of user study design decisions, e.g.
selecting which explanations should be presented to the user, before running a
potentially costly user study.
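To make the recipe concrete, below is a minimal, self-contained sketch of a SimEval for the forward-simulation use case. The toy model, the saliency-style explanation, and the MLP agent are illustrative assumptions rather than the authors' released implementation; the key pattern is that the agent is trained on exactly the information content a participant would see, and its held-out accuracy is the screening signal.

```python
# Minimal SimEval sketch (illustrative assumptions throughout: the hidden
# "model", the saliency-style explanation, and the MLP agent are toy choices,
# not the authors' released code).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n, d = 4000, 10
X = rng.normal(size=(n, d))

# Hidden model the agent must forward-simulate: its prediction depends on a
# weighted sum of the inputs interacting with the sign of the first feature.
w = rng.normal(size=d)
model_pred = ((X * w).sum(axis=1) * np.sign(X[:, 0]) > 0).astype(int)

def simeval(information_content, use_case_labels, seed=0):
    """Train an algorithmic agent on the information content a participant
    would see; its test accuracy measures how predictive that content is
    for the downstream use case."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        information_content, use_case_labels, test_size=0.25, random_state=seed)
    agent = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                          random_state=seed).fit(X_tr, y_tr)
    return agent.score(X_te, y_te)

# Candidate study arms: inputs only vs. inputs plus a saliency-style
# explanation (per-feature attributions w_j * x_j).
print("inputs only      :", simeval(X, model_pred))
print("with explanations:", simeval(np.hstack([X, X * w]), model_pred))
```

In an actual study design, each candidate explanation method would define its own information content, and the resulting arm-by-arm accuracies would be compared before committing to a human subject study.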
Related papers
- A Sim2Real Approach for Identifying Task-Relevant Properties in Interpretable Machine Learning [18.965568482077344]
We introduce a generalizable, cost-effective method for identifying task-relevant explanation properties in silico.
We use our approach to identify relevant proxies for three example tasks and validate our simulation with real user studies.
arXiv Detail & Related papers (2024-05-31T18:08:35Z)
- BASES: Large-scale Web Search User Simulation with Large Language Model based Agents [108.97507653131917]
BASES is a novel user simulation framework built on large language model (LLM) based agents.
Our simulation framework can generate unique user profiles at scale, which subsequently leads to diverse search behaviors.
WARRIORS is a new large-scale dataset encompassing web search user behaviors, including both Chinese and English versions.
arXiv Detail & Related papers (2024-02-27T13:44:09Z)
- Evaluating the Utility of Model Explanations for Model Development [54.23538543168767]
We evaluate whether explanations can improve human decision-making in practical scenarios of machine learning model development.
To our surprise, we did not find evidence of significant improvement on tasks when users were provided with any of the saliency maps.
These findings suggest caution about the usefulness of saliency-based explanations and their potential to be misunderstood.
arXiv Detail & Related papers (2023-12-10T23:13:23Z)
- UMSE: Unified Multi-scenario Summarization Evaluation [52.60867881867428]
Summarization quality evaluation is a non-trivial task in text summarization.
We propose the Unified Multi-scenario Summarization Evaluation Model (UMSE).
UMSE is the first unified summarization evaluation framework that can be applied across three evaluation scenarios.
arXiv Detail & Related papers (2023-05-26T12:54:44Z)
- Designing Optimal Behavioral Experiments Using Machine Learning [8.759299724881219]
We provide a tutorial on leveraging recent advances in Bayesian optimal experimental design (BOED) and machine learning to find optimal experiments for any kind of model.
We consider theories of how people balance exploration and exploitation in multi-armed bandit decision-making tasks.
Compared to experimental designs commonly used in the literature, we show that our optimal designs more efficiently determine which of a set of models best accounts for individual human behavior.
arXiv Detail & Related papers (2023-05-12T18:24:30Z)
- A Case Study on Designing Evaluations of ML Explanations with Simulated User Studies [6.2511886555343805]
We conduct the first SimEvals on a real-world use case to evaluate whether explanations can better support ML-assisted decision-making in e-commerce fraud detection.
Our SimEvals suggest that all considered explainers are equally performant and that none beats a baseline without explanations.
arXiv Detail & Related papers (2023-02-15T03:27:55Z)
- HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
- Challenging common interpretability assumptions in feature attribution explanations [0.0]
We empirically evaluate the veracity of three common interpretability assumptions through a large-scale human-subjects experiment.
We find that feature attribution explanations provide only marginal utility for a human decision maker in our task.
arXiv Detail & Related papers (2020-12-04T17:57:26Z)
- A User's Guide to Calibrating Robotics Simulators [54.85241102329546]
This paper proposes a set of benchmarks and a framework for studying algorithms that transfer models and policies learned in simulation to the real world.
We conduct experiments on a wide range of well-known simulated environments to characterize and offer insights into the performance of different algorithms.
Our analysis can be useful for practitioners working in this area and can help them make informed choices about the behavior and main properties of sim-to-real algorithms.
arXiv Detail & Related papers (2020-11-17T22:24:26Z)
- Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? [97.77183117452235]
We carry out human subject tests to isolate the effect of algorithmic explanations on model interpretability.
Clear evidence of method effectiveness is found in very few cases.
Our results provide the first reliable and comprehensive estimates of how explanations influence simulatability.
arXiv Detail & Related papers (2020-05-04T20:35:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.