The effectiveness of feature attribution methods and its correlation
with automatic evaluation scores
- URL: http://arxiv.org/abs/2105.14944v1
- Date: Mon, 31 May 2021 13:23:50 GMT
- Title: The effectiveness of feature attribution methods and its correlation
with automatic evaluation scores
- Authors: Giang Nguyen, Daeyoung Kim, Anh Nguyen
- Abstract summary: We conduct the first, large-scale user study on 320 lay and 11 expert users to shed light on the effectiveness of state-of-the-art attribution methods.
We found that, overall, feature attribution is surprisingly no more effective than showing humans the nearest training-set examples.
- Score: 19.71360639210631
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Explaining the decisions of an Artificial Intelligence (AI) model is
increasingly critical in many real-world, high-stakes applications. Hundreds of
papers have proposed new feature attribution methods or discussed and harnessed
these tools in their work. However, despite humans being the target end-users,
most attribution methods were evaluated only on proxy automatic-evaluation
metrics. In this paper, we conduct the first large-scale user study on 320 lay
and 11 expert users to shed light on the effectiveness of state-of-the-art
attribution methods in assisting humans in ImageNet classification, Stanford
Dogs fine-grained classification, and both of these tasks when the input image
contains adversarial perturbations. We found that, overall, feature attribution
is surprisingly no more effective than showing humans the nearest training-set
examples. On the hard task of fine-grained dog categorization, presenting
attribution maps to humans does not help but instead hurts the performance of
human-AI teams compared to AI alone. Importantly, we found that automatic
attribution-map evaluation measures correlate poorly with actual human-AI team
performance. Our findings encourage the community to rigorously test their
methods on downstream human-in-the-loop applications and to rethink the
existing evaluation metrics.
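To make the correlation question concrete, the sketch below shows one way such a check could be set up: compute a simple gradient-based attribution map, then rank-correlate an automatic metric against human-AI team accuracy across methods. This is a minimal illustration only; the vanilla-gradient saliency, the method names, and all numeric scores are hypothetical placeholders and do not reproduce the paper's actual attribution methods, metrics, or data.

```python
# Minimal, illustrative sketch (not the paper's protocol).
import torch
import torchvision.models as models
from scipy.stats import spearmanr

# 1) A simple gradient-based attribution map for one placeholder input.
#    The model is randomly initialized here; a real check would use a trained model.
model = models.resnet50(weights=None).eval()
image = torch.rand(1, 3, 224, 224, requires_grad=True)  # placeholder image
logits = model(image)
logits[0, logits.argmax(dim=1)].backward()
saliency = image.grad.abs().max(dim=1).values  # attribution map, shape (1, 224, 224)

# 2) Hypothetical per-method scores: an automatic evaluation metric vs. the
#    accuracy of human-AI teams using each method (all numbers are made up).
methods = ["method_A", "method_B", "method_C", "nearest_examples"]
automatic_metric = [0.61, 0.55, 0.72, 0.40]   # e.g., a localization-style score
human_ai_accuracy = [0.74, 0.75, 0.73, 0.78]  # fraction of correct final decisions

rho, p_value = spearmanr(automatic_metric, human_ai_accuracy)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
# A rho near zero or negative would mirror the abstract's finding that automatic
# scores are a poor proxy for downstream human-AI team performance.
```

Rank correlation is a natural choice here because the question is whether a metric's ranking of attribution methods agrees with the ranking induced by actual human-AI performance; a value near zero would mirror the poor correlation the abstract reports.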
Related papers
- Beyond Static Evaluation: A Dynamic Approach to Assessing AI Assistants' API Invocation Capabilities [48.922660354417204]
We propose Automated Dynamic Evaluation (AutoDE) to assess an assistant's API call capability without human involvement.
In our framework, we endeavor to closely mirror genuine human conversation patterns in human-machine interactions.
arXiv Detail & Related papers (2024-03-17T07:34:12Z)
- Automating the Correctness Assessment of AI-generated Code for Security Contexts [8.009107843106108]
We propose a fully automated method, named ACCA, to evaluate the correctness of AI-generated code for security purposes.
We use ACCA to assess four state-of-the-art models trained to generate security-oriented assembly code.
Our experiments show that our method outperforms the baseline solutions and assesses the correctness of AI-generated code comparably to human-based evaluation.
arXiv Detail & Related papers (2023-10-28T22:28:32Z)
- Improving Human-AI Collaboration With Descriptions of AI Behavior [14.904401331154062]
People work with AI systems to improve their decision making, but often under- or over-rely on AI predictions and perform worse than they would have unassisted.
To help people appropriately rely on AI aids, we propose showing them behavior descriptions.
arXiv Detail & Related papers (2023-01-06T00:33:08Z)
- Advancing Human-AI Complementarity: The Impact of User Expertise and Algorithmic Tuning on Joint Decision Making [10.890854857970488]
Many factors can impact the success of human-AI teams, including a user's domain expertise, mental models of an AI system, trust in recommendations, and more.
Our study examined user performance in a non-trivial blood vessel labeling task where participants indicated whether a given blood vessel was flowing or stalled.
Our results show that while recommendations from an AI-Assistant can aid user decision making, factors such as users' baseline performance relative to the AI and complementary tuning of AI error types significantly impact overall team performance.
arXiv Detail & Related papers (2022-08-16T21:39:58Z)
- ACP++: Action Co-occurrence Priors for Human-Object Interaction Detection [102.9428507180728]
A common problem in the task of human-object interaction (HOI) detection is that numerous HOI classes have only a small number of labeled examples.
We observe that there exist natural correlations and anti-correlations among human-object interactions.
We present techniques to learn these priors and leverage them for more effective training, especially on rare classes.
arXiv Detail & Related papers (2021-09-09T06:02:50Z)
- Crowdsourcing Evaluation of Saliency-based XAI Methods [18.18238526746074]
We propose a new human-based evaluation scheme using crowdsourcing to evaluate XAI methods.
Our method is inspired by the human computation game "Peek-a-boom".
We evaluate the saliency maps of various XAI methods on two datasets with automated and crowd-based evaluation schemes.
arXiv Detail & Related papers (2021-06-27T17:37:53Z)
- Scalable Personalised Item Ranking through Parametric Density Estimation [53.44830012414444]
Learning from implicit feedback is challenging because of the difficult nature of the one-class problem.
Most conventional methods use a pairwise ranking approach and negative samplers to cope with the one-class problem.
We propose a learning-to-rank approach, which achieves convergence speed comparable to the pointwise counterpart.
arXiv Detail & Related papers (2021-05-11T03:38:16Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
- Detecting Human-Object Interactions with Action Co-occurrence Priors [108.31956827512376]
A common problem in the human-object interaction (HOI) detection task is that numerous HOI classes have only a small number of labeled examples.
We observe that there exist natural correlations and anti-correlations among human-object interactions.
We present techniques to learn these priors and leverage them for more effective training, especially in rare classes.
arXiv Detail & Related papers (2020-07-17T02:47:45Z)
- Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)
- Proxy Tasks and Subjective Measures Can Be Misleading in Evaluating Explainable AI Systems [14.940404609343432]
We evaluate two currently common techniques for evaluating XAI systems.
We show that evaluations with proxy tasks did not predict the results of the evaluations with the actual decision-making tasks.
Our results suggest that by employing misleading evaluation methods, our field may be inadvertently slowing its progress toward developing human+AI teams that can reliably perform better than humans or AIs alone.
arXiv Detail & Related papers (2020-01-22T22:14:28Z)