Accountable Error Characterization
- URL: http://arxiv.org/abs/2105.04707v1
- Date: Mon, 10 May 2021 23:40:01 GMT
- Title: Accountable Error Characterization
- Authors: Amita Misra, Zhe Liu and Jalal Mahmud
- Abstract summary: We propose an accountable error characterization method, AEC, to understand when and where errors occur.
We perform error detection for a sentiment analysis task using AEC as a case study.
- Score: 7.830479195591646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Customers of machine learning systems demand accountability from the
companies employing these algorithms for various prediction tasks.
Accountability requires understanding a system's limits and the conditions
under which it makes erroneous predictions: customers are often interested in
understanding incorrect predictions, while model developers seek methods for
making incremental improvements to an existing system. We therefore propose an
accountable error characterization method, AEC, to understand when and where
errors occur within existing black-box models.
AEC is built on human-understandable linguistic features, allowing model
developers to automatically identify the main sources of errors in a given
classification system. It can also be used to sample the most informative input
points for the next round of training. As a case study, we perform error
detection for a sentiment analysis task using AEC. Our results on this sentiment
task show that AEC characterizes erroneous predictions into human-understandable
categories and achieves promising results in selecting erroneous samples,
compared with uncertainty-based sampling.
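The core idea in the abstract — fitting an interpretable "error model" on human-understandable features to predict where a black-box classifier errs, then ranking unlabeled inputs by predicted error probability — can be sketched as follows. This is a minimal illustration under assumed toy features (negation and contrast markers, length), not the authors' actual feature set or pipeline:

```python
# Minimal sketch of AEC-style error characterization: a logistic regression
# over human-understandable linguistic features predicts P(black-box errs).
# The feature set below is a hypothetical toy; the paper's is richer.
import math

def linguistic_features(text):
    """Toy interpretable features for a sentiment input."""
    tokens = text.lower().split()
    return [
        1.0,                                                        # bias
        len(tokens) / 20.0,                                         # length
        float(sum(t in {"not", "never", "no"} for t in tokens)),    # negation
        float(sum(t in {"but", "although", "however"} for t in tokens)),  # contrast
    ]

def fit_error_model(texts, blackbox_wrong, epochs=500, lr=0.5):
    """Fit P(error | features) by batch gradient descent on logistic loss."""
    dim = len(linguistic_features(texts[0]))
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for text, y in zip(texts, blackbox_wrong):
            x = linguistic_features(text)
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            for i in range(dim):
                grad[i] += (p - y) * x[i]
        w = [wi - lr * gi / len(texts) for wi, gi in zip(w, grad)]
    return w

def error_probability(w, text):
    x = linguistic_features(text)
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

# Validation texts, labeled 1 where the black-box sentiment model was wrong.
val_texts = ["great movie", "not great at all", "good plot but bad acting",
             "loved it", "never again", "fine"]
wrong = [0, 1, 1, 0, 1, 0]
w = fit_error_model(val_texts, wrong)

# Rank an unlabeled pool: high error probability = most informative to label next.
pool = ["not bad", "wonderful film", "good but not memorable"]
ranked = sorted(pool, key=lambda t: -error_probability(w, t))
```

Because the error model's weights sit on named linguistic features, inspecting them (here, the negation and contrast coefficients) gives the human-understandable error categories the abstract describes; ranking by `error_probability` gives the uncertainty-style sampling alternative.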
Related papers
- Understanding and Mitigating Classification Errors Through Interpretable
Token Patterns [58.91023283103762]
Characterizing errors in easily interpretable terms gives insight into whether a classifier is prone to making systematic errors.
We propose to discover those patterns of tokens that distinguish correct and erroneous predictions.
We show that our method, Premise, performs well in practice.
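The token-pattern idea can be illustrated with a much simpler baseline than Premise itself — ranking tokens by how much more often they appear in erroneously predicted inputs than in correctly predicted ones. The smoothed-ratio scoring below is an assumption for illustration, not the Premise algorithm:

```python
# Hedged sketch of discriminative token discovery (not the Premise method):
# score each token by its smoothed frequency ratio in error vs. correct inputs.
from collections import Counter

def token_error_patterns(texts, is_error, smoothing=1.0):
    # Count document frequency of each token among errors and among correct cases.
    err = Counter(t for x, e in zip(texts, is_error) if e for t in set(x.lower().split()))
    ok = Counter(t for x, e in zip(texts, is_error) if not e for t in set(x.lower().split()))
    vocab = set(err) | set(ok)
    ratio = {t: (err[t] + smoothing) / (ok[t] + smoothing) for t in vocab}
    return sorted(ratio, key=ratio.get, reverse=True)

texts = ["not good", "very good", "not bad at all", "great", "not my thing", "fine"]
errors = [1, 0, 1, 0, 1, 0]
patterns = token_error_patterns(texts, errors)  # tokens most associated with errors first
```

Premise goes further by mining *combinations* of tokens with statistical significance testing, but the ranking above already surfaces systematic triggers such as negation.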
arXiv Detail & Related papers (2023-11-18T00:24:26Z)
- Representing Timed Automata and Timing Anomalies of Cyber-Physical Production Systems in Knowledge Graphs [51.98400002538092]
This paper aims to improve model-based anomaly detection in CPPS by combining the learned timed automaton with a formal knowledge graph about the system.
Both the model and the detected anomalies are described in the knowledge graph in order to allow operators an easier interpretation of the model and the detected anomalies.
arXiv Detail & Related papers (2023-08-25T15:25:57Z)
- Ecosystem-level Analysis of Deployed Machine Learning Reveals Homogeneous Outcomes [72.13373216644021]
We study the societal impact of machine learning by considering the collection of models that are deployed in a given context.
We find deployed machine learning is prone to systemic failure, meaning some users are exclusively misclassified by all models available.
These examples demonstrate that ecosystem-level analysis has unique strengths for characterizing the societal impact of machine learning.
arXiv Detail & Related papers (2023-07-12T01:11:52Z)
- Discovering and Validating AI Errors With Crowdsourced Failure Reports [10.4818618376202]
We introduce crowdsourced failure reports, end-user descriptions of how or why a model failed, and show how developers can use them to detect AI errors.
We also design and implement Deblinder, a visual analytics system for synthesizing failure reports.
In semi-structured interviews and think-aloud studies with 10 AI practitioners, we explore the affordances of the Deblinder system and the applicability of failure reports in real-world settings.
arXiv Detail & Related papers (2021-09-23T23:26:59Z)
- Pushing the Right Buttons: Adversarial Evaluation of Quality Estimation [25.325624543852086]
We propose a general methodology for adversarial testing of Quality Estimation for Machine Translation (MT) systems.
We show that despite the high correlation with human judgements achieved by recent SOTA models, certain types of meaning errors remain problematic for QE to detect.
We also show that, on average, a model's ability to discriminate between meaning-preserving and meaning-altering perturbations is predictive of its overall performance.
arXiv Detail & Related papers (2021-09-22T17:32:18Z)
- Translation Error Detection as Rationale Extraction [36.616561917049076]
We study the behaviour of state-of-the-art sentence-level QE models and show that explanations can indeed be used to detect translation errors.
We (i) introduce a novel semi-supervised method for word-level QE and (ii) propose to use the QE task as a new benchmark for evaluating the plausibility of feature attribution.
arXiv Detail & Related papers (2021-08-27T09:35:14Z)
- When and Why does a Model Fail? A Human-in-the-loop Error Detection Framework for Sentiment Analysis [12.23497603132782]
We propose an error detection framework for sentiment analysis based on explainable features.
Experimental results show that, given limited human-in-the-loop intervention, our method is able to identify erroneous model predictions on unseen data with high precision.
arXiv Detail & Related papers (2021-06-02T05:45:42Z)
- A Bayesian Approach to Identifying Representational Errors [19.539720986687524]
We present a generative model for inferring representational errors based on observations of an actor's behavior.
We show that our approach can recover blind spots of both reinforcement learning agents and human users.
arXiv Detail & Related papers (2021-03-28T16:43:01Z)
- Distribution-Free, Risk-Controlling Prediction Sets [112.9186453405701]
We show how to generate set-valued predictions from a black-box predictor that control the expected loss on future test points at a user-specified level.
Our approach provides explicit finite-sample guarantees for any dataset by using a holdout set to calibrate the size of the prediction sets.
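The holdout-calibration idea can be sketched with a simplified split-conformal rule; the paper's actual RCPS method uses concentration bounds for stronger risk control, so the quantile threshold below is an illustrative assumption:

```python
# Simplified split-conformal sketch of holdout calibration for prediction sets
# (not the paper's exact RCPS algorithm): choose a score threshold on a holdout
# set so the sets cover the true label at a user-specified rate.
import math

def calibrate(holdout_probs, holdout_labels, alpha=0.1):
    """Threshold qhat on the nonconformity score s = 1 - p(true label)."""
    scores = sorted(1.0 - p[y] for p, y in zip(holdout_probs, holdout_labels))
    n = len(scores)
    # Conservative quantile index: ceil((n+1)(1-alpha))-th smallest score.
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    return scores[k]

def prediction_set(probs, qhat):
    """All labels whose nonconformity falls within the calibrated threshold."""
    return {y for y, p in enumerate(probs) if 1.0 - p <= qhat}

# Holdout: per-example class-probability lists from the black-box, plus labels.
holdout_probs = [[0.8, 0.15, 0.05], [0.2, 0.7, 0.1], [0.1, 0.3, 0.6],
                 [0.6, 0.3, 0.1], [0.05, 0.9, 0.05], [0.3, 0.3, 0.4]]
holdout_labels = [0, 1, 2, 0, 1, 2]
qhat = calibrate(holdout_probs, holdout_labels, alpha=0.2)
s = prediction_set([0.5, 0.4, 0.1], qhat)  # set-valued prediction for a new point
```

Confident predictions yield small sets; ambiguous ones yield larger sets, which is how the expected loss stays controlled without distributional assumptions.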
arXiv Detail & Related papers (2021-01-07T18:59:33Z)
- Understanding Classifier Mistakes with Generative Models [88.20470690631372]
Deep neural networks are effective on supervised learning tasks, but have been shown to be brittle.
In this paper, we leverage generative models to identify and characterize instances where classifiers fail to generalize.
Our approach is agnostic to class labels from the training set which makes it applicable to models trained in a semi-supervised way.
arXiv Detail & Related papers (2020-10-05T22:13:21Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.