Evaluating Machine Learning Models with NERO: Non-Equivariance Revealed
on Orbits
- URL: http://arxiv.org/abs/2305.19889v1
- Date: Wed, 31 May 2023 14:24:35 GMT
- Title: Evaluating Machine Learning Models with NERO: Non-Equivariance Revealed
on Orbits
- Authors: Zhuokai Zhao, Takumi Matsuzawa, William Irvine, Michael Maire, Gordon
L Kindlmann
- Abstract summary: We propose a novel evaluation workflow, named Non-Equivariance Revealed on Orbits (NERO) Evaluation.
NERO evaluation consists of a task-agnostic interactive interface and a set of visualizations, called NERO plots.
Case studies show how NERO evaluation can be applied to multiple research areas, including 2D digit recognition, object detection, particle image velocimetry (PIV), and 3D point cloud classification.
- Score: 19.45052971156096
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Proper evaluations are crucial for better understanding, troubleshooting,
and interpreting model behaviors and for further improving model performance. While
scalar-based error metrics provide a fast overview of model
performance, they are often too abstract to display certain weak spots and lack
information regarding important model properties, such as robustness. This not
only hinders machine learning models from being more interpretable and gaining
trust, but can also mislead both model developers and users.
Additionally, conventional evaluation procedures often leave researchers
unclear about where and how a model fails, which complicates model comparisons
and further development. To address these issues, we propose a novel
evaluation workflow, named Non-Equivariance Revealed on Orbits (NERO)
Evaluation. The goal of NERO evaluation is to shift the focus from traditional
scalar-based metrics to evaluating and visualizing model equivariance,
which closely captures model robustness, and to allow researchers to quickly
investigate interesting or unexpected model behaviors. NERO evaluation
consists of a task-agnostic interactive interface and a set of visualizations,
called NERO plots, which reveal the equivariance properties of the model. Case
studies on how NERO evaluation can be applied to multiple research areas,
including 2D digit recognition, object detection, particle image velocimetry
(PIV), and 3D point cloud classification, demonstrate that NERO evaluation can
quickly illustrate different model equivariance, and effectively explain model
behaviors through interactive visualizations of the model outputs. In addition,
we propose consensus, an alternative to ground truths, to be used in NERO
evaluation so that model equivariance can still be evaluated with new,
unlabeled datasets.
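The idea of evaluating a model over an orbit, and of using consensus when no labels exist, can be illustrated with a minimal sketch. This is not the authors' implementation: the orbit here is limited to the four 90-degree rotations, `toy_model` and `invariant_model` are hypothetical stand-ins for a trained classifier, and treating consensus as a majority vote over the orbit is an assumption made for illustration only.

```python
import numpy as np


def orbit(image):
    """Return the image rotated by 0, 90, 180 and 270 degrees (counterclockwise)."""
    return [np.rot90(image, k) for k in range(4)]


def toy_model(image):
    """Hypothetical classifier: 1 if the top half is brighter than the bottom half."""
    half = image.shape[0] // 2
    return int(image[:half].sum() > image[half:].sum())


def invariant_model(image):
    """Hypothetical rotation-invariant classifier: 1 if the mean brightness exceeds 0.25."""
    return int(image.mean() > 0.25)


def orbit_predictions(model, image):
    """Evaluate the model on every element of the orbit."""
    return [model(x) for x in orbit(image)]


def consensus(preds):
    """Assumed stand-in for consensus: the most common prediction over the orbit.

    Ties go to the smallest label because np.unique returns sorted values.
    """
    values, counts = np.unique(preds, return_counts=True)
    return int(values[np.argmax(counts)])


def agreement(preds, reference):
    """Per-orbit-element score: 1 where the prediction matches the reference."""
    return [int(p == reference) for p in preds]


# A 4x4 image whose top two rows are bright.
image = np.zeros((4, 4))
image[:2, :] = 1.0

preds = orbit_predictions(toy_model, image)
print("toy model over orbit:", preds)
print("agreement with label 1:", agreement(preds, 1))
print("agreement with consensus:", agreement(preds, consensus(preds)))
print("invariant model over orbit:", orbit_predictions(invariant_model, image))
```

Plotting the per-element agreement against the rotation angle would give a very rough analogue of a per-sample plot: a flat profile suggests invariance to the group action, while dips show where the model breaks. Note that in this toy case the consensus disagrees with the true label, which shows that a majority vote measures self-consistency over the orbit rather than correctness.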
Related papers
- Supervised Score-Based Modeling by Gradient Boosting [49.556736252628745]
We propose a Supervised Score-based Model (SSM) which can be viewed as a gradient boosting algorithm combining score matching.
We provide a theoretical analysis of learning and sampling for SSM to balance inference time and prediction accuracy.
Our model outperforms existing models in both accuracy and inference time.
arXiv Detail & Related papers (2024-11-02T07:06:53Z)
- Unsupervised Model Diagnosis [49.36194740479798]
This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance.
Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources.
arXiv Detail & Related papers (2024-10-08T17:59:03Z)
- A Comprehensive Evaluation and Analysis Study for Chinese Spelling Check [53.152011258252315]
We show that using phonetic and graphic information reasonably is effective for Chinese Spelling Check.
Models are sensitive to the error distribution of the test set, which reflects the shortcomings of models.
The commonly used benchmark, SIGHAN, cannot reliably evaluate models' performance.
arXiv Detail & Related papers (2023-07-25T17:02:38Z)
- Artificial neural networks and time series of counts: A class of nonlinear INGARCH models [0.0]
It is shown how INGARCH models can be combined with artificial neural network (ANN) response functions to obtain a class of nonlinear INGARCH models.
The ANN framework allows for the interpretation of many existing INGARCH models as a degenerate version of a corresponding neural model.
The empirical analysis of time series of bounded and unbounded counts reveals that the neural INGARCH models are able to outperform reasonable degenerate competitor models in terms of the information loss.
arXiv Detail & Related papers (2023-04-03T14:26:16Z)
- Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDM) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z)
- Interpreting Black-box Machine Learning Models for High Dimensional Datasets [40.09157165704895]
We train a black-box model on a high-dimensional dataset to learn the embeddings on which the classification is performed.
We then approximate the behavior of the black-box model by means of an interpretable surrogate model on the top-k feature space.
Our approach outperforms state-of-the-art methods like TabNet and XGBoost when tested on different datasets.
arXiv Detail & Related papers (2022-08-29T07:36:17Z)
- Deep Learning Models for Knowledge Tracing: Review and Empirical Evaluation [2.423547527175807]
We review and evaluate a body of deep learning knowledge tracing (DLKT) models with openly available and widely-used data sets.
The evaluated DLKT models have been reimplemented to assess the replicability of previously reported results.
arXiv Detail & Related papers (2021-12-30T14:19:27Z)
- MDN-VO: Estimating Visual Odometry with Confidence [34.8860186009308]
Visual Odometry (VO) is used in many applications including robotics and autonomous systems.
We propose a deep learning-based VO model to estimate 6-DoF poses, as well as a confidence model for these estimates.
Our experiments show that the proposed model exceeds state-of-the-art performance in addition to detecting failure cases.
arXiv Detail & Related papers (2021-12-23T19:26:04Z)
- Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
- Recoding latent sentence representations -- Dynamic gradient-based activation modification in RNNs [0.0]
In RNNs, encoding information in a suboptimal way can impact the quality of representations based on later elements in the sequence.
I propose an augmentation to standard RNNs in the form of a gradient-based correction mechanism.
I conduct different experiments in the context of language modeling, where the impact of using such a mechanism is examined in detail.
arXiv Detail & Related papers (2021-01-03T17:54:17Z)
- Explaining and Improving Model Behavior with k Nearest Neighbor Representations [107.24850861390196]
We propose using k nearest neighbor representations to identify training examples responsible for a model's predictions.
We show that kNN representations are effective at uncovering learned spurious associations.
Our results indicate that the kNN approach makes the finetuned model more robust to adversarial inputs.
arXiv Detail & Related papers (2020-10-18T16:55:25Z)