A Principled Approach to Failure Analysis and Model Repairment:
Demonstration in Medical Imaging
- URL: http://arxiv.org/abs/2109.12347v1
- Date: Sat, 25 Sep 2021 12:04:19 GMT
- Title: A Principled Approach to Failure Analysis and Model Repairment:
Demonstration in Medical Imaging
- Authors: Thomas Henn, Yasukazu Sakamoto, Clément Jacquet, Shunsuke Yoshizawa,
Masamichi Andou, Stephen Tchen, Ryosuke Saga, Hiroyuki Ishihara, Katsuhiko
Shimizu, Yingzhen Li and Ryutaro Tanno
- Abstract summary: Machine learning models commonly exhibit unexpected failures post-deployment.
We aim to standardise and bring principles to this process through answering two critical questions.
We suggest that the quality of the identified failure types can be validated through measuring the intra- and inter-type generalisation.
We argue that a model can be considered repaired if it achieves high accuracy on the failure types while retaining performance on the previously correct data.
- Score: 12.732665048388041
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Machine learning models commonly exhibit unexpected failures post-deployment
due to either data shifts or uncommon situations in the training environment.
Domain experts typically go through the tedious process of inspecting the
failure cases manually, identifying failure modes and then attempting to fix
the model. In this work, we aim to standardise and bring principles to this
process through answering two critical questions: (i) how do we know that we
have identified meaningful and distinct failure types?; (ii) how can we
validate that a model has, indeed, been repaired? We suggest that the quality
of the identified failure types can be validated through measuring the intra-
and inter-type generalisation after fine-tuning and introduce metrics to
compare different subtyping methods. Furthermore, we argue that a model can be
considered repaired if it achieves high accuracy on the failure types while
retaining performance on the previously correct data. We combine these two
ideas into a principled framework for evaluating the quality of both the
identified failure subtypes and model repairment. We evaluate its utility on a
classification and an object detection task. Our code is available at
https://github.com/Rokken-lab6/Failure-Analysis-and-Model-Repairment
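A minimal sketch of the two checks described in the abstract, assuming simple accuracy-based evaluation; the function names, data structures, and thresholds are illustrative assumptions, not the authors' released implementation (see the repository linked above for that):

```python
# Illustrative sketch only: per-type fine-tuning to probe intra-/inter-type
# generalisation, plus the repairment criterion from the abstract.
from typing import Callable, Dict, List, Tuple

Example = Tuple[object, object]                       # (input, label)
EvalFn = Callable[[object, List[Example]], float]     # (model, data) -> accuracy
TuneFn = Callable[[object, List[Example]], object]    # (model, data) -> fine-tuned model


def generalisation_matrix(model, failure_types: Dict[str, List[Example]],
                          fine_tune: TuneFn, evaluate: EvalFn) -> Dict[str, Dict[str, float]]:
    """Fine-tune on each failure type and evaluate on every type.

    Diagonal entries approximate intra-type generalisation (does fixing a
    type generalise within that type?); off-diagonal entries approximate
    inter-type generalisation (are the identified types really distinct?).
    """
    matrix = {}
    for src, src_data in failure_types.items():
        repaired = fine_tune(model, src_data)
        matrix[src] = {dst: evaluate(repaired, dst_data)
                       for dst, dst_data in failure_types.items()}
    return matrix


def is_repaired(model, failure_data: List[Example], correct_data: List[Example],
                evaluate: EvalFn, baseline_correct_acc: float,
                min_failure_acc: float = 0.9, max_drop: float = 0.01) -> bool:
    """Repairment criterion: high accuracy on the failure types while
    retaining performance on the previously correct data.
    The two thresholds are placeholders, not values from the paper."""
    return (evaluate(model, failure_data) >= min_failure_acc and
            evaluate(model, correct_data) >= baseline_correct_acc - max_drop)
```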
Related papers
- Rethinking Early Stopping: Refine, Then Calibrate [49.966899634962374]
We show that calibration error and refinement error are not minimized simultaneously during training.
We introduce a new metric for early stopping and hyperparameter tuning that makes it possible to minimize refinement error during training.
Our method integrates seamlessly with any architecture and consistently improves performance across diverse classification tasks.
arXiv Detail & Related papers (2025-01-31T15:03:54Z) - ProcessBench: Identifying Process Errors in Mathematical Reasoning [62.80402845414901]
We introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning.
ProcessBench consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems.
We conduct extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models.
arXiv Detail & Related papers (2024-12-09T15:11:40Z) - Automatic Discovery and Assessment of Interpretable Systematic Errors in Semantic Segmentation [0.5242869847419834]
This paper presents a novel method for discovering systematic errors in segmentation models.
We leverage multimodal foundation models to retrieve errors and use conceptual linkage, along with the nature of the errors, to study how systematic these errors are.
Our work opens up avenues for model analysis and intervention that have so far been underexplored in semantic segmentation.
arXiv Detail & Related papers (2024-11-16T17:31:37Z) - DECIDER: Leveraging Foundation Model Priors for Improved Model Failure Detection and Explanation [18.77296551727931]
We propose DECIDER, a novel approach that leverages priors from large language models (LLMs) and vision-language models (VLMs) to detect failures in image models.
DECIDER consistently achieves state-of-the-art failure detection performance, significantly outperforming baselines in terms of the overall Matthews correlation coefficient.
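For reference, the Matthews correlation coefficient mentioned in the DECIDER entry above can be computed as follows; the formula is standard, while framing failure detection as a binary 0/1 prediction task is an illustrative assumption:

```python
# Matthews correlation coefficient for binary failure detection.
import math
from typing import Sequence


def matthews_corrcoef(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    # Count the confusion-matrix cells for the binary labels (1 = failure).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```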
arXiv Detail & Related papers (2024-08-01T07:08:11Z) - SINDER: Repairing the Singular Defects of DINOv2 [61.98878352956125]
Vision Transformer models trained on large-scale datasets often exhibit artifacts in the patch tokens they extract.
We propose a novel smooth regularization, applied during fine-tuning, that rectifies structural deficiencies using only a small dataset.
arXiv Detail & Related papers (2024-07-23T20:34:23Z) - Selective Learning: Towards Robust Calibration with Dynamic Regularization [79.92633587914659]
Miscalibration in deep learning refers to a discrepancy between a model's predicted confidence and its actual performance.
We introduce Dynamic Regularization (DReg), which aims to learn what should be learned during training, thereby circumventing the confidence-adjustment trade-off.
arXiv Detail & Related papers (2024-02-13T11:25:20Z) - Zero-shot Model Diagnosis [80.36063332820568]
A common approach to evaluating deep learning models is to build a labeled test set with attributes of interest and assess how well the model performs on it.
This paper argues that Zero-shot Model Diagnosis (ZOOM) is possible without the need for a test set or labeling.
arXiv Detail & Related papers (2023-03-27T17:59:33Z) - Repairing Neural Networks by Leaving the Right Past Behind [23.78437548836594]
Prediction failures of machine learning models often arise from deficiencies in training data.
This work develops a generic framework for both identifying the training examples that gave rise to the target failure and fixing the model by erasing information about them.
arXiv Detail & Related papers (2022-07-11T12:07:39Z) - Exploring Strategies for Generalizable Commonsense Reasoning with
Pre-trained Models [62.28551903638434]
We measure the impact of three different adaptation methods on the generalization and accuracy of models.
Experiments with two models show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers.
We observe that alternative adaptation methods like prefix-tuning have comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits.
arXiv Detail & Related papers (2021-09-07T03:13:06Z) - Defuse: Harnessing Unrestricted Adversarial Examples for Debugging
Models Beyond Test Accuracy [11.265020351747916]
Defuse is a method to automatically discover and correct model errors beyond those available in test data.
We propose an algorithm inspired by adversarial machine learning techniques that uses a generative model to find naturally occurring instances misclassified by a model.
Defuse corrects the error after fine-tuning while maintaining generalization on the test set.
arXiv Detail & Related papers (2021-02-11T18:08:42Z) - How Can We Know When Language Models Know? On the Calibration of
Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness.
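As context for the calibration question above, expected calibration error (ECE) is one standard way to measure whether confidence scores track correctness; the equal-width binning below is a common convention and not necessarily the metric used in that paper:

```python
# Expected calibration error with equal-width confidence bins.
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bin is (lo, hi]; put confidence 0.0 into the first bin.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        avg_acc = sum(correct[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(avg_conf - avg_acc)
    return ece
```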
arXiv Detail & Related papers (2020-12-02T03:53:13Z) - Debugging Tests for Model Explanations [18.073554618753395]
The methods tested are able to diagnose a spurious background bug, but cannot conclusively identify mislabeled training examples.
We complement our analysis with a human subject study, and find that subjects fail to identify defective models using attributions, but instead rely, primarily, on model predictions.
arXiv Detail & Related papers (2020-11-10T22:23:25Z) - Understanding Classifier Mistakes with Generative Models [88.20470690631372]
Deep neural networks are effective on supervised learning tasks, but have been shown to be brittle.
In this paper, we leverage generative models to identify and characterize instances where classifiers fail to generalize.
Our approach is agnostic to class labels from the training set which makes it applicable to models trained in a semi-supervised way.
arXiv Detail & Related papers (2020-10-05T22:13:21Z)