Interactive Classification Metrics: A graphical application to build robust intuition for classification model evaluation
- URL: http://arxiv.org/abs/2412.17066v1
- Date: Sun, 22 Dec 2024 15:36:15 GMT
- Title: Interactive Classification Metrics: A graphical application to build robust intuition for classification model evaluation
- Authors: David H. Brown, Davide Chicco
- Abstract summary: Interactive Classification Metrics (ICM) is an application to visualize and explore the relationships between different evaluation metrics.
The user changes the distribution statistics and explores corresponding changes across a suite of evaluation metrics.
- Score: 0.0
- License:
- Abstract: Machine learning continues to grow in popularity in academia and industry, and is increasingly used in other fields. However, most of the common metrics used to evaluate even simple binary classification models have shortcomings that are neither immediately obvious nor consistently taught to practitioners. Here we present Interactive Classification Metrics (ICM), an application to visualize and explore the relationships between different evaluation metrics. The user changes the distribution statistics and explores corresponding changes across a suite of evaluation metrics. The interactive, graphical nature of this tool emphasizes the tradeoffs of each metric without the overhead of data wrangling and model training. The goals of this application are: (1) to aid practitioners in the ever-expanding machine learning field in choosing the most appropriate evaluation metrics for their classification problem; (2) to promote the careful attention to interpretation that is required even in the simplest scenarios, like binary classification. Our application is publicly available for free under the MIT license as a Python package on PyPI at https://pypi.org/project/interactive-classification-metrics and on GitHub at https://github.com/davhbrown/interactive_classification_metrics.
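The core idea is easy to sketch outside the GUI. The following minimal example (our own illustration of what ICM visualizes, not the package's API) draws classifier scores for the two classes from adjustable Gaussians, sweeps a decision threshold, and prints how a suite of common metrics responds; the class imbalance is deliberate, to show why accuracy alone can mislead.

```python
# A stand-alone sketch of the idea behind ICM (not the package's API):
# draw scores for each class from adjustable Gaussians, sweep a decision
# threshold, and watch how the usual confusion-matrix metrics respond.
import numpy as np

rng = np.random.default_rng(0)
neg = rng.normal(0.0, 1.0, 10_000)   # scores of the negative class
pos = rng.normal(1.5, 1.0, 1_000)    # scores of the (rarer) positive class

for t in (0.0, 0.75, 1.5):           # candidate decision thresholds
    tp = np.sum(pos >= t); fn = np.sum(pos < t)
    fp = np.sum(neg >= t); tn = np.sum(neg < t)
    tpr = tp / (tp + fn)                          # recall / sensitivity
    ppv = tp / (tp + fp) if tp + fp else 0.0      # precision
    f1 = 2 * ppv * tpr / (ppv + tpr) if ppv + tpr else 0.0
    acc = (tp + tn) / (tp + tn + fp + fn)
    # Matthews correlation coefficient, informative even under imbalance
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    print(f"t={t:.2f}  acc={acc:.3f}  precision={ppv:.3f}  "
          f"recall={tpr:.3f}  F1={f1:.3f}  MCC={mcc:.3f}")
```

Changing the class means, spreads, or imbalance ratio and re-running mimics, in batch form, what ICM exposes as interactive sliders.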
Related papers
- MLMC: Interactive multi-label multi-classifier evaluation without confusion matrices [52.476815843373515]
MLMC is a visual exploration tool that tackles the challenge of multi-label comparison and evaluation.
Our study shows that the techniques implemented by MLMC allow for a powerful multi-label classifier evaluation while preserving user friendliness.
arXiv Detail & Related papers (2025-01-24T12:43:36Z)
- $F_\beta$-plot -- a visual tool for evaluating imbalanced data classifiers [0.0]
The paper proposes a simple approach to analyzing the popular parametric metric $F_\beta$.
For a given pool of analyzed classifiers, it makes it possible to indicate which model should be preferred, depending on user requirements.
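For reference, $F_\beta = (1+\beta^2) \cdot \frac{P \cdot R}{\beta^2 P + R}$ for precision $P$ and recall $R$; larger $\beta$ weights recall more heavily. A minimal sketch of how the choice of $\beta$ reorders two hypothetical classifiers (our own illustration of the metric, not the paper's plotting tool):

```python
# F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); larger beta favors recall.
def f_beta(p: float, r: float, beta: float) -> float:
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Two hypothetical classifiers: A favors precision, B favors recall.
for name, p, r in [("A", 0.90, 0.60), ("B", 0.65, 0.90)]:
    print(name, {b: round(f_beta(p, r, b), 3) for b in (0.5, 1.0, 2.0)})
# A wins at beta=0.5, B wins at beta>=1: the preferred model depends on beta.
```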
arXiv Detail & Related papers (2024-04-11T18:07:57Z)
- Machine Learning Capability: A standardized metric using case difficulty with applications to individualized deployment of supervised machine learning [2.2060666847121864]
Model evaluation is a critical component in supervised machine learning classification analyses.
Item Response Theory (IRT) and Computer Adaptive Testing (CAT) combined with machine learning can benchmark datasets independently of the end-classification results.
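A minimal sketch of IRT's core building block, the one-parameter (Rasch) model, in which the probability of handling a case depends on the gap between respondent ability and case difficulty (our own illustration, not the paper's exact formulation):

```python
import numpy as np

def p_correct(ability: float, difficulty: float) -> float:
    # Rasch (1PL) model: a logistic function of ability minus difficulty.
    return 1.0 / (1.0 + np.exp(-(ability - difficulty)))

for d in (-2.0, 0.0, 2.0):   # an easy, a medium, and a hard case
    print(f"difficulty={d:+.1f}: P(correct)={p_correct(0.5, d):.2f}")
```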
arXiv Detail & Related papers (2023-02-09T00:38:42Z)
- KGxBoard: Explainable and Interactive Leaderboard for Evaluation of Knowledge Graph Completion Models [76.01814380927507]
KGxBoard is an interactive framework for performing fine-grained evaluation on meaningful subsets of the data.
In our experiments, we highlight findings made with KGxBoard that would have been impossible to detect with standard averaged single-score metrics.
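The underlying point is easy to sketch: an averaged single score can look healthy while one slice of the data fails badly. A toy illustration (our own, with made-up bucket names and rates, not KGxBoard's interface):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical per-example correctness, tagged by relation-frequency bucket.
buckets = {"frequent relations": rng.random(900) < 0.95,
           "rare relations":     rng.random(100) < 0.40}

overall = np.concatenate(list(buckets.values())).mean()
print(f"overall accuracy: {overall:.3f}")       # looks healthy on its own
for name, hits in buckets.items():
    print(f"  {name:18s}: {hits.mean():.3f}")   # the rare bucket collapses
```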
arXiv Detail & Related papers (2022-08-23T15:11:45Z)
- Evaluating Machine Unlearning via Epistemic Uncertainty [78.27542864367821]
This work presents an evaluation of Machine Unlearning algorithms based on uncertainty.
To the best of our knowledge, this is the first general definition of such an evaluation.
arXiv Detail & Related papers (2022-08-23T09:37:31Z)
- Classification Performance Metric Elicitation and its Applications [5.5637552942511155]
Despite its practical interest, there is limited formal guidance on how to select metrics for machine learning applications.
This thesis outlines metric elicitation as a principled framework for selecting the performance metric that best reflects implicit user preferences.
arXiv Detail & Related papers (2022-08-19T03:57:17Z)
- A novel evaluation methodology for supervised Feature Ranking algorithms [0.0]
This paper proposes a new evaluation methodology for Feature Rankers.
By making use of synthetic datasets, feature importance scores can be known beforehand, allowing more systematic evaluation.
To facilitate large-scale experimentation using the new methodology, a benchmarking framework was built in Python, called fseval.
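The evaluation idea is simple to sketch: on synthetic data the informative features are known up front, so a ranker can be scored by how many of them it recovers. A minimal illustration with scikit-learn (our own sketch; fseval's actual API may differ):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# With shuffle=False, the informative features are the first five columns.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)
truth = set(range(5))

# Rank features by estimated mutual information with the target.
scores = mutual_info_classif(X, y, random_state=0)
top5 = set(np.argsort(scores)[::-1][:5])
print(f"recovered {len(top5 & truth)}/5 informative features")
```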
arXiv Detail & Related papers (2022-07-09T12:00:36Z)
- A Unified Framework for Rank-based Evaluation Metrics for Link Prediction in Knowledge Graphs [19.822126244784133]
The link prediction task on knowledge graphs, which lacks explicit negative triples, motivates the use of rank-based metrics.
We introduce a simple theoretical framework for rank-based metrics, upon which we investigate two avenues for improving existing metrics: alternative aggregation functions and concepts from probability theory.
We propose several new rank-based metrics that are more easily interpreted and compared, accompanied by a demonstration of their use in benchmarking knowledge graph embedding models.
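For context, the standard rank-based aggregations that such a framework builds on are straightforward to compute. The sketch below shows mean rank, mean reciprocal rank, and Hits@10 on hypothetical ranks (the paper's proposed adjusted variants are not reproduced here):

```python
import numpy as np

# Hypothetical rank of the true entity for each link-prediction query.
ranks = np.array([1, 3, 2, 50, 1, 7, 120, 4])

mr = ranks.mean()                 # mean rank: lower is better, unbounded
mrr = (1.0 / ranks).mean()        # mean reciprocal rank, in (0, 1]
hits10 = (ranks <= 10).mean()     # Hits@10: fraction ranked in the top 10
print(f"MR={mr:.1f}  MRR={mrr:.3f}  Hits@10={hits10:.3f}")
```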
arXiv Detail & Related papers (2022-03-14T23:09:46Z)
- Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand [117.62186420147563]
We propose a generalization of leaderboards: bidimensional leaderboards (Billboards).
Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries.
We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation.
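A minimal sketch of the linear-ensemble idea (our own illustration with made-up numbers, not the Billboard codebase): fit weights so a combination of automatic metric scores best predicts human judgments.

```python
import numpy as np

# Rows are systems; columns are three hypothetical automatic metrics.
metric_scores = np.array([[0.61, 0.70, 0.55],
                          [0.58, 0.74, 0.60],
                          [0.70, 0.66, 0.52],
                          [0.65, 0.72, 0.63],
                          [0.55, 0.68, 0.58]])
human = np.array([0.62, 0.68, 0.60, 0.71, 0.59])  # hypothetical ratings

# Least-squares fit of metric weights (plus an intercept) to the ratings.
X = np.hstack([metric_scores, np.ones((len(human), 1))])
w, *_ = np.linalg.lstsq(X, human, rcond=None)
print("ensemble weights:", np.round(w, 3))
print("ensemble predictions:", np.round(X @ w, 3))
```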
arXiv Detail & Related papers (2021-12-08T06:34:58Z)
- Deep Relational Metric Learning [84.95793654872399]
This paper presents a deep relational metric learning framework for image clustering and retrieval.
We learn an ensemble of features that characterizes an image from different aspects to model both interclass and intraclass distributions.
Experiments on the widely-used CUB-200-2011, Cars196, and Stanford Online Products datasets demonstrate that our framework improves existing deep metric learning methods and achieves very competitive results.
arXiv Detail & Related papers (2021-08-23T09:31:18Z)
- Fine-Grained Visual Classification with Efficient End-to-end Localization [49.9887676289364]
We present an efficient localization module that can be fused with a classification network in an end-to-end setup.
We evaluate the new model on the three benchmark datasets CUB200-2011, Stanford Cars and FGVC-Aircraft.
arXiv Detail & Related papers (2020-05-11T14:07:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.