Explainable Machine Learning for Categorical and Mixed Data with
Lossless Visualization
- URL: http://arxiv.org/abs/2305.18437v3
- Date: Thu, 23 Nov 2023 01:49:02 GMT
- Title: Explainable Machine Learning for Categorical and Mixed Data with
Lossless Visualization
- Authors: Boris Kovalerchuk, Elijah McCoy
- Abstract summary: This study proposes a classification of mixed data types and analyzes their important role in Machine Learning.
It presents a toolkit for enforcing interpretability of all internal operations of ML algorithms on mixed data with a visual data exploration on mixed data.
A new Sequential Rule Generation (SRG) algorithm for explainable rule generation with categorical data is proposed and successfully evaluated in multiple computational experiments.
- Score: 3.4809730725241597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building accurate and interpretable Machine Learning (ML) models for
heterogeneous/mixed data is a long-standing challenge for algorithms designed
for numeric data. This work focuses on developing numeric coding schemes for
non-numeric attributes for ML algorithms to support accurate and explainable ML
models, methods for lossless visualization of n-D non-numeric categorical data
with visual rule discovery in these visualizations, and accurate and
explainable ML models for categorical data. This study proposes a
classification of mixed data types and analyzes their important role in Machine
Learning. It presents a toolkit for enforcing interpretability of all internal
operations of ML algorithms on mixed data with a visual data exploration on
mixed data. A new Sequential Rule Generation (SRG) algorithm for explainable
rule generation with categorical data is proposed and successfully evaluated in
multiple computational experiments. This work is one of the steps to the full
scope ML algorithms for mixed data supported by lossless visualization of n-D
data in General Line Coordinates beyond Parallel Coordinates.
Related papers
- Matchmaker: Self-Improving Large Language Model Programs for Schema Matching [60.23571456538149]
We propose a compositional language model program for schema matching, comprised of candidate generation, refinement and confidence scoring.
Matchmaker self-improves in a zero-shot manner without the need for labeled demonstrations.
Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches.
arXiv Detail & Related papers (2024-10-31T16:34:03Z)
- Approaching Metaheuristic Deep Learning Combos for Automated Data Mining [0.5419570023862531]
This work proposes a means of combining meta-heuristic methods with conventional classifiers and neural networks in order to perform automated data mining.
Experiments on the MNIST dataset for handwritten digit recognition were performed.
It was empirically observed that using a ground truth labeled dataset's validation accuracy is inadequate for correcting labels of other previously unseen data instances.
arXiv Detail & Related papers (2024-10-16T10:28:22Z)
- Interpetable Target-Feature Aggregation for Multi-Task Learning based on Bias-Variance Analysis [53.38518232934096]
Multi-task learning (MTL) is a powerful machine learning paradigm designed to leverage shared knowledge across tasks to improve generalization and performance.
We propose an MTL approach at the intersection between task clustering and feature transformation based on a two-phase iterative aggregation of targets and features.
In both phases, a key aspect is to preserve the interpretability of the reduced targets and features through the aggregation with the mean, which is motivated by applications to Earth science.
arXiv Detail & Related papers (2024-06-12T08:30:16Z)
- Minimally Informed Linear Discriminant Analysis: training an LDA model with unlabelled data [51.673443581397954]
We show that it is possible to compute the exact projection vector from LDA models based on unlabelled data.
We show that the MILDA projection vector can be computed in a closed form with a computational cost comparable to LDA.
arXiv Detail & Related papers (2023-10-17T09:50:31Z)
- Machine Learning Based Missing Values Imputation in Categorical Datasets [2.5611256859404983]
This research looked into the use of machine learning algorithms to fill in the gaps in categorical datasets.
The emphasis was on ensemble models constructed using the Error-Correcting Output Codes (ECOC) framework.
Deep learning for missing data imputation has obstacles despite these encouraging results, including the requirement for large amounts of labeled data.
arXiv Detail & Related papers (2023-06-10T03:29:48Z)
- Learning to Bound Counterfactual Inference in Structural Causal Models from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm.
The new algorithm learns to approximate the (unidentifiability) region of model parameters from such mixed data sources.
It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
arXiv Detail & Related papers (2022-12-06T12:42:11Z)
- RandomSCM: interpretable ensembles of sparse classifiers tailored for omics data [59.4141628321618]
We propose an ensemble learning algorithm based on conjunctions or disjunctions of decision rules.
The interpretability of the models makes them useful for biomarker discovery and patterns discovery in high dimensional data.
arXiv Detail & Related papers (2022-08-11T13:55:04Z)
- Explainable Mixed Data Representation and Lossless Visualization Toolkit for Knowledge Discovery [7.005458308454871]
Developing Machine Learning algorithms for heterogeneous/mixed data is a longstanding problem.
Many ML algorithms are not applicable to mixed data, which include numeric and non-numeric data, text, graphs and so on.
This paper presents a classification of mixed data types, analyzes their importance for ML, and presents the experimental toolkit developed to deal with mixed data.
arXiv Detail & Related papers (2022-06-13T21:14:58Z)
- Multimodal Data Fusion in High-Dimensional Heterogeneous Datasets via Generative Models [16.436293069942312]
We are interested in learning probabilistic generative models from high-dimensional heterogeneous data in an unsupervised fashion.
We propose a general framework that combines disparate data types through the exponential family of distributions.
The proposed algorithm is presented in detail for the commonly encountered heterogeneous datasets with real-valued (Gaussian) and categorical (multinomial) features.
arXiv Detail & Related papers (2021-08-27T18:10:31Z)
- Self-service Data Classification Using Interactive Visualization and Interpretable Machine Learning [9.13755431537592]
Iterative Visual Logical Classifier (IVLC) is an interpretable machine learning algorithm.
IVLC is especially helpful when dealing with sensitive and crucial data like cancer data in the medical domain.
This chapter proposes an automated classification approach combined with a new Coordinate Order (COO) algorithm and a genetic algorithm.
arXiv Detail & Related papers (2021-07-11T05:39:14Z)
- Machine Learning Pipeline for Pulsar Star Dataset [58.720142291102135]
This work brings together some of the most common machine learning (ML) algorithms.
The objective is to make a comparison at the level of obtained results from a set of unbalanced data.
arXiv Detail & Related papers (2020-05-03T23:35:44Z)