Related papers: Explainable Mixed Data Representation and Lossless Visualization Toolkit for Knowledge Discovery

URL: http://arxiv.org/abs/2206.06476v1
Date: Mon, 13 Jun 2022 21:14:58 GMT
Title: Explainable Mixed Data Representation and Lossless Visualization Toolkit for Knowledge Discovery
Authors: Boris Kovalerchuk, Elijah McCoy
Abstract summary: Developing Machine Learning algorithms for heterogeneous/mixed data is a longstanding problem. Many ML algorithms are not applicable to mixed data, which include numeric and non-numeric data, text, graphs, and so on. This paper presents a classification of mixed data types, analyzes their importance for ML, and presents the experimental toolkit developed to deal with mixed data.
Score: 7.005458308454871
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Developing Machine Learning (ML) algorithms for heterogeneous/mixed data is a longstanding problem. Many ML algorithms cannot be applied to mixed data, which include numeric and non-numeric data, text, graphs, and so on, to generate interpretable models. Another longstanding problem is developing algorithms for lossless visualization of multidimensional mixed data. Further progress in ML depends heavily on the success of interpretable ML algorithms for mixed data and of lossless interpretable visualization of multidimensional data. The latter allows developing interpretable ML models using visual knowledge discovery by end-users, who can bring valuable domain knowledge that is absent from the training data. The challenges for mixed data include: (1) generating numeric coding schemes for non-numeric attributes so that numeric ML algorithms can provide accurate and interpretable ML models, and (2) generating methods for lossless visualization of n-D non-numeric data and visual rule discovery in these visualizations. This paper presents a classification of mixed data types, analyzes their importance for ML, and presents the experimental toolkit developed to deal with mixed data. The toolkit combines the Data Types Editor and the VisCanvas data visualization and rule discovery system, which is available on GitHub.

Related papers

Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs [62.875934732547435]
Current multimodal large language models (MLLMs) often underperform on mathematical problem-solving tasks that require fine-grained visual understanding. In this paper, we evaluate the visual grounding capabilities of state-of-the-art MLLMs and reveal a significant negative correlation between visual grounding accuracy and problem-solving performance. We propose a novel approach, SVE-Math, featuring a geometric-grounded vision encoder and a feature router that dynamically adjusts the contribution of hierarchical visual feature maps.
arXiv Detail & Related papers (2025-01-11T04:08:44Z)
Medical artificial intelligence toolbox (MAIT): an explainable machine learning framework for binary classification, survival modelling, and regression analyses [0.0]
Medical Artificial Intelligence Toolbox (MAIT) is an explainable, open-source Python pipeline for developing and evaluating binary classification, regression, and survival models. MAIT addresses key challenges (e.g., high dimensionality, class imbalance, mixed variable types, and missingness) while promoting transparency in reporting. We provide detailed tutorials on GitHub, using four open-access data sets, to demonstrate how MAIT can be used to improve implementation and interpretation of ML models in medical research.
arXiv Detail & Related papers (2025-01-08T14:51:36Z)
Explainable Machine Learning for Categorical and Mixed Data with Lossless Visualization [3.4809730725241597]
This study proposes a classification of mixed data types and analyzes their important role in Machine Learning. It presents a toolkit for enforcing interpretability of all internal operations of ML algorithms on mixed data, together with visual exploration of mixed data. A new Sequential Rule Generation (SRG) algorithm for explainable rule generation with categorical data is proposed and successfully evaluated in multiple computational experiments.
arXiv Detail & Related papers (2023-05-29T00:41:32Z)
AI Model Disgorgement: Methods and Choices [127.54319351058167]
We introduce a taxonomy of possible disgorgement methods that are applicable to modern machine learning systems. We investigate the meaning of "removing the effects" of data in the trained model in a way that does not require retraining from scratch.
arXiv Detail & Related papers (2023-04-07T08:50:18Z)
Integrating Transformer and Autoencoder Techniques with Spectral Graph Algorithms for the Prediction of Scarcely Labeled Molecular Data [2.8360662552057323]
This work introduces three graph-based models incorporating Merriman-Bence-Osher (MBO) techniques to tackle this challenge. Specifically, graph-based modifications of the MBO scheme are integrated with state-of-the-art techniques, including a home-made transformer and an autoencoder. The proposed models are validated using five benchmark data sets.
arXiv Detail & Related papers (2022-11-12T22:45:32Z)
Learning Mixtures of Linear Dynamical Systems [94.49754087817931]
We develop a two-stage meta-algorithm to efficiently recover each ground-truth LDS model up to error $\tilde{O}(\sqrt{d/T})$. We validate our theoretical studies with numerical experiments, confirming the efficacy of the proposed algorithm.
arXiv Detail & Related papers (2022-01-26T22:26:01Z)
Distributionally Robust Semi-Supervised Learning Over Graphs [68.29280230284712]
Semi-supervised learning (SSL) over graph-structured data emerges in many network science applications. To efficiently manage learning over graphs, variants of graph neural networks (GNNs) have been developed recently. Despite their success in practice, most existing methods are unable to handle graphs with uncertain nodal attributes. Challenges also arise due to distributional uncertainties associated with data acquired by noisy measurements. A distributionally robust learning framework is developed, where the objective is to train models that exhibit quantifiable robustness against perturbations.
arXiv Detail & Related papers (2021-10-20T14:23:54Z)
PyHard: a novel tool for generating hardness embeddings to support data-centric analysis [0.38233569758620045]
PyHard produces a hardness embedding of a dataset relating the predictive performance of multiple ML models. The user can interact with this embedding in multiple ways to obtain useful insights about data and algorithmic performance. We show in a COVID prognosis dataset how this analysis supported the identification of pockets of hard observations that challenge ML models.
arXiv Detail & Related papers (2021-09-29T14:08:26Z)
An Introduction to Robust Graph Convolutional Networks [71.68610791161355]
We propose a novel Robust Graph Convolutional Neural Network for possibly erroneous single-view or multi-view data. By incorporating extra layers via autoencoders into traditional graph convolutional networks, we characterize and handle typical error models explicitly.
arXiv Detail & Related papers (2021-03-27T04:47:59Z)
Graph Embedding with Data Uncertainty [113.39838145450007]
Spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines. Most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z)
Visualisation and knowledge discovery from interpretable models [0.0]
We introduce a few intrinsically interpretable models which are also capable of dealing with missing values. We have demonstrated the algorithms on a synthetic dataset and a real-world one.
arXiv Detail & Related papers (2020-05-07T17:37:06Z)
Injective Domain Knowledge in Neural Networks for Transprecision Computing [17.300144121921882]
This paper studies the improvements that can be obtained by integrating prior knowledge when dealing with a non-trivial learning task. The results clearly show that ML models exploiting problem-specific information outperform the purely data-driven ones, with an average accuracy improvement around 38%.
arXiv Detail & Related papers (2020-02-24T12:58:56Z)
Data Augmentation for Histopathological Images Based on Gaussian-Laplacian Pyramid Blending [59.91656519028334]
Data imbalance is a major problem that affects several machine learning (ML) algorithms. In this paper, we propose a novel approach capable of not only augmenting a histopathological image (HI) dataset but also distributing the inter-patient variability. Experimental results on the BreakHis dataset have shown promising gains vis-a-vis the majority of data augmentation (DA) techniques presented in the literature.
arXiv Detail & Related papers (2020-01-31T22:02:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.