KAXAI: An Integrated Environment for Knowledge Analysis and Explainable
AI
- URL: http://arxiv.org/abs/2401.00193v1
- Date: Sat, 30 Dec 2023 10:20:47 GMT
- Title: KAXAI: An Integrated Environment for Knowledge Analysis and Explainable
AI
- Authors: Saikat Barua, Dr. Sifat Momen
- Abstract summary: The paper describes the design of a system that integrates AutoML, XAI, and synthetic data generation.
The system allows users to navigate and harness the power of machine learning while abstracting its complexities and providing high usability.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In order to fully harness the potential of machine learning, it is crucial to
establish a system that renders the field more accessible and less daunting for
individuals who may not possess a comprehensive understanding of its
intricacies. The paper describes the design of a system that integrates AutoML,
XAI, and synthetic data generation to provide a great UX design for users. The
system allows users to navigate and harness the power of machine learning while
abstracting its complexities and providing high usability. The paper proposes
two novel classifiers, Logistic Regression Forest and Support Vector Tree, for
enhanced model performance, achieving 96\% accuracy on a diabetes dataset and
93\% on a survey dataset. The paper also introduces a model-dependent local
interpreter called MEDLEY and evaluates its interpretation against LIME,
Greedy, and Parzen. Additionally, the paper introduces LLM-based synthetic data
generation, library-based data generation, and enhancing the original dataset
with GAN. The findings on synthetic data suggest that enhancing the original
dataset with GAN is the most reliable way to generate synthetic data, as
evidenced by KS tests, standard deviation, and feature importance. The authors
also found that GAN works best for quantitative datasets.
Related papers
- Large Language Models and Synthetic Data for Monitoring Dataset Mentions in Research Papers [0.0]
This paper presents a machine learning framework that automates dataset mention detection across research domains.
We employ zero-shot extraction from research papers, an LLM-as-a-Judge for quality assessment, and a reasoning agent for refinement to generate a weakly supervised synthetic dataset.
At inference, a ModernBERT-based classifier efficiently filters dataset mentions, reducing computational overhead while maintaining high recall.
arXiv Detail & Related papers (2025-02-14T16:16:02Z) - Efficient Multi-Agent System Training with Data Influence-Oriented Tree Search [59.75749613951193]
We propose Data Influence-oriented Tree Search (DITS) to guide both tree search and data selection.
By leveraging influence scores, we effectively identify the most impactful data for system improvement.
We derive influence score estimation methods tailored for non-differentiable metrics.
arXiv Detail & Related papers (2025-02-02T23:20:16Z) - Creating Artificial Students that Never Existed: Leveraging Large Language Models and CTGANs for Synthetic Data Generation [2.4374097382908477]
We investigate whether synthetic data can be leveraged to create artificial students for serving learning analytics models.
Our results demonstrate the strong potential of these methods to produce high-quality synthetic datasets that resemble real students data.
arXiv Detail & Related papers (2025-01-03T12:52:51Z) - Exploring the Landscape for Generative Sequence Models for Specialized Data Synthesis [0.0]
This paper introduces a novel approach that leverages three generative models of varying complexity to synthesize Malicious Network Traffic.
Our approach transforms numerical data into text, re-framing data generation as a language modeling task.
Our method surpasses state-of-the-art generative models in producing high-fidelity synthetic data.
arXiv Detail & Related papers (2024-11-04T09:51:10Z) - Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLM) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - FLIGAN: Enhancing Federated Learning with Incomplete Data using GAN [1.5749416770494706]
Federated Learning (FL) provides a privacy-preserving mechanism for distributed training of machine learning models on networked devices.
We propose FLIGAN, a novel approach to address the issue of data incompleteness in FL.
Our methodology adheres to FL's privacy requirements by generating synthetic data in a federated manner without sharing the actual data in the process.
arXiv Detail & Related papers (2024-03-25T16:49:38Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - STAR: Boosting Low-Resource Information Extraction by Structure-to-Text
Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z) - TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Synthetic Benchmarks for Scientific Research in Explainable Machine
Learning [14.172740234933215]
We release XAI-Bench: a suite of synthetic datasets and a library for benchmarking feature attribution algorithms.
Unlike real-world datasets, synthetic datasets allow the efficient computation of conditional expected values.
We demonstrate the power of our library by benchmarking popular explainability techniques across several evaluation metrics and identifying failure modes for popular explainers.
arXiv Detail & Related papers (2021-06-23T17:10:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.