Categorical exploratory data analysis on goodness-of-fit issues
- URL: http://arxiv.org/abs/2011.09682v2
- Date: Fri, 4 Dec 2020 01:41:15 GMT
- Title: Categorical exploratory data analysis on goodness-of-fit issues
- Authors: Sabrina Enriquez, Fushing Hsieh
- Abstract summary: We propose to utilize the data analysis paradigm called Categorical Exploratory Data Analysis (CEDA).
CEDA brings out where and how each data set fits or deviates from the model shape via several important distributional aspects.
We produce graphic displays to illuminate the advantages of using CEDA as one primary way of data analysis in Data Science education.
- Score: 0.6091702876917279
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: If the aphorism "All models are wrong" (George Box) continues to be true in
data analysis, particularly when analyzing real-world data, then we should
annotate this wisdom with visible and explainable data-driven patterns. Such
annotations can critically shed invaluable light on validity as well as
limitations of statistical modeling as a data analysis approach. In an effort
to avoid holding our real data to potentially unattainable or even unrealistic
theoretical structures, we propose to utilize the data analysis paradigm called
Categorical Exploratory Data Analysis (CEDA). We illustrate the merits of this
proposal with two real-world data sets from the perspective of goodness-of-fit.
In both data sets, the Normal distribution's bell shape seemingly fits rather
well at first glance. We apply CEDA to bring out where and how each data set fits
or deviates from the model shape via several important distributional aspects.
We also demonstrate that CEDA affords a version of a tree-based p-value, and
compare it with p-values based on traditional statistical approaches. Throughout
our data analysis, we invest computational effort in making graphic displays
that illuminate the advantages of using CEDA as one primary way of data analysis
in Data Science education.
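The idea of locating where a sample departs from a fitted Normal shape, rather than reporting only a single global p-value, can be illustrated with a small sketch. This is not the CEDA procedure from the paper, whose details are not given in the abstract; it is a plain binned comparison with Pearson residuals, used here as a stand-in. The function name `binned_normal_residuals`, the choice of eight equal-probability bins, and the synthetic exponential sample are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, fmean, stdev


def binned_normal_residuals(data, n_bins=8):
    """Compare observed bin counts with counts expected under a fitted Normal.

    Hypothetical helper for illustration; not the CEDA algorithm.
    """
    model = NormalDist(fmean(data), stdev(data))
    n = len(data)
    # Interior bin edges sit at equal-probability quantiles of the fitted
    # Normal, so every bin expects n / n_bins observations.
    edges = [model.inv_cdf(k / n_bins) for k in range(1, n_bins)]
    observed = [0] * n_bins
    for x in data:
        # The bin index is the number of edges lying strictly below x.
        observed[sum(x > e for e in edges)] += 1
    expected = n / n_bins
    # Pearson residual per bin: the sign shows the direction of the deviation.
    residuals = [(o - expected) / expected ** 0.5 for o in observed]
    chi_square = sum(r * r for r in residuals)
    return observed, residuals, chi_square


random.seed(0)
# Synthetic right-skewed sample (an assumption, not data from the paper).
sample = [random.expovariate(1.0) for _ in range(400)]
observed, residuals, chi_square = binned_normal_residuals(sample)
for k, (o, r) in enumerate(zip(observed, residuals)):
    print(f"bin {k}: observed={o:3d}, Pearson residual={r:+.2f}")
print(f"chi-square statistic: {chi_square:.1f}")
```

Reading the residuals bin by bin shows which parts of the distribution (tails, shoulders, center) carry the misfit, which is the kind of localized information a single summary statistic hides.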
Related papers
- RealCQA-V2 : Visual Premise Proving [2.9201864249313383]
We introduce Visual Premise Proving, a novel task tailored to refine the process of chart question answering.
This approach represents a departure from conventional accuracy-based evaluation methods.
A model adept at reasoning is expected to demonstrate proficiency in both data retrieval and the structural understanding of charts.
arXiv Detail & Related papers (2024-10-29T19:32:53Z)
- Visual Data Diagnosis and Debiasing with Concept Graphs [50.84781894621378]
Deep learning models often pick up inherent biases in the data during the training process, leading to unreliable predictions.
We present CONBIAS, a novel framework for diagnosing and mitigating Concept co-occurrence Biases in visual datasets.
We show that by employing a novel clique-based concept balancing strategy, we can mitigate these imbalances, leading to enhanced performance on downstream tasks.
arXiv Detail & Related papers (2024-09-26T16:59:01Z)
- Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining.
We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure.
This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
arXiv Detail & Related papers (2024-07-20T17:14:31Z)
- Bayesian Federated Inference for Survival Models [0.0]
In cancer research, overall survival and progression free survival are often analyzed with the Cox model.
Merging data sets from different medical centers may help, but this is not always possible due to strict privacy legislation and logistic difficulties.
Recently, the Bayesian Federated Inference (BFI) strategy for generalized linear models was proposed.
arXiv Detail & Related papers (2024-04-26T15:05:26Z)
- DAGnosis: Localized Identification of Data Inconsistencies using Structures [73.39285449012255]
Identification and appropriate handling of inconsistencies in data at deployment time is crucial to reliably use machine learning models.
We use directed acyclic graphs (DAGs) to encode the probability distribution and independencies of the training set's features as a structure.
Our method, called DAGnosis, leverages these structural interactions to bring valuable and insightful data-centric conclusions.
arXiv Detail & Related papers (2024-02-26T11:29:16Z)
- PADME-SoSci: A Platform for Analytics and Distributed Machine Learning for the Social Sciences [4.294774517325059]
PADME is a distributed analytics tool that federates model implementation and training.
It enables the analysis of data across locations while still allowing the model to be trained as if all data were in a single location.
arXiv Detail & Related papers (2023-03-27T15:32:35Z)
- Why we should respect analysis results as data [0.0]
It is commonly overlooked that analyzing clinical study data also produces data in the form of results.
Although integrating and putting findings into context is a cornerstone of scientific work, analysis results are often neglected as a data source.
We propose a solution to "calculate once, use many times" by combining analysis results standards with a common data model.
arXiv Detail & Related papers (2022-04-21T08:34:07Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics [118.75207687144817]
We introduce Data Maps, a model-based tool to characterize and diagnose datasets.
We leverage a largely ignored source of information: the behavior of the model on individual instances during training.
Our results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.
arXiv Detail & Related papers (2020-09-22T20:19:41Z)
- Graph Embedding with Data Uncertainty [113.39838145450007]
Spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines.
Most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.