A system for exploring big data: an iterative k-means searchlight for
outlier detection on open health data
- URL: http://arxiv.org/abs/2304.02189v1
- Date: Wed, 5 Apr 2023 02:09:15 GMT
- Title: A system for exploring big data: an iterative k-means searchlight for
outlier detection on open health data
- Authors: A. Ravishankar Rao, Daniel Clarke, Subrata Garai, Soumyabrata Dey
- Abstract summary: We present a system that explores multiple combinations of variables using a searchlight technique and identifies outliers.
We illustrate this system by analyzing open health care data released by New York State.
Several anomalous trends in the data are identified, including cost overruns at specific hospitals, and increases in diagnoses such as suicides.
- Score: 0.4588028371034407
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The interactive exploration of large and evolving datasets is challenging as
relationships between underlying variables may not be fully understood. There
may be hidden trends and patterns in the data that are worthy of further
exploration and analysis. We present a system that methodically explores
multiple combinations of variables using a searchlight technique and identifies
outliers. An iterative k-means clustering algorithm is applied to features
derived through a split-apply-combine paradigm used in the database literature.
Outliers are identified as singleton or small clusters. This algorithm is swept
across the dataset in a searchlight manner. The dimensions that contain
outliers are combined in pairs with other dimensions using a subset scan
technique to gain further insight into the outliers. We illustrate this system
by analyzing open health care data released by New York State. We apply our
iterative k-means searchlight followed by subset scanning. Several anomalous
trends in the data are identified, including cost overruns at specific
hospitals, and increases in diagnoses such as suicides. These constitute novel
findings in the literature, and are of potential use to regulatory agencies,
policy makers and concerned citizens.
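The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the record fields (`hospital`, `diagnosis`, `cost`), the function names, the parameters `k`, `max_size` and `min_mads`, and the toy data are all hypothetical. The abstract does not say when the iteration stops, so the sketch adds an assumed guard: a member of a small cluster is flagged only if it lies more than `min_mads` median absolute deviations from the median. The subset scan step is omitted.

```python
from collections import defaultdict
from statistics import mean, median


def split_apply_combine(records, key, value):
    """Split records by `key`, apply a mean to `value`, combine into a dict."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec[value])
    return {group: mean(vals) for group, vals in groups.items()}


def kmeans_1d(values, k, iters=50):
    """Lloyd's k-means on scalars; centers start evenly spaced (requires k >= 2)."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # An empty cluster keeps its previous center.
            new_centers.append(mean(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def iterative_kmeans_outliers(features, k=3, max_size=1, min_mads=5.0, max_rounds=10):
    """Repeatedly cluster, flag members of small clusters, remove them, re-cluster."""
    remaining = dict(features)
    outliers = []
    for _ in range(max_rounds):
        if len(remaining) <= k:
            break
        names = list(remaining)
        vals = [remaining[n] for n in names]
        labels = kmeans_1d(vals, k)
        sizes = defaultdict(int)
        for lab in labels:
            sizes[lab] += 1
        # Assumed guard (not from the abstract): require a large robust deviation.
        med = median(vals)
        mad = median([abs(v - med) for v in vals])
        small = [
            n for n, v, lab in zip(names, vals, labels)
            if sizes[lab] <= max_size and abs(v - med) > min_mads * mad
        ]
        if not small:
            break
        outliers.extend(small)
        for n in small:
            del remaining[n]
    return outliers


def searchlight(records, keys, value, **kwargs):
    """Sweep the outlier detector across each grouping variable in turn."""
    return {
        key: iterative_kmeans_outliers(split_apply_combine(records, key, value), **kwargs)
        for key in keys
    }


# Toy data: six hypothetical hospitals, five diagnoses; H6 bills five times more.
records = [
    {"hospital": f"H{h}", "diagnosis": f"D{d}", "cost": 100 + h + d}
    for h in range(1, 7) for d in range(1, 6)
]
for rec in records:
    if rec["hospital"] == "H6":
        rec["cost"] *= 5

result = searchlight(records, ["hospital", "diagnosis"], "cost")
print(result)  # {'hospital': ['H6'], 'diagnosis': []}
```

In this toy run the sweep over `hospital` isolates H6 as a singleton cluster, while the sweep over `diagnosis` flags nothing because every diagnosis average is inflated equally. A real analysis would likely use multi-dimensional features and a library k-means rather than this scalar version.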
Related papers
- DeepHYDRA: Resource-Efficient Time-Series Anomaly Detection in Dynamically-Configured Systems [3.44012349879073]
We present DeepHYDRA (Deep Hybrid DBSCAN/Reduction-Based Anomaly Detection).
It combines DBSCAN and learning-based anomaly detection.
It is shown to reliably detect different types of anomalies in both large and complex datasets.
arXiv Detail & Related papers (2024-05-13T13:47:15Z)
- On the Universal Adversarial Perturbations for Efficient Data-free Adversarial Detection [55.73320979733527]
We propose a data-agnostic adversarial detection framework, which induces different responses between normal and adversarial samples to UAPs.
Experimental results show that our method achieves competitive detection performance on various text classification tasks.
arXiv Detail & Related papers (2023-06-27T02:54:07Z)
- Applied Deep Learning to Identify and Localize Polyps from Endoscopic Images [0.0]
We have aimed at open sourcing a dataset which contains annotations of polyps and ulcers.
This is the first dataset from India containing polyp and ulcer images.
We evaluated our dataset with several popular deep learning object detection models that are trained on large publicly available datasets.
arXiv Detail & Related papers (2023-01-22T22:14:25Z)
- MURAL: An Unsupervised Random Forest-Based Embedding for Electronic Health Record Data [59.26381272149325]
We present an unsupervised random forest for representing data with disparate variable types.
MURAL forests consist of a set of decision trees where node-splitting variables are chosen at random.
We show that using our approach, we can visualize and classify data more accurately than competing approaches.
arXiv Detail & Related papers (2021-11-19T22:02:21Z)
- COVID-19 Multidimensional Kaggle Literature Organization [3.201839066679614]
We show that factorization is a powerful unsupervised learning method capable of discovering hidden patterns in a document corpus.
We show that a higher-order representation of the corpus allows for the simultaneous grouping of similar articles, relevant journals, authors with similar research interests, and topic keywords.
arXiv Detail & Related papers (2021-07-17T06:16:36Z)
- Deep Co-Attention Network for Multi-View Subspace Learning [73.3450258002607]
We propose a deep co-attention network for multi-view subspace learning.
It aims to extract both the common information and the complementary information in an adversarial setting.
In particular, it uses a novel cross reconstruction loss and leverages the label information to guide the construction of the latent representation.
arXiv Detail & Related papers (2021-02-15T18:46:44Z)
- Contrastive analysis for scatter plot-based representations of dimensionality reduction [0.0]
This paper introduces a methodology to explore multidimensional datasets and interpret clusters' formation.
We also introduce a bipartite graph to visually interpret and explore the relationship between the statistical variables used to understand how the attributes influenced cluster formation.
arXiv Detail & Related papers (2021-01-26T01:16:31Z)
- Deep Semi-Supervised Embedded Clustering (DSEC) for Stratification of Heart Failure Patients [50.48904066814385]
In this work we apply deep semi-supervised embedded clustering to determine data-driven patient subgroups of heart failure.
We find clinically relevant clusters from an embedded space derived from heterogeneous data.
The proposed algorithm can potentially find new undiagnosed subgroups of patients that have different outcomes.
arXiv Detail & Related papers (2020-12-24T12:56:46Z)
- Adversarial Examples for $k$-Nearest Neighbor Classifiers Based on Higher-Order Voronoi Diagrams [69.4411417775822]
Adversarial examples are a widely studied phenomenon in machine learning models.
We propose an algorithm for evaluating the adversarial robustness of $k$-nearest neighbor classification.
arXiv Detail & Related papers (2020-11-19T08:49:10Z)
- Visual Neural Decomposition to Explain Multivariate Data Sets [13.117139248511783]
Investigating relationships between variables in multi-dimensional data sets is a common task for data analysts and engineers.
We propose a novel approach to visualize correlations between input variables and a target output variable that scales to hundreds of variables.
arXiv Detail & Related papers (2020-09-11T15:53:37Z)
- A Systematic Approach to Featurization for Cancer Drug Sensitivity Predictions with Deep Learning [49.86828302591469]
We train >35,000 neural network models, sweeping over common featurization techniques.
We found the RNA-seq to be highly redundant and informative even with subsets larger than 128 features.
arXiv Detail & Related papers (2020-04-30T20:42:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.