Related papers: A system for exploring big data: an iterative k-means searchlight for outlier detection on open health data

URL: http://arxiv.org/abs/2304.02189v1
Date: Wed, 5 Apr 2023 02:09:15 GMT
Title: A system for exploring big data: an iterative k-means searchlight for outlier detection on open health data
Authors: A. Ravishankar Rao, Daniel Clarke, Subrata Garai, Soumyabrata Dey
Abstract summary: We present a system that explores multiple combinations of variables using a searchlight technique and identifies outliers. We illustrate this system by analyzing open health care data released by New York State. Several anomalous trends in the data are identified, including cost overruns at specific hospitals, and increases in diagnoses such as suicides.
Score: 0.4588028371034407
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The interactive exploration of large and evolving datasets is challenging as relationships between underlying variables may not be fully understood. There may be hidden trends and patterns in the data that are worthy of further exploration and analysis. We present a system that methodically explores multiple combinations of variables using a searchlight technique and identifies outliers. An iterative k-means clustering algorithm is applied to features derived through a split-apply-combine paradigm used in the database literature. Outliers are identified as singleton or small clusters. This algorithm is swept across the dataset in a searchlight manner. The dimensions that contain outliers are combined in pairs with other dimensions using a subset scan technique to gain further insight into the outliers. We illustrate this system by analyzing open health care data released by New York State. We apply our iterative k-means searchlight followed by subset scanning. Several anomalous trends in the data are identified, including cost overruns at specific hospitals, and increases in diagnoses such as suicides. These constitute novel findings in the literature, and are of potential use to regulatory agencies, policy makers and concerned citizens.
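
The core idea in the abstract (split-apply-combine features, then k-means with singleton or small clusters treated as outliers) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the farthest-point initialization, the `max_small` threshold, the rule of removing flagged groups and re-clustering, and the toy hospital records are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def split_apply_combine(records):
    """Split (key, value) records by key, apply a mean, combine into 1-D features."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {key: (sum(vals) / len(vals),) for key, vals in groups.items()}


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means on equal-length tuples; returns one label per point."""
    # Assumption: deterministic farthest-point initialization, so runs are repeatable.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
        if new_labels == labels:
            break
        labels = new_labels
    return labels


def iterative_kmeans_outliers(features, k=2, max_small=1, max_rounds=5):
    """Flag members of clusters with at most max_small points, remove them, repeat."""
    remaining = dict(features)
    outliers = []
    for _ in range(max_rounds):
        if len(remaining) <= k:
            break
        names = list(remaining)
        labels = kmeans([remaining[n] for n in names], k)
        sizes = Counter(labels)
        small = [n for n, lab in zip(names, labels) if sizes[lab] <= max_small]
        if not small:
            break  # no singleton or small clusters left
        outliers.extend(small)
        for n in small:
            del remaining[n]
    return outliers


# Hypothetical (hospital, cost) records; hospital "F" has unusually high costs.
records = [("A", 9.0), ("A", 11.0), ("B", 11.0), ("C", 12.0), ("D", 10.0),
           ("D", 11.0), ("E", 11.5), ("F", 48.0), ("F", 52.0)]
flagged = iterative_kmeans_outliers(split_apply_combine(records), k=2)
print(flagged)
```

In this toy run the per-hospital mean cost is the only feature. A searchlight in the abstract's sense would repeat the same procedure over many variable combinations, which this sketch does not attempt.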

Related papers

The importance of the clustering model to detect new types of intrusion in data traffic [0.0]
The presented work uses the K-means algorithm, which is a popular clustering technique. Data was gathered utilizing Kali Linux environment, cicflowmeter traffic, and Putty Software tools. The model counted the attacks and assigned numbers to each one of them.
arXiv Detail & Related papers (2024-11-21T19:40:31Z)
Categorical Data Clustering via Value Order Estimated Distance Metric Learning [53.28598689867732]
This paper introduces a novel order distance metric learning approach to intuitively represent categorical attribute values. A new joint learning paradigm is developed to alternatively perform clustering and order distance metric learning. The proposed method achieves superior clustering accuracy on categorical and mixed datasets.
arXiv Detail & Related papers (2024-11-19T08:23:25Z)
DeepHYDRA: Resource-Efficient Time-Series Anomaly Detection in Dynamically-Configured Systems [3.44012349879073]
We present DeepHYDRA (Deep Hybrid DBSCAN/Reduction-Based Anomaly Detection). It combines DBSCAN and learning-based anomaly detection. It is shown to reliably detect different types of anomalies in both large and complex datasets.
arXiv Detail & Related papers (2024-05-13T13:47:15Z)
On the Universal Adversarial Perturbations for Efficient Data-free Adversarial Detection [55.73320979733527]
We propose a data-agnostic adversarial detection framework, which induces different responses between normal and adversarial samples to UAPs. Experimental results show that our method achieves competitive detection performance on various text classification tasks.
arXiv Detail & Related papers (2023-06-27T02:54:07Z)
Applied Deep Learning to Identify and Localize Polyps from Endoscopic Images [0.0]
We have aimed at open sourcing a dataset which contains annotations of polyps and ulcers. This is the first dataset from India containing polyp and ulcer images. We evaluated our dataset with several popular deep learning object detection models that are trained on large publicly available datasets.
arXiv Detail & Related papers (2023-01-22T22:14:25Z)
MURAL: An Unsupervised Random Forest-Based Embedding for Electronic Health Record Data [59.26381272149325]
We present an unsupervised random forest for representing data with disparate variable types. MURAL forests consist of a set of decision trees where node-splitting variables are chosen at random. We show that using our approach, we can visualize and classify data more accurately than competing approaches.
arXiv Detail & Related papers (2021-11-19T22:02:21Z)
Deep Co-Attention Network for Multi-View Subspace Learning [73.3450258002607]
We propose a deep co-attention network for multi-view subspace learning. It aims to extract both the common information and the complementary information in an adversarial setting. In particular, it uses a novel cross reconstruction loss and leverages the label information to guide the construction of the latent representation.
arXiv Detail & Related papers (2021-02-15T18:46:44Z)
Contrastive analysis for scatter plot-based representations of dimensionality reduction [0.0]
This paper introduces a methodology to explore multidimensional datasets and interpret clusters' formation. We also introduce a bipartite graph to visually interpret and explore the relationship between the statistical variables used to understand how the attributes influenced cluster formation.
arXiv Detail & Related papers (2021-01-26T01:16:31Z)
Deep Semi-Supervised Embedded Clustering (DSEC) for Stratification of Heart Failure Patients [50.48904066814385]
In this work we apply deep semi-supervised embedded clustering to determine data-driven patient subgroups of heart failure. We find clinically relevant clusters from an embedded space derived from heterogeneous data. The proposed algorithm can potentially find new undiagnosed subgroups of patients that have different outcomes.
arXiv Detail & Related papers (2020-12-24T12:56:46Z)
Adversarial Examples for $k$-Nearest Neighbor Classifiers Based on Higher-Order Voronoi Diagrams [69.4411417775822]
Adversarial examples are a widely studied phenomenon in machine learning models. We propose an algorithm for evaluating the adversarial robustness of $k$-nearest neighbor classification.
arXiv Detail & Related papers (2020-11-19T08:49:10Z)
Visual Neural Decomposition to Explain Multivariate Data Sets [13.117139248511783]
Investigating relationships between variables in multi-dimensional data sets is a common task for data analysts and engineers. We propose a novel approach to visualize correlations between input variables and a target output variable that scales to hundreds of variables.
arXiv Detail & Related papers (2020-09-11T15:53:37Z)
A Systematic Approach to Featurization for Cancer Drug Sensitivity Predictions with Deep Learning [49.86828302591469]
We train >35,000 neural network models, sweeping over common featurization techniques. We found the RNA-seq to be highly redundant and informative even with subsets larger than 128 features.
arXiv Detail & Related papers (2020-04-30T20:42:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.