PyODDS: An End-to-end Outlier Detection System with Automated Machine
Learning
- URL: http://arxiv.org/abs/2003.05602v1
- Date: Thu, 12 Mar 2020 03:30:30 GMT
- Title: PyODDS: An End-to-end Outlier Detection System with Automated Machine
Learning
- Authors: Yuening Li, Daochen Zha, Praveen Kumar Venugopal, Na Zou, and Xia Hu
- Abstract summary: We present PyODDS, an automated end-to-end Python system for Outlier Detection with Database Support.
Specifically, we define the search space in the outlier detection pipeline, and produce a search strategy within the given search space.
It also provides unified interfaces and visualizations for users with or without data science or machine learning background.
- Score: 55.32009000204512
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Outlier detection is an important task for various data mining applications.
Current outlier detection techniques are often manually designed for specific
domains, requiring substantial human effort for database setup, algorithm selection,
and hyper-parameter tuning. To fill this gap, we present PyODDS, an automated
end-to-end Python system for Outlier Detection with Database Support, which
automatically optimizes an outlier detection pipeline for a new data source at
hand. Specifically, we define the search space in the outlier detection
pipeline, and produce a search strategy within the given search space. PyODDS
enables end-to-end executions based on an Apache Spark backend server and a
light-weight database. It also provides unified interfaces and visualizations
for users with or without data science or machine learning background. In
particular, we demonstrate PyODDS on several real-world datasets, with
quantitative analysis and visualization results.
Related papers
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv (2024-07-01T18:58:22Z)
- Towards Personalized Preprocessing Pipeline Search [52.59156206880384]
ClusterP3S is a novel framework for Personalized Preprocessing Pipeline Search via Clustering.
We propose a hierarchical search strategy to jointly learn the clusters and search for the optimal pipelines.
Experiments on benchmark classification datasets demonstrate the effectiveness of enabling feature-wise preprocessing pipeline search.
arXiv (2023-02-28T05:45:05Z)
- AutoSlicer: Scalable Automated Data Slicing for ML Model Analysis [3.3446830960153555]
We present AutoSlicer, a scalable system that searches for problematic slices through distributed metric computation and hypothesis testing.
In the experiments, we show that our search strategy finds most of the anomalous slices by inspecting a small portion of the search space.
arXiv (2022-12-18T07:49:17Z)
- Lightweight Automated Feature Monitoring for Data Streams [1.4658400971135652]
We propose a flexible system, Feature Monitoring (FM), that detects data drifts in such data sets.
It monitors all features that are used by the system, while providing an interpretable features ranking whenever an alarm occurs.
This illustrates how FM eliminates the need to add custom signals to detect specific types of problems and that monitoring the available space of features is often enough.
arXiv (2022-07-18T14:38:11Z)
- DataLab: A Platform for Data Analysis and Intervention [96.75253335629534]
DataLab is a unified data-oriented platform that allows users to interactively analyze the characteristics of data.
DataLab has features for dataset recommendation and global vision analysis.
So far, DataLab covers 1,715 datasets and 3,583 transformed versions of them.
arXiv (2022-02-25T18:32:19Z)
- Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets from 3D Scans [103.92680099373567]
This paper introduces a pipeline to parametrically sample and render multi-task vision datasets from comprehensive 3D scans from the real world.
Changing the sampling parameters allows one to "steer" the generated datasets to emphasize specific information.
Common architectures trained on a generated starter dataset reached state-of-the-art performance on multiple common vision tasks and benchmarks.
arXiv (2021-10-11T04:21:46Z)
- Laser2Vec: Similarity-based Retrieval for Robotic Perception Data [7.538482310185135]
This paper implements a system for storing 2D LiDAR data from many deployments cheaply and evaluating top-k queries for complete or partial scans efficiently.
We generate compressed representations of laser scans via a convolutional variational autoencoder and store them in a database.
We find our system accurately and efficiently identifies similar scans across a number of episodes where the robot encountered the same location.
arXiv (2020-07-30T21:11:50Z)
- AutoOD: Automated Outlier Detection via Curiosity-guided Search and Self-imitation Learning [72.99415402575886]
Outlier detection is an important data mining task with numerous practical applications.
We propose AutoOD, an automated outlier detection framework, which aims to search for an optimal neural network model.
Experimental results on various real-world benchmark datasets demonstrate that the deep model identified by AutoOD achieves the best performance.
arXiv (2020-06-19T18:57:51Z)
- An Intelligent and Time-Efficient DDoS Identification Framework for Real-Time Enterprise Networks SAD-F: Spark Based Anomaly Detection Framework [0.5811502603310248]
We will be exploring security analytic techniques for DDoS anomaly detection using different machine learning techniques.
In this paper, we are proposing a novel approach which deals with real traffic as input to the system.
We study and compare the performance factor of our proposed framework on three different testbeds.
arXiv (2020-01-21T06:05:48Z)