Related papers: PyODDS: An End-to-end Outlier Detection System with Automated Machine Learning

PyODDS: An End-to-end Outlier Detection System with Automated Machine Learning

URL: http://arxiv.org/abs/2003.05602v1
Date: Thu, 12 Mar 2020 03:30:30 GMT
Title: PyODDS: An End-to-end Outlier Detection System with Automated Machine Learning
Authors: Yuening Li, Daochen Zha, Praveen Kumar Venugopal, Na Zou, and Xia Hu
Abstract summary: We present PyODDS, an automated end-to-end Python system for Outlier Detection with Database Support. Specifically, we define the search space in the outlier detection pipeline, and produce a search strategy within the given search space. It also provides unified interfaces and visualizations for users with or without data science or machine learning background.
Score: 55.32009000204512
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Outlier detection is an important task for various data mining applications. Current outlier detection techniques are often manually designed for specific domains, requiring large human efforts of database setup, algorithm selection, and hyper-parameter tuning. To fill this gap, we present PyODDS, an automated end-to-end Python system for Outlier Detection with Database Support, which automatically optimizes an outlier detection pipeline for a new data source at hand. Specifically, we define the search space in the outlier detection pipeline, and produce a search strategy within the given search space. PyODDS enables end-to-end executions based on an Apache Spark backend server and a light-weight database. It also provides unified interfaces and visualizations for users with or without data science or machine learning background. In particular, we demonstrate PyODDS on several real-world datasets, with quantification analysis and visualization results.

Related papers

AI-Powered Data Visualization Platform: An Intelligent Web Application for Automated Dataset Analysis [0.0]
The system establishes the process of AI-based analysis and visualization from the context of data-driven environments.<n>Key contributions include automatic and intelligent data cleaning, with imputation for missing values, and detection of outliers.<n>The initial analysis was performed in real-time on datasets as large as 100000 rows, while the cloud-based demand platform scales to meet requests.
arXiv Detail & Related papers (2025-11-11T15:39:09Z)
AUTO-Explorer: Automated Data Collection for GUI Agent [58.58097564914626]
We propose an automated data collection method with minimal annotation costs, named Auto-Explorer.<n>It incorporates a simple yet effective exploration mechanism that autonomously parses and explores GUI environments.<n>Using the data gathered, we fine-tune a multimodal large language model (MLLM) and establish a GUI element grounding testing set.
arXiv Detail & Related papers (2025-11-09T15:13:45Z)
PyOD 2: A Python Library for Outlier Detection with LLM-powered Model Selection [18.108558631930766]
Outlier detection (OD) is a critical machine learning (ML) task with applications in fraud detection, network intrusion detection, clickstream analysis, recommendation systems, and social network moderation. PyOD is the most widely adopted library for OD, with over 8,500 GitHub stars, 25 million downloads, and diverse industry usage. PyOD Version 2 (PyOD 2) integrates 12 state-of-the-art deep learning models into a PyTorch framework and introduces a large language model (LLM)-based pipeline for automated OD model selection.
arXiv Detail & Related papers (2024-12-11T07:53:20Z)
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering. Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
Towards Personalized Preprocessing Pipeline Search [52.59156206880384]
ClusterP3S is a novel framework for Personalized Preprocessing Pipeline Search via Clustering. We propose a hierarchical search strategy to jointly learn the clusters and search for the optimal pipelines. Experiments on benchmark classification datasets demonstrate the effectiveness of enabling feature-wise preprocessing pipeline search.
arXiv Detail & Related papers (2023-02-28T05:45:05Z)
AutoSlicer: Scalable Automated Data Slicing for ML Model Analysis [3.3446830960153555]
We present Autoslicer, a scalable system that searches for problematic slices through distributed metric computation and hypothesis testing. In the experiments, we show that our search strategy finds most of the anomalous slices by inspecting a small portion of the search space.
arXiv Detail & Related papers (2022-12-18T07:49:17Z)
Lightweight Automated Feature Monitoring for Data Streams [1.4658400971135652]
We propose a flexible system, Feature Monitoring (FM), that detects data drifts in such data sets. It monitors all features that are used by the system, while providing an interpretable features ranking whenever an alarm occurs. This illustrates how FM eliminates the need to add custom signals to detect specific types of problems and that monitoring the available space of features is often enough.
arXiv Detail & Related papers (2022-07-18T14:38:11Z)
DataLab: A Platform for Data Analysis and Intervention [96.75253335629534]
DataLab is a unified data-oriented platform that allows users to interactively analyze the characteristics of data. toolname has features for dataset recommendation and global vision analysis. So far, DataLab covers 1,715 datasets and 3,583 of its transformed version.
arXiv Detail & Related papers (2022-02-25T18:32:19Z)
Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets from 3D Scans [103.92680099373567]
This paper introduces a pipeline to parametrically sample and render multi-task vision datasets from comprehensive 3D scans from the real world. Changing the sampling parameters allows one to "steer" the generated datasets to emphasize specific information. Common architectures trained on a generated starter dataset reached state-of-the-art performance on multiple common vision tasks and benchmarks.
arXiv Detail & Related papers (2021-10-11T04:21:46Z)
TODS: An Automated Time Series Outlier Detection System [70.88663649631857]
TODS is a highly modular system that supports easy pipeline construction.<n>Tods supports 70 primitives, including data processing, time series processing, feature analysis, detection algorithms, and a reinforcement module.
arXiv Detail & Related papers (2020-09-18T15:36:43Z)
Laser2Vec: Similarity-based Retrieval for Robotic Perception Data [7.538482310185135]
This paper implements a system for storing 2D LiDAR data from many deployments cheaply and evaluating top-k queries for complete or partial scans efficiently. We generate compressed representations of laser scans via a convolutional variational autoencoder and store them in a database. We find our system accurately and efficiently identifies similar scans across a number of episodes where the robot encountered the same location.
arXiv Detail & Related papers (2020-07-30T21:11:50Z)
AutoOD: Automated Outlier Detection via Curiosity-guided Search and Self-imitation Learning [72.99415402575886]
Outlier detection is an important data mining task with numerous practical applications. We propose AutoOD, an automated outlier detection framework, which aims to search for an optimal neural network model. Experimental results on various real-world benchmark datasets demonstrate that the deep model identified by AutoOD achieves the best performance.
arXiv Detail & Related papers (2020-06-19T18:57:51Z)
An Intelligent and Time-Efficient DDoS Identification Framework for Real-Time Enterprise Networks SAD-F: Spark Based Anomaly Detection Framework [0.5811502603310248]
We will be exploring security analytic techniques for DDoS anomaly detection using different machine learning techniques. In this paper, we are proposing a novel approach which deals with real traffic as input to the system. We study and compare the performance factor of our proposed framework on three different testbeds.
arXiv Detail & Related papers (2020-01-21T06:05:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.