Data+Shift: Supporting visual investigation of data distribution shifts
by data scientists
- URL: http://arxiv.org/abs/2204.14025v1
- Date: Fri, 29 Apr 2022 11:50:25 GMT
- Title: Data+Shift: Supporting visual investigation of data distribution shifts
by data scientists
- Authors: Jo\~ao Palmeiro, Beatriz Malveiro, Rita Costa, David Polido, Ricardo
Moreira, Pedro Bizarro
- Abstract summary: Data+Shift is a visual analytics tool to support data scientists in the task of investigating the underlying factors of shift in data features.
We validated our approach with a think-aloud experiment where a data scientist used the tool for a fraud detection use case.
- Score: 1.6311150636417262
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Machine learning on data streams is increasingly more present in multiple
domains. However, there is often data distribution shift that can lead machine
learning models to make incorrect decisions. While there are automatic methods
to detect when drift is happening, human analysis, often by data scientists, is
essential to diagnose the causes of the problem and adjust the system. We
propose Data+Shift, a visual analytics tool to support data scientists in the
task of investigating the underlying factors of shift in data features in the
context of fraud detection. Design requirements were derived from interviews
with data scientists. Data+Shift is integrated with JupyterLab and can be used
alongside other data science tools. We validated our approach with a
think-aloud experiment where a data scientist used the tool for a fraud
detection use case.
Related papers
- D3A-TS: Denoising-Driven Data Augmentation in Time Series [0.0]
This work focuses on studying and analyzing the use of different techniques for data augmentation in time series for classification and regression problems.
The proposed approach involves the use of diffusion probabilistic models, which have recently achieved successful results in the field of Image Processing.
The results highlight the high utility of this methodology in creating synthetic data to train classification and regression models.
arXiv Detail & Related papers (2023-12-09T11:37:07Z) - Adversarial Learning for Feature Shift Detection and Correction [45.65548560695731]
Feature shifts can occur in many datasets, including in multi-sensor data, where some sensors are malfunctioning, or in structured data, where faulty standardization and data processing pipelines can lead to erroneous features.
In this work, we explore using the principles of adversarial learning, where the information from several discriminators trained to distinguish between two distributions is used to both detect the corrupted features and fix them in order to remove the distribution shift between datasets.
arXiv Detail & Related papers (2023-12-07T18:58:40Z) - Binary Quantification and Dataset Shift: An Experimental Investigation [54.14283123210872]
Quantification is the supervised learning task that consists of training predictors of the class prevalence values of sets of unlabelled data.
The relationship between quantification and other types of dataset shift remains, by and large, unexplored.
We propose a fine-grained taxonomy of types of dataset shift, by establishing protocols for the generation of datasets affected by these types of shift.
arXiv Detail & Related papers (2023-10-06T20:11:27Z) - Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow [49.724842920942024]
Industries such as finance, meteorology, and energy generate vast amounts of data daily.
We propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests.
arXiv Detail & Related papers (2023-06-12T16:12:56Z) - A Vision for Semantically Enriched Data Science [19.604667287258724]
Key areas such as utilizing domain knowledge and data semantics are areas where we have seen little automation.
We envision how leveraging "semantic" understanding and reasoning on data in combination with novel tools for data science automation can help with consistent and explainable data augmentation and transformation.
arXiv Detail & Related papers (2023-03-02T16:03:12Z) - Time-Varying Propensity Score to Bridge the Gap between the Past and Present [104.46387765330142]
We introduce a time-varying propensity score that can detect gradual shifts in the distribution of data.
We demonstrate different ways of implementing it and evaluate it on a variety of problems.
arXiv Detail & Related papers (2022-10-04T07:21:49Z) - A unified framework for dataset shift diagnostics [2.449909275410288]
Supervised learning techniques typically assume training data originates from the target population.
Yet, dataset shift frequently arises, which, if not adequately taken into account, may decrease the performance of their predictors.
We propose a novel and flexible framework called DetectShift that quantifies and tests for multiple dataset shifts.
arXiv Detail & Related papers (2022-05-17T13:34:45Z) - DataLab: A Platform for Data Analysis and Intervention [96.75253335629534]
DataLab is a unified data-oriented platform that allows users to interactively analyze the characteristics of data.
toolname has features for dataset recommendation and global vision analysis.
So far, DataLab covers 1,715 datasets and 3,583 of its transformed version.
arXiv Detail & Related papers (2022-02-25T18:32:19Z) - MedShift: identifying shift data for medical dataset curation [2.4236602474594635]
Methods to detect shift or variance in data have not been significantly researched.
We propose a unified pipeline called MedShift to detect top-level shift samples.
We verify the efficacy of MedShift with musculoskeletal radiographs (MURA) and chest X-rays datasets from more than one external source.
arXiv Detail & Related papers (2021-12-27T20:06:23Z) - DAE : Discriminatory Auto-Encoder for multivariate time-series anomaly
detection in air transportation [68.8204255655161]
We propose a novel anomaly detection model called Discriminatory Auto-Encoder (DAE)
It uses the baseline of a regular LSTM-based auto-encoder but with several decoders, each getting data of a specific flight phase.
Results show that the DAE achieves better results in both accuracy and speed of detection.
arXiv Detail & Related papers (2021-09-08T14:07:55Z) - Hidden Biases in Unreliable News Detection Datasets [60.71991809782698]
We show that selection bias during data collection leads to undesired artifacts in the datasets.
We observed a significant drop (>10%) in accuracy for all models tested in a clean split with no train/test source overlap.
We suggest future dataset creation include a simple model as a difficulty/bias probe and future model development use a clean non-overlapping site and date split.
arXiv Detail & Related papers (2021-04-20T17:16:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.