Repairing Systematic Outliers by Learning Clean Subspaces in VAEs
- URL: http://arxiv.org/abs/2207.08050v1
- Date: Sun, 17 Jul 2022 01:28:23 GMT
- Title: Repairing Systematic Outliers by Learning Clean Subspaces in VAEs
- Authors: Simao Eduardo, Kai Xu, Alfredo Nazabal, Charles Sutton
- Abstract summary: We propose Clean Subspace Variational Autoencoder (CLSVAE), a novel semi-supervised model for detection and automated repair of systematic errors.
CLSVAE is effective with much less labelled data compared to previous models, often with less than 2% of the data.
We provide experiments using three image datasets in scenarios with different levels of corruption and labelled set sizes.
- Score: 31.298063226774115
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data cleaning often comprises outlier detection and data repair. Systematic
errors result from nearly deterministic transformations that occur repeatedly
in the data, e.g. specific image pixels being set to default values or
watermarks. Consequently, models with enough capacity easily overfit to these
errors, making detection and repair difficult. Seeing as a systematic outlier
is a combination of patterns of a clean instance and systematic error patterns,
our main insight is that inliers can be modelled by a smaller representation
(subspace) in a model than outliers. By exploiting this, we propose Clean
Subspace Variational Autoencoder (CLSVAE), a novel semi-supervised model for
detection and automated repair of systematic errors. The main idea is to
partition the latent space and model inlier and outlier patterns separately.
CLSVAE is effective with much less labelled data compared to previous related
models, often with less than 2% of the data. We provide experiments using three
image datasets in scenarios with different levels of corruption and labelled
set sizes, comparing to relevant baselines. CLSVAE provides superior repairs
without human intervention, e.g. with just 0.25% of labelled data we see a
relative error decrease of 58% compared to the closest baseline.
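The abstract's main idea, partitioning the latent space so that inliers are explained by a smaller "clean" subspace while systematic errors live in a separate "dirty" subspace, can be illustrated with a minimal sketch. This is not the paper's implementation: the toy dimensions, the deterministic linear encoder/decoder, and the repair-by-zeroing rule are all illustrative assumptions standing in for a trained CLSVAE.

```python
import numpy as np

rng = np.random.default_rng(0)

D, K_CLEAN, K_DIRTY = 16, 4, 2  # toy data/subspace sizes (illustrative only)
W_enc = rng.normal(size=(K_CLEAN + K_DIRTY, D)) * 0.1
W_dec = rng.normal(size=(D, K_CLEAN + K_DIRTY)) * 0.1

def encode(x):
    """Toy deterministic encoder; a real VAE outputs a mean and variance."""
    return W_enc @ x

def decode(z):
    """Toy linear decoder mapping the full latent vector back to data space."""
    return W_dec @ z

def repair(x):
    """Decode with the dirty subspace zeroed, keeping only clean patterns."""
    z = encode(x)
    z_repaired = z.copy()
    z_repaired[K_CLEAN:] = 0.0  # suppress the systematic-error dimensions
    return decode(z_repaired)

x = rng.normal(size=D)      # a (possibly corrupted) instance
x_hat = repair(x)           # its repaired reconstruction
```

Under this sketch, detection and repair share one mechanism: an instance whose reconstruction changes a lot when the dirty dimensions are zeroed is likely a systematic outlier, and the zeroed-out reconstruction is the proposed repair.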
Related papers
- Regularized Contrastive Partial Multi-view Outlier Detection [76.77036536484114]
We propose a novel method named Regularized Contrastive Partial Multi-view Outlier Detection (RCPMOD).
In this framework, we utilize contrastive learning to learn view-consistent information and distinguish outliers by the degree of consistency.
Experimental results on four benchmark datasets demonstrate that our proposed approach could outperform state-of-the-art competitors.
arXiv Detail & Related papers (2024-08-02T14:34:27Z)
- Diffusion-based Image Generation for In-distribution Data Augmentation in Surface Defect Detection [8.93281936150572]
We show that diffusion models can be used in industrial scenarios to improve the data augmentation procedure.
We propose a novel approach for data augmentation that mixes out-of-distribution with in-distribution samples.
arXiv Detail & Related papers (2024-06-01T17:09:18Z)
- Verifix: Post-Training Correction to Improve Label Noise Robustness with Verified Samples [9.91998873101083]
Post-Training Correction adjusts model parameters after initial training to mitigate label noise.
We introduce Verifix, a novel algorithm that leverages a small, verified dataset to correct the model weights using a single update.
Experiments on the CIFAR dataset with 25% synthetic corruption show 7.36% generalization improvements on average.
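The idea of correcting a noisily trained model with a single update computed from a small verified set can be sketched on a toy linear model. This is a generic single-step correction under an assumed squared-error objective, not Verifix's actual update rule; the model, learning rate, and data are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

w = rng.normal(size=3)                       # weights after noisy training
X_ver = rng.normal(size=(8, 3))              # small verified dataset
y_ver = X_ver @ np.array([1.0, -2.0, 0.5])   # verified (clean) targets

def single_correction_step(w, X, y, lr=0.1):
    """One gradient step of mean squared error on the verified samples only."""
    grad = 2.0 / len(X) * X.T @ (X @ w - y)
    return w - lr * grad

w_corrected = single_correction_step(w, X_ver, y_ver)
```

The appeal of a post-training correction like this is cost: no access to the original (noisy) training data is needed, only the small verified set and one pass over it.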
arXiv Detail & Related papers (2024-03-13T15:32:08Z)
- LARA: A Light and Anti-overfitting Retraining Approach for Unsupervised Time Series Anomaly Detection [49.52429991848581]
We propose a Light and Anti-overfitting Retraining Approach (LARA) for deep variational auto-encoder (VAE) based time series anomaly detection methods.
This work makes three novel contributions: 1) the retraining process is formulated as a convex problem, which converges at a fast rate and prevents overfitting; 2) a ruminate block is designed that leverages historical data without the need to store it; and 3) it is proven mathematically that, when fine-tuning the latent vector and reconstructed data, linear formulations achieve the least adjustment error between the ground truths and the fine-tuned values.
arXiv Detail & Related papers (2023-10-09T12:36:16Z)
- Data Contamination: From Memorization to Exploitation [5.997909991352044]
It is not clear to what extent models exploit contaminated data for downstream tasks.
We pretrain BERT models on joint corpora of Wikipedia and labeled downstream datasets, and fine-tune them on the relevant task.
Experiments with two models and three downstream tasks show that exploitation exists in some cases; in others, the models memorize the contaminated data but do not exploit it.
arXiv Detail & Related papers (2022-03-15T20:37:16Z)
- Y-GAN: Learning Dual Data Representations for Efficient Anomaly Detection [0.0]
We propose a novel reconstruction-based model for anomaly detection, called Y-GAN.
The model consists of a Y-shaped auto-encoder and represents images in two separate latent spaces.
arXiv Detail & Related papers (2021-09-28T20:17:04Z)
- Efficient remedies for outlier detection with variational autoencoders [8.80692072928023]
Likelihoods computed by deep generative models are a candidate metric for outlier detection with unlabeled data.
We show that a theoretically-grounded correction readily ameliorates a key bias with VAE likelihood estimates.
We also show that the variance of the likelihoods computed over an ensemble of VAEs also enables robust outlier detection.
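The ensemble-variance idea in this summary can be sketched independently of any trained model: score each point by the variance of its log-likelihood across M independently trained VAEs, and flag points the models disagree on. A minimal sketch, where the toy log-likelihoods and the threshold are illustrative assumptions (training the VAEs themselves is omitted):

```python
import numpy as np

def ensemble_outlier_scores(loglik_matrix):
    """loglik_matrix: shape (M, N), log-likelihoods of N points under M VAEs.

    Returns one score per point: the across-model variance. High variance
    means the ensemble disagrees about the point, suggesting an outlier.
    """
    return np.var(loglik_matrix, axis=0)

# Toy log-likelihoods: 3 models, 4 points; the last point shows strong
# disagreement between models and should be flagged.
loglik = np.array([[-1.0, -1.1, -0.9, -5.0],
                   [-1.1, -1.0, -1.0, -1.2],
                   [-0.9, -1.2, -1.1, -9.5]])

scores = ensemble_outlier_scores(loglik)
flagged = scores > 1.0  # assumed threshold, chosen for this toy example
```

In practice the threshold would be set on a validation set or via a quantile of the observed score distribution rather than fixed a priori.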
arXiv Detail & Related papers (2021-08-19T16:00:58Z)
- Examining and Combating Spurious Features under Distribution Shift [94.31956965507085]
We define and analyze robust and spurious representations using the information-theoretic concept of minimal sufficient statistics.
We prove that even when there is only bias of the input distribution, models can still pick up spurious features from their training data.
Inspired by our analysis, we demonstrate that group DRO can fail when groups do not directly account for various spurious correlations.
arXiv Detail & Related papers (2021-06-14T05:39:09Z)
- Variational Bayesian Unlearning [54.26984662139516]
We study the problem of approximately unlearning a Bayesian model from a small subset of the training data to be erased.
We show that it is equivalent to minimizing an evidence upper bound which trades off between fully unlearning from erased data vs. not entirely forgetting the posterior belief.
In model training with VI, only an approximate (instead of exact) posterior belief given the full data can be obtained, which makes unlearning even more challenging.
arXiv Detail & Related papers (2020-10-24T11:53:00Z)
- Salvage Reusable Samples from Noisy Data for Robust Learning [70.48919625304]
We propose a reusable sample selection and correction approach, termed CRSSC, for coping with label noise when training deep fine-grained (FG) models with web images.
Our key idea is to additionally identify and correct reusable samples, and then leverage them together with clean examples to update the networks.
arXiv Detail & Related papers (2020-08-06T02:07:21Z)
- TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.