Markov subsampling based Huber Criterion
- URL: http://arxiv.org/abs/2112.06134v1
- Date: Sun, 12 Dec 2021 03:11:23 GMT
- Title: Markov subsampling based Huber Criterion
- Authors: Tieliang Gong and Yuxin Dong and Hong Chen and Bo Dong and Chen Li
- Abstract summary: Subsampling is an important technique to tackle the computational challenges brought by big data.
We design a new Markov subsampling strategy based on Huber criterion (HMS) to construct an informative subset from the noisy full data.
HMS is built upon a Metropolis-Hastings procedure, where the inclusion probability of each sampling unit is determined by the Huber criterion.
Under mild conditions, we show that the estimator based on the subsamples selected by HMS is statistically consistent with a sub-Gaussian deviation bound.
- Score: 13.04847430878172
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Subsampling is an important technique to tackle the computational challenges
brought by big data. Many subsampling procedures fall within the framework of
importance sampling, which assigns high sampling probabilities to the samples
appearing to have big impacts. When the noise level is high, those sampling
procedures tend to pick many outliers and thus often do not perform
satisfactorily in practice. To tackle this issue, we design a new Markov
subsampling strategy based on Huber criterion (HMS) to construct an informative
subset from the noisy full data; the constructed subset then serves as a
refined working data for efficient processing. HMS is built upon a
Metropolis-Hastings procedure, where the inclusion probability of each sampling
unit is determined using the Huber criterion to prevent over-scoring the
outliers. Under mild conditions, we show that the estimator based on the
subsamples selected by HMS is statistically consistent with a sub-Gaussian
deviation bound. The promising performance of HMS is demonstrated by extensive
studies on large scale simulations and real data examples.
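As a rough illustration of the idea described in the abstract, the following is a minimal, hypothetical Python sketch of a Metropolis-Hastings subsampler whose acceptance ratio is driven by a Huber-type score. The pilot-fit scoring rule, the uniform proposal, and all names (`huber_psi`, `hms_subsample`, `delta`) are assumptions for illustration only, not the authors' exact procedure.

```python
import numpy as np

def huber_psi(r, delta=1.345):
    # Huber score function: identity for small residuals, capped at
    # +/- delta for large ones, so gross outliers have bounded influence.
    return np.clip(r, -delta, delta)

def hms_subsample(X, y, n_sub, beta_pilot, delta=1.345, seed=None):
    """Hypothetical sketch of Huber-criterion Markov subsampling (HMS).

    A Metropolis-Hastings chain over sample indices with a symmetric
    uniform proposal: each unit's (assumed) score is the magnitude of its
    Huber score function under a pilot fit, so influential units are
    favored while outliers cannot be over-scored beyond delta.
    """
    rng = np.random.default_rng(seed)
    residuals = y - X @ beta_pilot                        # pilot-fit residuals
    scores = np.abs(huber_psi(residuals, delta)) + 1e-12  # bounded influence
    n = X.shape[0]
    current = rng.integers(n)
    selected = set()
    while len(selected) < n_sub:                          # assumes n_sub << n
        candidate = rng.integers(n)                       # symmetric proposal
        if rng.random() < min(1.0, scores[candidate] / scores[current]):
            current = candidate                           # accept the move
        selected.add(int(current))
    idx = np.fromiter(selected, dtype=int)
    return X[idx], y[idx]

# Illustrative usage: an ordinary least-squares pilot fit, after which a
# robust (e.g., Huber) regression would be run on the selected subsample.
# beta_pilot = np.linalg.lstsq(X, y, rcond=None)[0]
# X_sub, y_sub = hms_subsample(X, y, n_sub=1000, beta_pilot=beta_pilot)
```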
Related papers
- iHHO-SMOTe: A Cleansed Approach for Handling Outliers and Reducing Noise to Improve Imbalanced Data Classification [0.0]
Classifying imbalanced datasets remains a significant challenge in machine learning.
Synthetic Minority Over-sampling Technique (SMOTE) generates new instances for the under-represented minority class (a sketch of the basic interpolation step appears after this list).
A proposed approach, iHHO-SMOTe, addresses the limitations of SMOTE by first cleansing the data of noise points.
arXiv Detail & Related papers (2025-04-17T11:17:53Z)
- Enhancing Sample Utilization in Noise-Robust Deep Metric Learning With Subgroup-Based Positive-Pair Selection [84.78475642696137]
The existence of noisy labels in real-world data negatively impacts the performance of deep learning models.
We propose a noise-robust DML framework with SubGroup-based Positive-pair Selection (SGPS).
SGPS constructs reliable positive pairs for noisy samples to enhance the sample utilization.
arXiv Detail & Related papers (2025-01-19T14:41:55Z)
- ANNE: Adaptive Nearest Neighbors and Eigenvector-based Sample Selection for Robust Learning with Noisy Labels [7.897299759691143]
This paper introduces the Adaptive Nearest Neighbors and Eigenvector-based (ANNE) sample selection methodology.
ANNE integrates loss-based sampling with the feature-based sampling methods FINE and Adaptive KNN to optimize performance across a wide range of noise rate scenarios.
arXiv Detail & Related papers (2024-11-03T15:51:38Z)
- Learning with Imbalanced Noisy Data by Preventing Bias in Sample Selection [82.43311784594384]
Real-world datasets contain not only noisy labels but also class imbalance.
We propose a simple yet effective method to address noisy labels in imbalanced datasets.
arXiv Detail & Related papers (2024-02-17T10:34:53Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- BSGAN: A Novel Oversampling Technique for Imbalanced Pattern Recognitions [0.0]
Class imbalance problems (CIP) are one of the potential challenges in developing unbiased Machine Learning (ML) models for predictions.
CIP occurs when data samples are not equally distributed between two or more classes.
We propose a hybrid oversampling technique that combines the power of borderline SMOTE and Generative Adversarial Networks to generate more diverse data.
arXiv Detail & Related papers (2023-05-16T20:02:39Z)
- Towards Automated Imbalanced Learning with Deep Hierarchical Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class.
Over-sampling is an effective technique to tackle imbalanced learning through generating synthetic samples for the minority class.
We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
arXiv Detail & Related papers (2022-08-26T04:28:01Z)
- A Novel Hybrid Sampling Framework for Imbalanced Learning [0.0]
"SMOTE-RUS-NC" has been compared with other state-of-the-art sampling techniques.
Rigorous experimentation has been conducted on 26 imbalanced datasets.
arXiv Detail & Related papers (2022-08-20T07:04:00Z)
- Adaptive Sketches for Robust Regression with Importance Sampling [64.75899469557272]
We introduce data structures for solving robust regression through stochastic gradient descent (SGD).
Our algorithm effectively runs $T$ steps of SGD with importance sampling while using sublinear space and just making a single pass over the data.
arXiv Detail & Related papers (2022-07-16T03:09:30Z)
- Robust Meta-learning with Sampling Noise and Label Noise via Eigen-Reptile [78.1212767880785]
The meta-learner is prone to overfitting since there are only a few available samples.
When handling the data with noisy labels, the meta-learner could be extremely sensitive to label noise.
We present Eigen-Reptile (ER), which updates the meta-parameters with the main direction of historical task-specific parameters.
arXiv Detail & Related papers (2022-06-04T08:48:02Z)
- Noise-Resistant Deep Metric Learning with Probabilistic Instance Filtering [59.286567680389766]
Noisy labels are commonly found in real-world data, which cause performance degradation of deep neural networks.
We propose Probabilistic Ranking-based Instance Selection with Memory (PRISM) approach for DML.
PRISM calculates the probability of a label being clean, and filters out potentially noisy samples.
arXiv Detail & Related papers (2021-08-03T12:15:25Z)
- Robust M-Estimation Based Bayesian Cluster Enumeration for Real Elliptically Symmetric Distributions [5.137336092866906]
Robustly determining the optimal number of clusters in a data set is an essential factor in a wide range of applications.
This article generalizes a robust cluster enumeration criterion so that it can be used with any arbitrary Real Elliptically Symmetric (RES) distributed mixture model.
We derive a robust criterion for data sets with finite sample size, and also provide an approximation to reduce the computational cost at large sample sizes.
arXiv Detail & Related papers (2020-05-04T11:44:49Z)
- CSMOUTE: Combined Synthetic Oversampling and Undersampling Technique for Imbalanced Data Classification [1.8275108630751844]
We propose a novel data-level algorithm for handling data imbalance in the classification task, the Synthetic Majority Undersampling Technique (SMUTE).
We combine both in the Combined Synthetic Oversampling and Undersampling Technique (CSMOUTE), which integrates SMOTE oversampling with SMUTE undersampling.
The results of the conducted experimental study demonstrate the usefulness of both the SMUTE and the CSMOUTE algorithms, especially when combined with more complex classifiers.
arXiv Detail & Related papers (2020-04-07T14:03:43Z)
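Several entries above build on SMOTE (iHHO-SMOTe, BSGAN, AutoSMOTE, CSMOUTE). For background, here is a minimal sketch of the classic SMOTE interpolation step referenced from the first entry: a synthetic minority sample is drawn uniformly on the segment between a minority point and one of its k nearest minority-class neighbors. The function name and parameters are illustrative and do not come from any of the listed papers.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=None):
    """Minimal sketch of SMOTE-style interpolation (illustrative only).

    For each synthetic sample: pick a random minority point, pick one of
    its k nearest minority-class neighbors, and interpolate uniformly
    between the two.
    """
    rng = np.random.default_rng(seed)
    # Pairwise distances within the minority class (brute force for clarity).
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # exclude self as a neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]  # k nearest neighbors per point
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))          # random minority point
        j = neighbors[i, rng.integers(k)]     # one of its k neighbors
        lam = rng.random()                    # interpolation weight in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```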
This list is automatically generated from the titles and abstracts of the papers on this site.