A sub-sampling algorithm preventing outliers
- URL: http://arxiv.org/abs/2208.06218v1
- Date: Fri, 12 Aug 2022 11:03:57 GMT
- Title: A sub-sampling algorithm preventing outliers
- Authors: L. Deldossi and E. Pesce and C. Tommasi
- Abstract summary: We propose an unsupervised exchange procedure that enables us to select a nearly D-optimal subset of observations without high leverage points.
We also provide a supervised version of this exchange procedure in which, besides high leverage points, outliers in the responses are also avoided.
Both the unsupervised and the supervised selection procedures are generalized to I-optimality, with the goal of getting accurate predictions.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Nowadays, massive data are available in many different fields, and for
several reasons it might be convenient to analyze just a subset of the data.
The D-optimality criterion can be helpful for optimally selecting a subsample
of observations. However, it is well known that D-optimal support points lie
on the boundary of the design space, and if they coincide with extreme
response values, they can severely influence the estimated linear model
(leverage points with high influence). To overcome this problem, we first
propose an unsupervised exchange procedure that enables us to select a nearly
D-optimal subset of observations without high leverage values. We then provide
a supervised version of this exchange procedure in which, besides high
leverage points, outliers in the responses (those not associated with high
leverage points) are also avoided. This is possible because, unlike other
design situations, in subsampling from big datasets the response values may be
available.
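The abstract describes the exchange procedure only at this level of detail, so the following is a minimal sketch, assuming a Fedorov-style random single-point exchange, a log-determinant D-criterion, and a fixed leverage cap `h_max`; the function name, threshold, and update rule are our own illustrative choices, not the authors' algorithm. (A possible response-outlier screen for the supervised variant is sketched below, after the I-optimality paragraph.)

```python
# Minimal sketch of an exchange-type selection of a nearly D-optimal
# subsample that rejects swaps introducing high-leverage points.
# Assumptions (ours, not the paper's): random single-point exchanges,
# a fixed leverage cap h_max, and log-det as the D-criterion.
import numpy as np

def d_optimal_exchange(X, k, h_max=0.2, n_iter=500, seed=0):
    """Pick k rows of X roughly maximizing log det(X_s' X_s),
    while keeping every subsample leverage below h_max."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    S = rng.choice(n, size=k, replace=False)           # random start
    best = np.linalg.slogdet(X[S].T @ X[S])[1]
    for _ in range(n_iter):
        i = rng.choice(S)                              # candidate to drop
        j = rng.choice(np.setdiff1d(np.arange(n), S))  # candidate to add
        S_new = np.append(S[S != i], j)
        M = X[S_new].T @ X[S_new]
        sign, logdet = np.linalg.slogdet(M)
        if sign <= 0 or logdet <= best:
            continue                                   # not an improvement
        # Leverages of the tentative subsample: diag(X_s M^{-1} X_s')
        h = np.einsum('ij,ji->i', X[S_new], np.linalg.solve(M, X[S_new].T))
        if h.max() > h_max:
            continue                                   # would admit a high-leverage point
        S, best = S_new, logdet
    return S
```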
Finally, both the unsupervised and the supervised selection procedures are
generalized to I-optimality, with the goal of getting accurate predictions.
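As hedged companions to the sketch above: a studentized-residual screen is one standard way the supervised variant might exclude response outliers, and the I-criterion (average prediction variance, trace(A M^{-1})) can replace the log-determinant inside the same exchange loop. The cut-off `t_max` and the choice `A = X'X / n` are our assumptions, not the paper's exact rules.

```python
# Hedged companions to the exchange sketch: a response-outlier screen
# for the supervised variant, and an I-criterion that can replace the
# log-det objective. t_max and A = X'X / n are illustrative assumptions.
import numpy as np

def keep_mask(X, y, t_max=3.0):
    """True for rows whose (internally) studentized residual is modest."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    h = np.einsum('ij,ji->i', X, np.linalg.solve(X.T @ X, X.T))  # leverages
    s2 = r @ r / (len(y) - X.shape[1])
    t = r / np.sqrt(s2 * (1.0 - h))
    return np.abs(t) <= t_max

def i_criterion(X_sub, A):
    """trace(A M^{-1}); smaller means lower average prediction variance."""
    return np.trace(A @ np.linalg.inv(X_sub.T @ X_sub))

# Usage sketch: A = X.T @ X / len(X); run the exchange only over
# candidates np.where(keep_mask(X, y))[0], minimizing i_criterion.
```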
Related papers
- Optimization Can Learn Johnson Lindenstrauss Embeddings [30.652854230884145]
Randomized methods like Johnson-Lindenstrauss (JL) provide unimprovable theoretical guarantees for achieving such representations.
We present a novel method motivated by diffusion models, that circumvents this fundamental challenge.
We show that by moving through this larger space, the objective converges to a deterministic (zero variance) solution, avoiding bad stationary points.
arXiv Detail & Related papers (2024-12-10T07:07:04Z)
- TAROT: Targeted Data Selection via Optimal Transport [64.56083922130269]
TAROT is a targeted data selection framework grounded in optimal transport theory.
Previous targeted data selection methods rely on influence-based greedy heuristics to enhance domain-specific performance.
We evaluate TAROT across multiple tasks, including semantic segmentation, motion prediction, and instruction tuning.
arXiv Detail & Related papers (2024-11-30T10:19:51Z)
- Rethinking Data Selection at Scale: Random Selection is Almost All You Need [39.14807071480125]
Supervised fine-tuning is crucial for aligning Large Language Models with human instructions.
Most existing data selection techniques are designed for small-scale data pools.
arXiv Detail & Related papers (2024-10-12T02:48:34Z)
- Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization [60.176008034221404]
Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences.
Prior work has observed that the likelihood of preferred responses often decreases during training.
We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning.
arXiv Detail & Related papers (2024-10-11T14:22:44Z)
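To make the displacement effect concrete, here is a toy numerical sketch of the standard DPO objective (Rafailov et al., 2023), not this paper's analysis: the loss depends only on the margin between the policy's two log-ratios relative to the reference, so it can keep decreasing even while the preferred response's own likelihood falls. All numbers below are made up.

```python
# DPO loss: -log sigmoid(beta * ((logpi_w - logref_w) - (logpi_l - logref_l)))
import numpy as np

def dpo_loss(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    margin = (logp_w - ref_w) - (logp_l - ref_l)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))

# The preferred log-likelihood drops (-10 -> -14), yet the loss improves
# because the margin over the dispreferred response grows (1 -> 5):
print(dpo_loss(-10.0, -12.0, -9.0, -10.0))  # ~0.644
print(dpo_loss(-14.0, -20.0, -9.0, -10.0))  # ~0.474
```
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]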
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
- Rethinking Missing Data: Aleatoric Uncertainty-Aware Recommendation [59.500347564280204]
We propose a new Aleatoric Uncertainty-aware Recommendation (AUR) framework.
AUR consists of a new uncertainty estimator along with a normal recommender model.
As the chance of mislabeling reflects the potential of a pair, AUR makes recommendations according to the uncertainty.
arXiv Detail & Related papers (2022-09-22T04:32:51Z)
- AdAUC: End-to-end Adversarial AUC Optimization Against Long-tail Problems [102.95119281306893]
We present an early trial to explore adversarial training methods to optimize AUC.
We reformulate the AUC optimization problem as a saddle point problem, where the objective becomes an instance-wise function.
Our analysis differs from the existing studies since the algorithm is asked to generate adversarial examples by calculating the gradient of a min-max problem.
arXiv Detail & Related papers (2022-06-24T09:13:39Z)
- Efficient SVDD Sampling with Approximation Guarantees for the Decision Boundary [7.251418581794502]
Support Vector Data Description (SVDD) is a popular one-class classifier for anomaly and novelty detection.
Despite its effectiveness, SVDD does not scale well with data size.
In this article, we study how to select a sample in light of these considerations.
Our approach is to frame SVDD sampling as an optimization problem, where constraints guarantee that sampling indeed approximates the original decision boundary.
arXiv Detail & Related papers (2020-09-29T08:28:01Z)
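As a rough, assumption-laden illustration of the boundary-preservation question (a naive random-sample baseline using scikit-learn's OneClassSVM, not the paper's constrained optimization):

```python
# Compare the decision boundary learned on the full data with one
# learned on a small random sample; the disagreement rate is a crude
# proxy for how well the sample approximates the original boundary.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
idx = rng.choice(len(X), size=200, replace=False)  # naive random sample

full = OneClassSVM(kernel="rbf", nu=0.05).fit(X)
samp = OneClassSVM(kernel="rbf", nu=0.05).fit(X[idx])

print(np.mean(full.predict(X) != samp.predict(X)))  # boundary disagreement
```
- Consistent and Flexible Selectivity Estimation for High-Dimensional Data [23.016360687961193]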
We propose a new deep learning-based model that learns a query-dependent piecewise linear function as a selectivity estimator.
We show that the proposed model consistently outperforms state-of-the-art models in accuracy in an efficient way.
arXiv Detail & Related papers (2020-05-20T08:24:53Z)
- A Support Detection and Root Finding Approach for Learning High-dimensional Generalized Linear Models [10.103666349083165]
We develop a support detection and root finding procedure to learn high-dimensional sparse generalized linear models.
We conduct simulations and real data analysis to illustrate the advantages of our proposed method over several existing methods.
arXiv Detail & Related papers (2020-01-16T14:35:17Z)
- Supervised Hyperalignment for multi-subject fMRI data alignment [81.8694682249097]
This paper proposes a Supervised Hyperalignment (SHA) method to ensure better functional alignment for MVP analysis.
Experiments on multi-subject datasets demonstrate that the SHA method achieves up to 19% better performance for multi-class problems.
arXiv Detail & Related papers (2020-01-09T09:17:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.