Sliced-Wasserstein Distance-based Data Selection
- URL: http://arxiv.org/abs/2504.12918v1
- Date: Thu, 17 Apr 2025 13:07:26 GMT
- Title: Sliced-Wasserstein Distance-based Data Selection
- Authors: Julien Pallage, Antoine Lesage-Landry
- Abstract summary: We propose a new unsupervised anomaly detection method based on the sliced-Wasserstein distance. Our filtering technique is interesting for decision-making pipelines deploying machine learning models in critical sectors. We present the filtering patterns of our method on synthetic datasets and numerically benchmark our method for training data selection.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a new unsupervised anomaly detection method based on the sliced-Wasserstein distance for training data selection in machine learning approaches. Our filtering technique is interesting for decision-making pipelines deploying machine learning models in critical sectors, e.g., power systems, as it offers a conservative data selection and an optimal transport interpretation. To ensure the scalability of our method, we provide two efficient approximations. The first approximation processes reduced-cardinality representations of the datasets concurrently. The second makes use of a computationally light Euclidean distance approximation. Additionally, we release the first open dataset showcasing localized critical peak rebate demand response in a northern climate. We present the filtering patterns of our method on synthetic datasets and numerically benchmark our method for training data selection. Finally, we employ our method as part of a first forecasting benchmark for our open-source dataset.
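For intuition, the sliced-Wasserstein distance reduces high-dimensional optimal transport to sorting one-dimensional projections, which is what makes such filtering scalable. The NumPy sketch below is a standard Monte Carlo estimator of the distance, not the authors' implementation; the `n_projections` parameter and the equal-sample-size assumption are ours.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_projections=128, p=2, rng=None):
    """Monte Carlo estimate of the sliced-Wasserstein-p distance between
    empirical distributions X (n, d) and Y (m, d). For simplicity this
    sketch assumes n == m and uniform sample weights."""
    rng = np.random.default_rng(rng)
    # Draw random directions uniformly on the unit sphere.
    theta = rng.normal(size=(n_projections, X.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project both samples onto each direction and sort:
    # in 1-D, the Wasserstein-p distance couples sorted samples.
    X_proj = np.sort(X @ theta.T, axis=0)  # (n, n_projections)
    Y_proj = np.sort(Y @ theta.T, axis=0)  # (m, n_projections)
    return np.mean(np.abs(X_proj - Y_proj) ** p) ** (1.0 / p)
```

A selection filter in this spirit might, for example, discard candidate samples whose removal most reduces the distance to a trusted reference set; the abstract does not pin down the exact rule, so the sketch stays at the distance itself.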
Related papers
- A Label-Free High-Precision Residual Moveout Picking Method for Travel Time Tomography based on Deep Learning [7.081408406139507]
Residual moveout (RMO) provides critical information for travel time tomography.
Current analytical approaches do not accurately capture local saltation.
Supervised learning-based image segmentation methods for picking can effectively capture local variations.
arXiv Detail & Related papers (2025-03-08T03:27:55Z)
- Dataset Distillation as Pushforward Optimal Quantization [1.039189397779466]
We propose a simple extension of the state-of-the-art data distillation method D4M, achieving better performance on the ImageNet-1K dataset with trivial additional computation.
We demonstrate that when equipped with an encoder-decoder structure, the empirically successful disentangled methods can be reformulated as an optimal quantization problem.
In particular, we link existing disentangled dataset distillation methods to the classical optimal quantization and Wasserstein barycenter problems, demonstrating consistency of distilled datasets for diffusion-based generative priors.
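As a point of reference for the optimal-quantization view, Lloyd's algorithm is the classical way to compute an optimal quantizer of an empirical distribution. The toy NumPy sketch below is a generic illustration (names and defaults are ours), not the D4M pipeline.

```python
import numpy as np

def lloyd_quantization(X, k=10, iters=50, seed=0):
    """Lloyd's algorithm: find k codebook points minimizing the mean
    squared quantization error of the samples X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each sample to its nearest codebook point.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # Move each codebook point to the mean of its cell.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers
```

Viewed this way, a distilled dataset plays the role of a codebook chosen so that the quantized distribution stays close to the data distribution.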
arXiv Detail & Related papers (2025-01-13T20:41:52Z)
- Capturing the Temporal Dependence of Training Data Influence [100.91355498124527]
We formalize the concept of trajectory-specific leave-one-out influence, which quantifies the impact of removing a data point during training.
We propose data value embedding, a novel technique enabling efficient approximation of trajectory-specific LOO.
As data value embedding captures training data ordering, it offers valuable insights into model training dynamics.
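The definition can be made concrete by brute force: rerun the identical training trajectory with one point skipped and compare test losses. The toy below (plain SGD on least squares; all names are ours) is the expensive baseline that data value embedding is designed to approximate efficiently.

```python
import numpy as np

def sgd_trajectory(X, y, lr=0.1, epochs=20, skip=None):
    """Run SGD on least squares in a fixed data order, optionally
    skipping one training point (trajectory-specific leave-one-out)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(len(X)):
            if i == skip:
                continue
            w -= lr * (X[i] @ w - y[i]) * X[i]
    return w

def trajectory_loo_influence(X, y, x_test, y_test, i):
    """Influence of point i: test loss without i minus test loss
    with i, along the same ordered trajectory."""
    loss = lambda w: 0.5 * (x_test @ w - y_test) ** 2
    return loss(sgd_trajectory(X, y, skip=i)) - loss(sgd_trajectory(X, y))
```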
arXiv Detail & Related papers (2024-12-12T18:28:55Z)
- Sliced-Wasserstein-based Anomaly Detection and Open Dataset for Localized Critical Peak Rebates [25.452449432754698]
We present a new unsupervised anomaly (outlier) detection (AD) method using the sliced-Wasserstein metric. This filtering technique is conceptually interesting for MLOps pipelines deploying machine learning models in critical sectors.
arXiv Detail & Related papers (2024-10-29T03:54:48Z)
- Loss-Free Machine Unlearning [51.34904967046097]
We present a machine unlearning approach that is both retraining- and label-free.
Retraining-free approaches often utilise Fisher information, which is derived from the loss and therefore requires labelled data that may not be available.
We present an extension of the Selective Synaptic Dampening algorithm that approximates sensitivity with the gradient of the l2 norm of the model output in place of the diagonal of the Fisher information matrix.
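That substitution is easy to state in code: sensitivity is built from the gradient of the (squared) l2 norm of the model output rather than from a loss-based Fisher diagonal, and dampening follows the Selective-Synaptic-Dampening recipe. A hedged PyTorch sketch; the squared-gradient importance and the hyperparameters `alpha` and `lam` are illustrative choices, not the paper's settings.

```python
import torch

def label_free_sensitivity(model, x):
    """Per-parameter sensitivity from the gradient of ||f(x)||^2
    (squared norm used as an illustrative, label-free choice)."""
    out = model(x)
    params = list(model.parameters())
    grads = torch.autograd.grad(out.pow(2).sum(), params)
    names = [n for n, _ in model.named_parameters()]
    return {n: g.detach().pow(2) for n, g in zip(names, grads)}

def dampen(model, s_forget, s_retain, alpha=10.0, lam=0.5):
    """Shrink weights whose forget-set sensitivity dominates their
    retain-set sensitivity (SSD-style selective dampening)."""
    with torch.no_grad():
        for n, p in model.named_parameters():
            mask = s_forget[n] > alpha * s_retain[n]
            factor = (lam * s_retain[n] / (s_forget[n] + 1e-12)).clamp(max=1.0)
            p.mul_(torch.where(mask, factor, torch.ones_like(p)))
```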
arXiv Detail & Related papers (2024-02-29T16:15:34Z)
- Towards Free Data Selection with General-Purpose Models [71.92151210413374]
A desirable data selection algorithm can efficiently choose the most informative samples to maximize the utility of limited annotation budgets.
Current approaches, represented by active learning methods, typically follow a cumbersome pipeline that repeatedly alternates between time-consuming model training and batch data selection.
FreeSel bypasses the heavy batch selection process, achieving a significant improvement in efficiency; it is 530x faster than existing active learning methods.
arXiv Detail & Related papers (2023-09-29T15:50:14Z)
- Low-rank extended Kalman filtering for online learning of neural networks from streaming data [71.97861600347959]
We propose an efficient online approximate Bayesian inference algorithm for estimating the parameters of a nonlinear function from a potentially non-stationary data stream.
The method is based on the extended Kalman filter (EKF), but uses a novel low-rank plus diagonal decomposition of the posterior precision matrix.
In contrast to methods based on variational inference, our method is fully deterministic, and does not require step-size tuning.
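The computational payoff of such a decomposition is standard linear algebra: systems in diag(d) + W W^T can be solved in O(p r^2) time via the Woodbury identity instead of O(p^3). A minimal NumPy sketch of that step (not the paper's filter itself):

```python
import numpy as np

def woodbury_solve(d, W, b):
    """Solve (diag(d) + W @ W.T) x = b for d of shape (p,) and
    W of shape (p, r), touching only an r x r dense system."""
    Dinv_b = b / d
    Dinv_W = W / d[:, None]
    # Small r x r capacitance matrix replaces the p x p solve.
    K = np.eye(W.shape[1]) + W.T @ Dinv_W
    return Dinv_b - Dinv_W @ np.linalg.solve(K, W.T @ Dinv_b)
```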
arXiv Detail & Related papers (2023-05-31T03:48:49Z)
- Invariance Learning in Deep Neural Networks with Differentiable Laplace Approximations [76.82124752950148]
We develop a convenient gradient-based method for selecting the data augmentation.
We use a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective.
arXiv Detail & Related papers (2022-02-22T02:51:11Z)
- Fair Data Representation for Machine Learning at the Pareto Frontier [3.6052935394000234]
We propose a pre-processing algorithm for fair data representation via supervised learning.
We show that the Wasserstein-2 geodesics from the conditional (on sensitive information) distributions of the learning outcome to their barycenter characterize the frontier between the $L^2$-loss and the average pairwise Wasserstein-2 distance.
Numerical simulations underscore the advantages: (1) the pre-processing step composes with arbitrary supervised learning methods for conditional expectation estimation and applies to unseen data; (2) the fair representation protects the sensitive information by limiting the inference capability of the remaining data with respect to the sensitive data; and (3) the optimal affine
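For intuition, the Wasserstein-2 objects in this entry have closed forms in the Gaussian case, and the geodesic is the displacement interpolation; these are standard facts of optimal transport, not results of the paper:

```latex
% Wasserstein-2 distance between Gaussians (Bures metric):
W_2^2\big(\mathcal{N}(m_1,\Sigma_1),\,\mathcal{N}(m_2,\Sigma_2)\big)
  = \|m_1 - m_2\|^2
  + \operatorname{tr}\!\Big(\Sigma_1 + \Sigma_2
      - 2\big(\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2}\big)^{1/2}\Big).

% Geodesic from \mu_0 to \mu_1 with optimal transport map T
% (displacement interpolation):
\mu_t = \big((1-t)\,\mathrm{id} + t\,T\big)_{\#}\,\mu_0,
\qquad t \in [0,1].
```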
arXiv Detail & Related papers (2022-01-02T05:05:26Z)
- Deep Shells: Unsupervised Shape Correspondence with Optimal Transport [52.646396621449]
We propose a novel unsupervised learning approach to 3D shape correspondence.
We show that the proposed method significantly improves over the state-of-the-art on multiple datasets.
arXiv Detail & Related papers (2020-10-28T22:24:07Z)
- Kernel Ridge Regression Using Importance Sampling with Application to Seismic Response Prediction [1.4180331276028657]
We propose a novel landmark selection method that promotes diversity using an efficient two-step approach.
We also investigate the performance of several landmark selection techniques using a novel application of kernel methods for predicting structural responses due to earthquake load and material uncertainties.
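For context, landmark methods enter kernel ridge regression through the Nyström (subset-of-regressors) approximation. The sketch below uses uniformly sampled landmarks as a generic stand-in for the paper's two-step diversity-promoting selection; all names and defaults are ours.

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom_krr_fit(X, y, m=50, lam=1e-3, gamma=1.0, seed=0):
    """Kernel ridge regression restricted to m landmark points:
    solve (K_nm^T K_nm + lam * K_mm) a = K_nm^T y."""
    rng = np.random.default_rng(seed)
    landmarks = X[rng.choice(len(X), size=m, replace=False)]  # generic choice
    K_nm = rbf(X, landmarks, gamma)
    K_mm = rbf(landmarks, landmarks, gamma)
    a = np.linalg.solve(K_nm.T @ K_nm + lam * K_mm, K_nm.T @ y)
    return landmarks, a

def nystrom_krr_predict(landmarks, a, X_new, gamma=1.0):
    """Predict with the fitted landmark expansion."""
    return rbf(X_new, landmarks, gamma) @ a
```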
arXiv Detail & Related papers (2020-09-19T01:44:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.