Fast and fully-automated histograms for large-scale data sets
- URL: http://arxiv.org/abs/2212.13524v1
- Date: Tue, 27 Dec 2022 15:37:10 GMT
- Title: Fast and fully-automated histograms for large-scale data sets
- Authors: Valentina Zelaya Mendizábal (SAMM), Marc Boullé, Fabrice Rossi (CEREMADE)
- Abstract summary: G-Enum histograms are a new fast and fully automated method for irregular histogram construction.
They leverage the Minimum Description Length principle to derive two different model selection criteria.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: G-Enum histograms are a new fast and fully automated method for irregular
histogram construction. By framing histogram construction as a density
estimation problem and its automation as a model selection task, these
histograms leverage the Minimum Description Length principle (MDL) to derive
two different model selection criteria. Several proven theoretical results
about these criteria give insights about their asymptotic behavior and are used
to speed up their optimisation. These insights, combined with a greedy search
heuristic, are used to construct histograms in linearithmic time rather than
the polynomial time incurred by previous works. The capabilities of the
proposed MDL density estimation method are illustrated with reference to other
fully automated methods in the literature, both on synthetic and large
real-world data sets.
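The general idea in the abstract (score a candidate histogram with a fit-plus-complexity criterion, then search greedily instead of exhaustively) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the function names `bin_cost` and `greedy_histogram`, the regular elementary grid of 64 bins, and the BIC-like penalty of `0.5 * log(n)` per bin are all hypothetical stand-ins, and the two MDL criteria derived in the paper are not reproduced here. The merge loop below is also a naive quadratic scan; a priority queue over candidate merges would be the usual route to linearithmic time.

```python
import math
import random


def bin_cost(count, width, n):
    """Negative log-likelihood of one bin under a piecewise-constant density."""
    if count == 0:
        return 0.0
    return -count * math.log(count / (n * width))


def greedy_histogram(data, n_elementary=64):
    """Greedy bottom-up merging of adjacent bins under a fit + penalty cost.

    ASSUMPTION: the penalty of 0.5 * log(n) per bin is a BIC-like stand-in
    chosen for illustration; it is NOT the G-Enum criterion from the paper.
    """
    n = len(data)
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("data must contain at least two distinct values")
    step = (hi - lo) / n_elementary

    # Count points in each elementary bin of a fine regular grid.
    counts = [0] * n_elementary
    for x in data:
        idx = min(int((x - lo) / step), n_elementary - 1)
        counts[idx] += 1

    # Each bin is stored as [left_edge, right_edge, count].
    bins = [[lo + i * step, lo + (i + 1) * step, c] for i, c in enumerate(counts)]
    penalty = 0.5 * math.log(n)

    while len(bins) > 1:
        best_gain, best_i = 0.0, None
        for i in range(len(bins) - 1):
            a, b = bins[i], bins[i + 1]
            # Merging removes one bin, so it saves one penalty term
            # but can only worsen (or preserve) the likelihood fit.
            separate = bin_cost(a[2], a[1] - a[0], n) + bin_cost(b[2], b[1] - b[0], n)
            merged = bin_cost(a[2] + b[2], b[1] - a[0], n)
            gain = separate + penalty - merged
            if gain > best_gain:
                best_gain, best_i = gain, i
        if best_i is None:
            break  # no merge lowers the total cost
        a, b = bins[best_i], bins[best_i + 1]
        bins[best_i:best_i + 2] = [[a[0], b[1], a[2] + b[2]]]
    return bins


if __name__ == "__main__":
    random.seed(0)
    # Synthetic sample: a uniform background plus a dense region near zero.
    sample = [random.random() for _ in range(5000)]
    sample += [random.random() * 0.1 for _ in range(5000)]
    for left, right, count in greedy_histogram(sample):
        print(f"[{left:.3f}, {right:.3f}): {count}")
```

The search stops as soon as no adjacent merge reduces the total cost, so the number of bins is chosen by the criterion rather than by a user parameter, which is the sense in which such a procedure is "fully automated".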
Related papers
- Align Your Steps: Optimizing Sampling Schedules in Diffusion Models [63.927438959502226]
Diffusion models (DMs) have established themselves as the state-of-the-art generative modeling approach in the visual domain and beyond.
A crucial drawback of DMs is their slow sampling speed, relying on many sequential function evaluations through large neural networks.
We propose a general and principled approach to optimizing the sampling schedules of DMs for high-quality outputs.
arXiv Detail & Related papers (2024-04-22T18:18:41Z)
- Minimally Supervised Learning using Topological Projections in Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data; a minimal number of available labeled data points are then assigned to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
- Maximum Independent Set: Self-Training through Dynamic Programming [56.670639478539485]
This work presents a graph neural network (GNN) framework for solving the maximum independent set (MIS) problem, inspired by dynamic programming (DP).
We propose a DP-like recursive algorithm based on GNNs that firstly constructs two smaller sub-graphs, predicts the one with the larger MIS, and then uses it in the next recursion call.
Annotating comparisons of different graphs concerning their MIS size leads to a self-training process that results in more accurate self-annotation of the comparisons and vice versa.
arXiv Detail & Related papers (2023-10-28T10:58:25Z)
- RGM: A Robust Generalizable Matching Model [49.60975442871967]
We propose a deep model for sparse and dense matching, termed RGM (Robust Generalist Matching).
To narrow the gap between synthetic training samples and real-world scenarios, we build a new, large-scale dataset with sparse correspondence ground truth.
We are able to mix up various dense and sparse matching datasets, significantly improving the training diversity.
arXiv Detail & Related papers (2023-10-18T07:30:08Z)
- Learning the hub graphical Lasso model with the structured sparsity via an efficient algorithm [1.0923877073891446]
We introduce a two-phase algorithm to estimate hub graphical models.
The proposed algorithm first generates a good initial point via a dual alternating direction method of multipliers.
It then warm-starts a semismooth Newton (SSN) based augmented Lagrangian method (ALM) to compute a solution that is accurate enough for practical tasks.
arXiv Detail & Related papers (2023-08-17T08:24:28Z)
- MADS: Modulated Auto-Decoding SIREN for time series imputation [9.673093148930874]
We propose MADS, a novel auto-decoding framework for time series imputation, built upon implicit neural representations.
We evaluate our model on two real-world datasets, and show that it outperforms state-of-the-art methods for time series imputation.
arXiv Detail & Related papers (2023-07-03T09:08:47Z)
- Two-level histograms for dealing with outliers and heavy tail distributions [0.0]
We focus on the G-Enum histogram method, which exploits the Minimum Description Length (MDL) principle to build histograms without any user parameter.
We investigate the limits of this method in the case of outliers or heavy-tailed distributions.
The first level exploits a logarithmic transformation of the data to split the data set into a list of data subsets with a controlled range of values.
The second level builds a sub-histogram for each data subset and aggregates them to obtain a complete histogram.
arXiv Detail & Related papers (2023-06-09T09:57:18Z)
- Integrating Transformer and Autoencoder Techniques with Spectral Graph Algorithms for the Prediction of Scarcely Labeled Molecular Data [2.8360662552057323]
This work introduces three graph-based models incorporating Merriman-Bence-Osher (MBO) techniques to tackle this challenge.
Specifically, graph-based modifications of the MBO scheme are integrated with state-of-the-art techniques, including a home-made transformer and an autoencoder.
The proposed models are validated using five benchmark data sets.
arXiv Detail & Related papers (2022-11-12T22:45:32Z)
- Imputing Missing Observations with Time Sliced Synthetic Minority Oversampling Technique [0.3973560285628012]
We present a simple yet novel time series imputation technique with the goal of constructing an irregular time series that is uniform across every sample in a data set.
We fix a grid defined by the midpoints of non-overlapping bins (dubbed "slices") of observation times and ensure that each sample has values for all of the features at that given time.
This allows one to both impute fully missing observations to allow uniform time series classification across the entire data and, in special cases, to impute individually missing features.
arXiv Detail & Related papers (2022-01-14T19:23:24Z)
- Auto-decoding Graphs [91.3755431537592]
The generative model is an auto-decoder that learns to synthesize graphs from latent codes.
Graphs are synthesized using self-attention modules that are trained to identify likely connectivity patterns.
arXiv Detail & Related papers (2020-06-04T14:23:01Z)
- Block-Approximated Exponential Random Graphs [77.4792558024487]
An important challenge in the field of exponential random graphs (ERGs) is the fitting of non-trivial ERGs on large graphs.
We propose an approximate framework for such non-trivial ERGs that results in dyadic independence (i.e., edge-independent) distributions.
Our methods are scalable to sparse graphs consisting of millions of nodes.
arXiv Detail & Related papers (2020-02-14T11:42:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.