Unsupervised Discretization by Two-dimensional MDL-based Histogram
- URL: http://arxiv.org/abs/2006.01893v3
- Date: Mon, 18 Jul 2022 14:54:14 GMT
- Title: Unsupervised Discretization by Two-dimensional MDL-based Histogram
- Authors: Lincen Yang, Mitra Baratchi, and Matthijs van Leeuwen
- Abstract summary: Unsupervised discretization is a crucial step in many knowledge discovery tasks.
We propose an expressive model class that allows for far more flexible partitions of two-dimensional data.
We introduce an algorithm, named PALM, which Partitions each dimension ALternately and then Merges neighboring regions.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unsupervised discretization is a crucial step in many knowledge discovery
tasks. The state-of-the-art method for one-dimensional data infers locally
adaptive histograms using the minimum description length (MDL) principle, but
the multi-dimensional case is far less studied: current methods consider the
dimensions one at a time (if not independently), which results in
discretizations based on rectangular cells of adaptive size. Unfortunately,
this approach is unable to adequately characterize dependencies among
dimensions and/or results in discretizations consisting of more cells (or bins)
than is desirable.
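For context, the one-dimensional MDL approach mentioned above selects bin boundaries by minimizing a code length. The following is a generic two-part form of such a code length, shown only to illustrate the principle; the paper's actual criterion uses the refined NML form described below, not this crude two-part code.

```latex
% Generic two-part MDL code length for a k-bin histogram with bin
% counts c_i, bin widths w_i, and n data points (illustrative only;
% refined MDL replaces the parameter-cost term with the model class's
% exact parametric complexity):
\[
  L(x^n, M) \;=\;
  -\sum_{i=1}^{k} c_i \log_2 \frac{c_i}{n\, w_i}
  \;+\; \frac{k-1}{2} \log_2 n ,
\]
% and the selected histogram minimizes L(x^n, M) over candidate cut points.
```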
To address this problem, we propose an expressive model class that allows for
far more flexible partitions of two-dimensional data. We extend the state of
the art for the one-dimensional case to obtain a model selection problem based
on the normalized maximum likelihood, a form of refined MDL. As the flexibility
of our model class comes at the cost of a vast search space, we introduce a
heuristic algorithm, named PALM, which Partitions each dimension ALternately
and then Merges neighboring regions, all using the MDL principle. Experiments
on synthetic data show that PALM 1) accurately reveals ground truth partitions
that are within the model class (i.e., the search space), given a large enough
sample size; 2) approximates well a wide range of partitions outside the model
class; 3) converges, in contrast to the state-of-the-art multivariate
discretization method IPD. Finally, we apply our algorithm to three spatial
datasets, and we demonstrate that, compared to kernel density estimation (KDE),
our algorithm not only reveals more detailed density changes, but also fits
unseen data better, as measured by the log-likelihood.
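For intuition, below is a minimal, hypothetical sketch of the kind of MDL-guided splitting that PALM performs along a single dimension. It is not the authors' implementation: it scores candidate cuts with the simple two-part MDL code above instead of the paper's refined NML criterion, and it omits the alternation between the two dimensions and the merge phase; all function names are illustrative.

```python
# Hypothetical sketch (not the authors' code): greedy MDL-guided splitting
# of one dimension. PALM applies such splits alternately along x and y
# within each region and then merges neighboring regions; here we use a
# simple two-part MDL score in place of the refined NML criterion.
import numpy as np

def two_part_mdl(counts, widths, n):
    """Code length of a histogram: data cost under the ML density
    f_i = c_i / (n * w_i), plus ~(k-1)/2 * log2(n) bits for the k-1
    free parameters (a two-part MDL approximation, not NML)."""
    k = len(counts)
    nz = counts > 0
    nll = -np.sum(counts[nz] * np.log2(counts[nz] / (n * widths[nz])))
    return nll + 0.5 * (k - 1) * np.log2(n)

def greedy_cuts(values, n_candidates=20, max_bins=10):
    """Greedily insert the candidate cut that most reduces the total
    code length; stop when no insertion helps (an MDL stopping rule)."""
    n = len(values)
    quantiles = np.linspace(0, 1, n_candidates + 2)[1:-1]
    candidates = list(np.quantile(values, quantiles))
    edges = [float(values.min()), float(values.max())]
    score = two_part_mdl(np.histogram(values, bins=edges)[0],
                         np.diff(edges), n)
    while len(edges) - 1 < max_bins:
        best_cut, best_score = None, score
        for c in candidates:
            trial = sorted(set(edges + [c]))
            s = two_part_mdl(np.histogram(values, bins=trial)[0],
                             np.diff(trial), n)
            if s < best_score:
                best_cut, best_score = c, s
        if best_cut is None:
            break  # no cut reduces the code length: MDL says stop
        edges = sorted(set(edges + [best_cut]))
        candidates.remove(best_cut)
        score = best_score
    return edges

# Example: two well-separated Gaussian clusters in one dimension.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 0.5, 500),
                       rng.normal(3.0, 1.0, 500)])
print(greedy_cuts(data))  # bin edges adapt to the local density
```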
Related papers
- Latent Semantic Consensus For Deterministic Geometric Model Fitting
We propose an effective method called Latent Semantic Consensus (LSC).
LSC formulates the model fitting problem in two latent semantic spaces, based on data points and model hypotheses.
LSC is able to provide consistent and reliable solutions within only a few milliseconds for general multi-structural model fitting.
arXiv Detail & Related papers (2024-03-11T05:35:38Z)
- Data-free Weight Compress and Denoise for Large Language Models
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices.
We prune 80% of the model's parameters while retaining 93.43% of the original performance, without any calibration data.
arXiv Detail & Related papers (2024-02-26T05:51:47Z)
- Sample Complexity Characterization for Linear Contextual MDPs
Contextual Markov decision processes (CMDPs) describe a class of reinforcement learning problems in which the transition kernels and reward functions can change over time, with different MDPs indexed by a context variable.
CMDPs serve as an important framework to model many real-world applications with time-varying environments.
We study CMDPs under two linear function approximation models: Model I with context-varying representations and common linear weights for all contexts; and Model II with common representations for all contexts and context-varying linear weights.
arXiv Detail & Related papers (2024-02-05T03:25:04Z)
- Optimal Discriminant Analysis in High-Dimensional Latent Factor Models
In high-dimensional classification problems, a commonly used approach is to first project the high-dimensional features into a lower dimensional space.
We formulate a latent-variable model with a hidden low-dimensional structure to justify this two-step procedure.
We propose a computationally efficient classifier that takes certain principal components (PCs) of the observed features as projections.
arXiv Detail & Related papers (2022-10-23T21:45:53Z)
- Laplacian-based Cluster-Contractive t-SNE for High Dimensional Data Visualization
We propose LaptSNE, a new graph-based dimensionality reduction method based on t-SNE.
Specifically, LaptSNE leverages the eigenvalue information of the graph Laplacian to shrink the potential clusters in the low-dimensional embedding.
We show how to calculate the gradient analytically, which may be of broad interest when considering optimization with a Laplacian-composited objective.
arXiv Detail & Related papers (2022-07-25T14:10:24Z)
- Robust Multi-view Registration of Point Sets with Laplacian Mixture Model
We propose a novel probabilistic generative method to align multiple point sets based on the heavy-tailed Laplacian distribution.
We demonstrate the advantages of our method by comparing it with representative state-of-the-art approaches on challenging benchmark data sets.
arXiv Detail & Related papers (2021-10-26T14:49:09Z)
- Manifold Topology Divergence: a Framework for Comparing Data Manifolds
We develop a framework for comparing data manifolds, aimed at the evaluation of deep generative models.
Based on the Cross-Barcode, we introduce the Manifold Topology Divergence score (MTop-Divergence).
We demonstrate that the MTop-Divergence accurately detects various degrees of mode-dropping, intra-mode collapse, mode invention, and image disturbance.
arXiv Detail & Related papers (2021-06-08T00:30:43Z)
- The classification for High-dimension low-sample size data
We propose a novel classification criterion for HDLSS data, tolerance, which emphasizes similarity of within-class variance on the premise of class separability.
According to this criterion, a novel linear binary classifier is designed, denoted by No-separated Data Dispersion Maximum (NPDMD).
NPDMD has several advantageous characteristics compared to state-of-the-art classification methods.
arXiv Detail & Related papers (2020-06-21T07:04:16Z)
- Dense Non-Rigid Structure from Motion: A Manifold Viewpoint
The Non-Rigid Structure-from-Motion (NRSfM) problem aims to recover the 3D geometry of a deforming object from its 2D feature correspondences across multiple frames.
We show that our approach significantly improves accuracy, scalability, and robustness against noise.
arXiv Detail & Related papers (2020-06-15T09:15:54Z)
- Multi-Objective Matrix Normalization for Fine-grained Visual Recognition
Bilinear pooling achieves great success in fine-grained visual recognition (FGVC).
Recent methods have shown that the matrix power normalization can stabilize the second-order information in bilinear features.
We propose an efficient Multi-Objective Matrix Normalization (MOMN) method that normalizes a bilinear representation with respect to several objectives simultaneously.
arXiv Detail & Related papers (2020-03-30T08:40:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.