PCA-Guided Quantile Sampling: Preserving Data Structure in Large-Scale Subsampling
- URL: http://arxiv.org/abs/2506.18249v1
- Date: Mon, 23 Jun 2025 02:37:05 GMT
- Title: PCA-Guided Quantile Sampling: Preserving Data Structure in Large-Scale Subsampling
- Authors: Foo Hui-Mean, Yuan-chin Ivan Chang
- Abstract summary: We introduce Principal Component Analysis guided Quantile Sampling (PCA QS), a novel sampling framework designed to preserve both the statistical and geometric structure of large-scale datasets. We show that PCA QS consistently outperforms simple random sampling, yielding better structure preservation and improved downstream model performance.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Principal Component Analysis guided Quantile Sampling (PCA QS), a novel sampling framework designed to preserve both the statistical and geometric structure of large-scale datasets. Unlike conventional PCA, which reduces dimensionality at the cost of interpretability, PCA QS retains the original feature space while using leading principal components solely to guide a quantile-based stratification scheme. This principled design ensures that sampling remains representative without distorting the underlying data semantics. We establish rigorous theoretical guarantees, deriving convergence rates for empirical quantiles, Kullback-Leibler divergence, and Wasserstein distance, thus quantifying the distributional fidelity of PCA QS samples. Practical guidelines for selecting the number of principal components, quantile bins, and sampling rates are provided based on these results. Extensive empirical studies on both synthetic and real-world datasets show that PCA QS consistently outperforms simple random sampling, yielding better structure preservation and improved downstream model performance. Together, these contributions position PCA QS as a scalable, interpretable, and theoretically grounded solution for efficient data summarization in modern machine learning workflows.
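The abstract describes the mechanism (leading principal components guide a quantile-based stratification, then rows are sampled within each stratum in the original feature space) but gives no pseudocode. Below is a minimal sketch of that idea in Python; the function name `pca_quantile_sample`, the single-component stratification, the per-bin sampling rate, and the Wasserstein comparison against simple random sampling are illustrative assumptions, not the paper's implementation.

```python
# A hedged sketch of PCA-guided quantile sampling: PCA scores define quantile
# strata, and rows are drawn within each stratum, so the subsample stays in
# the original feature space. Not the authors' code; an illustration only.
import numpy as np
from sklearn.decomposition import PCA
from scipy.stats import wasserstein_distance

def pca_quantile_sample(X, rate=0.1, n_components=1, n_bins=10, seed=0):
    """Return row indices of a stratified subsample of X.

    PCA is used only to guide stratification; stratifying on the first
    score vector is a simplification when n_components > 1.
    """
    rng = np.random.default_rng(seed)
    scores = PCA(n_components=n_components).fit_transform(X)[:, 0]
    # Quantile edges split the scores into roughly equal-count bins.
    edges = np.quantile(scores, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, scores, side="right") - 1,
                   0, n_bins - 1)
    idx = []
    for b in range(n_bins):
        members = np.flatnonzero(bins == b)
        if members.size == 0:  # possible with heavily tied scores
            continue
        k = max(1, round(rate * members.size))
        idx.append(rng.choice(members, size=k, replace=False))
    return np.concatenate(idx)

if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(100_000, 20))
    keep = pca_quantile_sample(X, rate=0.05)
    srs = np.random.default_rng(1).choice(len(X), size=len(keep), replace=False)
    # Crude fidelity check on one marginal, in the spirit of the abstract's
    # Wasserstein-distance guarantees (not the paper's actual metric code).
    print("PCA-QS W1:", wasserstein_distance(X[:, 0], X[keep, 0]))
    print("SRS    W1:", wasserstein_distance(X[:, 0], X[srs, 0]))
```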
Related papers
- QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution [53.13952833016505]
We propose a low-bit quantization model for real-world video super-resolution (VSR). We use a calibration dataset to measure both spatial and temporal complexity for each layer. We refine the FP and low-bit branches to achieve simultaneous optimization.
arXiv Detail & Related papers (2025-08-06T14:35:59Z)
- Adaptive Dataset Quantization [2.0105434963031463]
We introduce a versatile framework for dataset compression, namely Adaptive Dataset Quantization (ADQ). We propose a novel adaptive sampling strategy based on evaluating each generated bin's representativeness score, diversity score, and importance score. Our method not only exhibits superior generalization capability across different architectures, but also attains state-of-the-art results, surpassing DQ by an average of 3% on various datasets.
arXiv Detail & Related papers (2024-12-22T07:08:29Z)
- Unified Convergence Analysis for Score-Based Diffusion Models with Deterministic Samplers [49.1574468325115]
We introduce a unified convergence analysis framework for deterministic samplers.
Our framework achieves an iteration complexity of $\tilde{O}(d^2/\epsilon)$.
We also provide a detailed analysis of Denoising Diffusion Implicit Models (DDIM)-type samplers.
arXiv Detail & Related papers (2024-10-18T07:37:36Z)
- Bayesian tomography using polynomial chaos expansion and deep generative networks [0.0]
We present a strategy combining the excellent reconstruction performance of a variational autoencoder (VAE) with the accuracy of PCA-PCE surrogate modeling.
Within the MCMC process, the parametrization of the VAE is leveraged for prior exploration and sample proposals.
arXiv Detail & Related papers (2023-07-09T16:44:37Z)
- Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
- Importance sampling for stochastic quantum simulations [68.8204255655161]
We consider the qDrift protocol, which builds random product formulas by sampling from the Hamiltonian according to the coefficients.
We show that the simulation cost can be reduced while achieving the same accuracy, by considering the individual simulation cost during the sampling stage.
Results are confirmed by numerical simulations performed on a lattice nuclear effective field theory.
arXiv Detail & Related papers (2022-12-12T15:06:32Z)
- ClusterQ: Semantic Feature Distribution Alignment for Data-Free Quantization [111.12063632743013]
We propose a new and effective data-free quantization method termed ClusterQ.
To obtain high inter-class separability of semantic features, we cluster and align the feature distribution statistics.
We also incorporate the intra-class variance to solve class-wise mode collapse.
arXiv Detail & Related papers (2022-04-30T06:58:56Z)
- CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z)
- Self-paced Principal Component Analysis [17.333976289539457]
We propose a novel method called Self-paced PCA (SPCA) to further reduce the effect of noise and outliers.
The complexity of each sample is computed at the start of each iteration so that samples are incorporated into training from simple to more complex.
arXiv Detail & Related papers (2021-06-25T20:50:45Z)
- Empirical Bayes PCA in high dimensions [11.806200054814772]
Principal Component Analysis is known to exhibit problematic phenomena in the presence of high-dimensional noise.
We propose an Empirical Bayes PCA method that reduces this noise by estimating a structural prior for the joint distributions of the principal components.
arXiv Detail & Related papers (2020-12-21T20:43:44Z)
- Probabilistic Contrastive Principal Component Analysis [0.5286651840245514]
We propose a model-based alternative to contrastive principal component analysis (CPCA).
We show PCPCA's advantages over CPCA, including greater interpretability, uncertainty quantification and principled inference.
We demonstrate PCPCA's performance through a series of simulations and case-control experiments with datasets of gene expression, protein expression, and images.
arXiv Detail & Related papers (2020-12-14T22:21:50Z)
- Unsupervised learning of disentangled representations in deep restricted kernel machines with orthogonality constraints [15.296955630621566]
Constr-DRKM is a deep kernel method for the unsupervised learning of disentangled data representations.
We quantitatively evaluate the proposed method's effectiveness in disentangled feature learning.
arXiv Detail & Related papers (2020-11-25T11:40:10Z)
- Repulsive Mixture Models of Exponential Family PCA for Clustering [127.90219303669006]
The mixture extension of exponential family principal component analysis (EPCA) was designed to encode much more structural information about the data distribution than the traditional EPCA.
The traditional mixture of local EPCAs has the problem of model redundancy, i.e., overlaps among mixing components, which may cause ambiguity for data clustering.
In this paper, a repulsiveness-encouraging prior is introduced among mixing components and a diversified EPCA mixture (DEPCAM) model is developed in the Bayesian framework.
arXiv Detail & Related papers (2020-04-07T04:07:29Z)