Relationship-aware Multivariate Sampling Strategy for Scientific
Simulation Data
- URL: http://arxiv.org/abs/2008.13306v1
- Date: Mon, 31 Aug 2020 00:52:17 GMT
- Title: Relationship-aware Multivariate Sampling Strategy for Scientific
Simulation Data
- Authors: Subhashis Hazarika, Ayan Biswas, Phillip J. Wolfram, Earl Lawrence,
Nathan Urban
- Abstract summary: In this work, we propose a multivariate sampling strategy which preserves the original variable relationships.
Our proposed strategy utilizes principal component analysis to capture the variance of multivariate data and can be built on top of any existing state-of-the-art sampling algorithms for single variables.
- Score: 4.2855912967712815
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the increasing computational power of current supercomputers, the size
of data produced by scientific simulations is rapidly growing. To reduce the
storage footprint and facilitate scalable post-hoc analyses of such scientific
data sets, various data reduction/summarization methods have been proposed over
the years. Different flavors of sampling algorithms exist to sample the
high-resolution scientific data, while preserving important data properties
required for subsequent analyses. However, most of these sampling algorithms
are designed for univariate data and cater to post-hoc analyses of single
variables. In this work, we propose a multivariate sampling strategy which
preserves the original variable relationships and enables different
multivariate analyses directly on the sampled data. Our proposed strategy
utilizes principal component analysis to capture the variance of multivariate
data and can be built on top of any existing state-of-the-art sampling
algorithms for single variables. In addition, we also propose variants of
different data partitioning schemes (regular and irregular) to efficiently
model the local multivariate relationships. Using two real-world multivariate
data sets, we demonstrate the efficacy of our proposed multivariate sampling
strategy with respect to its data reduction capabilities as well as the ease of
performing efficient post-hoc multivariate analyses.
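
The abstract describes the approach only at a high level, so the following is a minimal sketch of the general idea rather than the authors' algorithm. It assumes that each block of a regular partition is projected onto its first principal component and that the resulting 1D scores are handed to a univariate sampler. All function names, the power-iteration PCA, and the stratified histogram sampler are hypothetical stand-ins chosen for illustration.

```python
import math
import random


def first_principal_component(points, iters=200, seed=0):
    """Return (mean, direction) of the leading PCA axis via power iteration."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]  # random start vector
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # constant data: no preferred direction
            break
        v = [x / norm for x in w]
    return mean, v


def stratified_sample_1d(values, fraction, bins=8, seed=0):
    """Hypothetical univariate sampler: draw round-robin from histogram bins."""
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant input
    buckets = [[] for _ in range(bins)]
    for i, x in enumerate(values):
        buckets[min(int((x - lo) / width), bins - 1)].append(i)
    for b in buckets:
        rng.shuffle(b)
    budget = min(len(values), max(1, round(fraction * len(values))))
    chosen = []
    while len(chosen) < budget:
        for b in buckets:
            if b and len(chosen) < budget:
                chosen.append(b.pop())
    return sorted(chosen)


def sample_block(points, fraction, seed=0):
    """Score points along the first principal component, then sample in 1D."""
    if len(points) < 2:  # covariance is undefined for a single point
        return list(range(len(points)))
    mean, pc1 = first_principal_component(points)
    scores = [sum((p[j] - mean[j]) * pc1[j] for j in range(len(pc1)))
              for p in points]
    return stratified_sample_1d(scores, fraction, seed=seed)


def sample_partitioned(points, block_size, fraction):
    """Regular partitioning: sample each contiguous block independently."""
    kept = []
    for start in range(0, len(points), block_size):
        block = points[start:start + block_size]
        kept.extend(start + i for i in sample_block(block, fraction, seed=start))
    return kept


# Synthetic two-variable data with a strong linear relationship.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(400)]
data = [(x, 2 * x + rng.gauss(0, 0.5)) for x in xs]
idx = sample_partitioned(data, block_size=100, fraction=0.1)
```

Here `idx` holds the indices of the retained points. Because every retained point keeps all of its variables, relationships between variables can be examined directly on the sample. Any other univariate sampler could replace `stratified_sample_1d` without changing the rest of the sketch.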
Related papers
- Weighted Diversified Sampling for Efficient Data-Driven Single-Cell Gene-Gene Interaction Discovery [56.622854875204645]
We present an innovative approach utilizing data-driven computational tools, leveraging an advanced Transformer model, to unearth gene-gene interactions.
A novel weighted diversified sampling algorithm computes the diversity score of each data sample in just two passes of the dataset.
arXiv Detail & Related papers (2024-10-21T03:35:23Z)
- Sparse outlier-robust PCA for multi-source data [2.3226893628361687]
We introduce a novel PCA methodology that simultaneously selects important features as well as local source-specific patterns.
We develop a regularization problem with a penalty that accommodates global-local structured sparsity patterns.
We provide an efficient implementation of our proposal via the Alternating Direction Method of Multipliers.
arXiv Detail & Related papers (2024-07-23T08:55:03Z)
- Analysing Multi-Task Regression via Random Matrix Theory with Application to Time Series Forecasting [16.640336442849282]
We formulate a multi-task optimization problem as a regularization technique to enable single-task models to leverage multi-task learning information.
We derive a closed-form solution for multi-task optimization in the context of linear models.
arXiv Detail & Related papers (2024-06-14T17:59:25Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- Multi-Task Learning with Summary Statistics [4.871473117968554]
We propose a flexible multi-task learning framework utilizing summary statistics from various sources.
We also present an adaptive parameter selection approach based on a variant of Lepski's method.
This work offers a more flexible tool for training related models across various domains, with practical implications in genetic risk prediction.
arXiv Detail & Related papers (2023-07-05T15:55:23Z)
- Multivariate regression modeling in integrative analysis via sparse regularization [0.0]
Integrative analysis is an effective method to pool useful information from multiple independent datasets.
The integration is achieved by sparse estimation that performs variable and group selection.
The performance of the proposed method is demonstrated through Monte Carlo simulation and analyzing wastewater treatment data with microbe measurements.
arXiv Detail & Related papers (2023-04-15T02:27:51Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show that combining recent results on equivariant representation learning, instantiated on structured spaces, with a simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Consistency and Diversity induced Human Motion Segmentation [231.36289425663702]
We propose a novel Consistency and Diversity induced human Motion Segmentation (CDMS) algorithm.
Our model factorizes the source and target data into distinct multi-layer feature spaces.
A multi-mutual learning strategy is carried out to reduce the domain gap between the source and target data.
arXiv Detail & Related papers (2022-02-10T06:23:56Z)
- Privacy-preserving Logistic Regression with Secret Sharing [0.0]
We propose secret sharing-based privacy-preserving logistic regression protocols using the Newton-Raphson method.
Our implementation results show that our improved method can handle large datasets used in securely training a logistic regression from multiple sources.
arXiv Detail & Related papers (2021-05-14T14:53:50Z)
- Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays [62.997667081978825]
Generalized Linear Latent Variable models (GLLVMs) generalize factor models to non-Gaussian responses.
Current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets.
We propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood.
arXiv Detail & Related papers (2020-10-06T04:28:19Z)
- Deep Representational Similarity Learning for analyzing neural signatures in task-based fMRI dataset [81.02949933048332]
This paper develops Deep Representational Similarity Learning (DRSL), a deep extension of Representational Similarity Analysis (RSA).
DRSL is appropriate for analyzing similarities between various cognitive tasks in fMRI datasets with a large number of subjects.
arXiv Detail & Related papers (2020-09-28T18:30:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.