DeepCQ: General-Purpose Deep-Surrogate Framework for Lossy Compression Quality Prediction
- URL: http://arxiv.org/abs/2512.21433v1
- Date: Wed, 24 Dec 2025 21:46:17 GMT
- Title: DeepCQ: General-Purpose Deep-Surrogate Framework for Lossy Compression Quality Prediction
- Authors: Khondoker Mirazul Mumenin, Robert Underwood, Dong Dai, Jinzhen Wang, Sheng Di, Zarija Lukić, Franck Cappello
- Abstract summary: We present a general-purpose deep-surrogate framework for lossy compression quality prediction (DeepCQ). Our results highlight the framework's exceptional predictive accuracy, with prediction errors generally under 10% across most settings.
- Score: 4.634179787231294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Error-bounded lossy compression techniques have become vital for scientific data management and analytics, given the ever-increasing volume of data generated by modern scientific simulations and instruments. Nevertheless, assessing data quality post-compression remains computationally expensive due to the intensive nature of metric calculations. In this work, we present a general-purpose deep-surrogate framework for lossy compression quality prediction (DeepCQ), with the following key contributions: 1) We develop a surrogate model for compression quality prediction that is generalizable to different error-bounded lossy compressors, quality metrics, and input datasets; 2) We adopt a novel two-stage design that decouples the computationally expensive feature-extraction stage from the light-weight metrics prediction, enabling efficient training and modular inference; 3) We optimize the model performance on time-evolving data using a mixture-of-experts design. Such a design enhances the robustness when predicting across simulation timesteps, especially when the training and test data exhibit significant variation. We validate the effectiveness of DeepCQ on four real-world scientific applications. Our results highlight the framework's exceptional predictive accuracy, with prediction errors generally under 10% across most settings, significantly outperforming existing methods. Our framework empowers scientific users to make informed decisions about data compression based on their preferred data quality, thereby significantly reducing I/O and computational overhead in scientific data analysis.
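As a rough illustration of the two-stage design described in the abstract (an expensive feature-extraction stage decoupled from lightweight per-metric heads), the PyTorch sketch below uses hypothetical module names and toy shapes. It is not the authors' implementation, and the mixture-of-experts routing across timesteps is omitted for brevity.

```python
# Minimal sketch (not the authors' code): a two-stage surrogate in the
# spirit of DeepCQ. Stage 1 extracts features from a data block once;
# Stage 2 is a lightweight head mapping (features, error bound) to a
# predicted quality metric such as PSNR. All names are hypothetical.
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):      # Stage 1: expensive, run once per block
    def __init__(self, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(4), nn.Flatten(),
            nn.Linear(16 * 4**3, feat_dim),
        )
    def forward(self, block):           # block: (N, 1, D, H, W)
        return self.net(block)

class MetricHead(nn.Module):            # Stage 2: cheap, one per compressor/metric
    def __init__(self, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim + 1, 32), nn.ReLU(),
                                 nn.Linear(32, 1))
    def forward(self, feats, log_error_bound):
        return self.net(torch.cat([feats, log_error_bound], dim=1))

extractor, head = FeatureExtractor(), MetricHead()
block = torch.randn(8, 1, 16, 16, 16)   # 8 toy data blocks
eb = torch.full((8, 1), -3.0)           # log10 of the error bound
feats = extractor(block)                # computed once, reusable across heads
predicted_psnr = head(feats, eb)        # fast per-(compressor, metric) query
```

Because the extracted features are reusable, adding a new metric or compressor only requires training another small head, which is the modular-inference benefit the abstract describes.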
Related papers
- Data Distribution Matters: A Data-Centric Perspective on Context Compression for Large Language Model [20.1054266241262]
We investigate how data distribution impacts compression quality, including two dimensions: input data and intrinsic data. We show that encoder-measured input entropy negatively correlates with compression quality, while decoder-measured entropy shows no significant relationship under a frozen-decoder setting.
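As a minimal sketch of how one might measure the encoder-side token entropy referred to above; the function below is an assumption about the measurement, not code from the paper.

```python
# Hedged sketch: mean per-token entropy of a model's output distribution,
# one plausible reading of "encoder-measured input entropy".
import torch
import torch.nn.functional as F

def mean_token_entropy(logits):          # logits: (seq_len, vocab_size)
    log_p = F.log_softmax(logits, dim=-1)
    return -(log_p.exp() * log_p).sum(-1).mean().item()

print(mean_token_entropy(torch.randn(10, 100)))  # toy usage
```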
arXiv Detail & Related papers (2026-02-02T08:01:57Z)
- An Investigation on Machine Learning Predictive Accuracy Improvement and Uncertainty Reduction using VAE-based Data Augmentation [2.517043342442487]
Deep generative learning uses certain ML models to learn the underlying distribution of existing data and generate synthetic samples that resemble the real data.
In this study, our objective is to evaluate the effectiveness of data augmentation using variational autoencoder (VAE)-based deep generative models.
We investigated whether the data augmentation leads to improved accuracy in the predictions of a deep neural network (DNN) model trained using the augmented data.
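A minimal sketch of the VAE-based augmentation loop being evaluated, under generic assumptions (flat inputs, Gaussian latent); all names are illustrative, not the study's code.

```python
# Hedged sketch: a small VAE whose samples augment a training set.
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim, z_dim=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU())
        self.mu, self.logvar = nn.Linear(64, z_dim), nn.Linear(64, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(),
                                 nn.Linear(64, x_dim))
    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar):      # reconstruction + KL divergence
    rec = ((recon - x) ** 2).sum()
    kld = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()
    return rec + kld

# After training, decode z ~ N(0, I) to get synthetic augmentation samples:
# synthetic = vae.dec(torch.randn(n_new, 8))
# train the downstream DNN on torch.cat([real, synthetic])
```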
arXiv Detail & Related papers (2024-10-24T18:15:48Z)
- Evaluating the Impact of Compression Techniques on Task-Specific Performance of Large Language Models [0.0]
Large language models (LLMs) offer powerful capabilities but incur substantial computational costs.
This study evaluates the impact of popular compression methods on the LLaMA-2-7B model.
We show that while SparseGPT and Wanda preserve perplexity even at 50% sparsity, they suffer significant degradation on downstream tasks.
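For context, here is a schematic of Wanda-style pruning at 50% sparsity (score = |weight| times input-activation norm, compared within each output row); this is a hedged reimplementation sketch, not the released Wanda code.

```python
# Hedged sketch of Wanda-style per-output-row pruning.
import torch

def wanda_prune(weight, act_norm, sparsity=0.5):
    # weight: (out_features, in_features); act_norm: (in_features,) column L2 norms
    score = weight.abs() * act_norm          # broadcast over rows
    k = int(weight.shape[1] * sparsity)
    idx = score.argsort(dim=1)[:, :k]        # lowest-scoring weights per row
    mask = torch.ones_like(weight)
    mask.scatter_(1, idx, 0.0)
    return weight * mask

W = torch.randn(4, 8)
norms = torch.rand(8)
print((wanda_prune(W, norms) == 0).float().mean())  # ~0.5 sparsity
```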
arXiv Detail & Related papers (2024-09-17T14:34:11Z)
- NeurLZ: An Online Neural Learning-Based Method to Enhance Scientific Lossy Compression [34.30562110131907]
NeurLZ is a neural method designed to enhance lossy compression by integrating online learning, cross-field learning, and robust error regulation. During the first five learning epochs, NeurLZ achieves an 89% bit rate reduction, with further optimization yielding up to around 94% reduction at equivalent distortion.
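A hedged sketch of the online-learning idea: fit a small network to predict the residual between the original and decompressed fields, then add the predicted residual back. Architecture and training loop are illustrative, not NeurLZ's.

```python
# Hedged sketch: online residual correction for a decompressed 2D field.
import torch
import torch.nn as nn

corrector = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(8, 1, 3, padding=1))
opt = torch.optim.Adam(corrector.parameters(), lr=1e-3)

def online_step(original, decompressed):    # both: (N, 1, H, W)
    residual_pred = corrector(decompressed)
    loss = ((residual_pred - (original - decompressed)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return decompressed + residual_pred.detach()  # enhanced reconstruction
```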
arXiv Detail & Related papers (2024-09-09T16:48:09Z)
- CogDPM: Diffusion Probabilistic Models via Cognitive Predictive Coding [62.075029712357]
This work introduces the Cognitive Diffusion Probabilistic Models (CogDPM).
CogDPM features a precision estimation method based on the hierarchical sampling capabilities of diffusion models and weights the guidance with precision weights estimated from the inherent properties of diffusion models.
We apply CogDPM to real-world prediction tasks using the United Kingdom precipitation and surface wind datasets.
arXiv Detail & Related papers (2024-05-03T15:54:50Z)
- Spatiotemporally adaptive compression for scientific dataset with feature preservation -- a case study on simulation data with extreme climate events analysis [11.299989876672605]
We propose a technique that addresses storage costs while improving post-analysis accuracy through adaptive, error-controlled lossy compression.
We integrate cyclone feature detection with data compression and demonstrate that performing adaptive error-bounded compression in higher dimensional space enables greater compression ratios.
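A schematic of the region-adaptive idea: assign a tight error bound to blocks overlapping detected features (e.g., cyclones) and a loose bound elsewhere. The `compress` call mentioned in the comments is a placeholder for any error-bounded compressor (SZ, ZFP, ...); its signature is hypothetical.

```python
# Hedged sketch: per-block error bounds driven by a feature mask.
import numpy as np

def adaptive_bounds(field, feature_mask, tight=1e-4, loose=1e-2, block=32):
    chunks = []
    for i in range(0, field.shape[0], block):
        for j in range(0, field.shape[1], block):
            sub = field[i:i+block, j:j+block]
            eb = tight if feature_mask[i:i+block, j:j+block].any() else loose
            chunks.append(((i, j), eb, sub))  # hand (sub, eb) to compress(sub, eb)
    return chunks

field = np.random.rand(128, 128)
mask = np.zeros((128, 128), dtype=bool)
mask[40:60, 40:60] = True                     # toy detected feature region
print(len(adaptive_bounds(field, mask)))      # 16 blocks with per-block bounds
```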
arXiv Detail & Related papers (2024-01-06T22:32:34Z)
- Transition Role of Entangled Data in Quantum Machine Learning [51.6526011493678]
Entanglement serves as the resource to empower quantum computing.
Recent progress has highlighted its positive impact on learning quantum dynamics.
We establish a quantum no-free-lunch (NFL) theorem for learning quantum dynamics using entangled data.
arXiv Detail & Related papers (2023-06-06T08:06:43Z)
- Learning Sample Difficulty from Pre-trained Models for Reliable Prediction [55.77136037458667]
We propose to utilize large-scale pre-trained models to guide downstream model training with sample difficulty-aware entropy regularization.
We simultaneously improve accuracy and uncertainty calibration across challenging benchmarks.
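A hedged sketch of difficulty-aware entropy regularization: a per-sample difficulty score (e.g., derived from a pre-trained model's loss) scales an entropy term added to the task loss. The exact weighting below is an assumption, not the paper's formulation.

```python
# Hedged sketch: difficulty-scaled entropy term on top of cross-entropy.
import torch
import torch.nn.functional as F

def regularized_loss(logits, targets, difficulty, lam=0.1):
    ce = F.cross_entropy(logits, targets, reduction="none")
    p = F.softmax(logits, dim=-1)
    entropy = -(p * p.clamp_min(1e-8).log()).sum(-1)
    # encourage higher predictive entropy on hard samples to curb overconfidence
    return (ce - lam * difficulty * entropy).mean()

logits = torch.randn(16, 10, requires_grad=True)
targets = torch.randint(0, 10, (16,))
difficulty = torch.rand(16)                  # 0 = easy, 1 = hard (toy scores)
regularized_loss(logits, targets, difficulty).backward()
```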
arXiv Detail & Related papers (2023-04-20T07:29:23Z)
- Learning Accurate Performance Predictors for Ultrafast Automated Model Compression [86.22294249097203]
We propose an ultrafast automated model compression framework called SeerNet for flexible network deployment.
Our method achieves competitive accuracy-complexity trade-offs with significant reduction of the search cost.
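A schematic of predictor-based search of the kind SeerNet describes: a small regressor maps compression configurations to predicted accuracy, and the best predicted configuration under a complexity budget is selected. Shapes and the cost proxy are illustrative assumptions.

```python
# Hedged sketch: accuracy predictor + budget-constrained config search.
import torch
import torch.nn as nn

predictor = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
configs = torch.rand(1024, 16)               # candidate per-layer settings
cost = configs.mean(dim=1)                   # toy complexity proxy
scores = predictor(configs).squeeze(1)       # predicted accuracy (untrained here)
feasible = cost <= 0.5                       # complexity budget
best = configs[feasible][scores[feasible].argmax()]
```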
arXiv Detail & Related papers (2023-04-13T10:52:49Z)
- DeepVol: Volatility Forecasting from High-Frequency Data with Dilated Causal Convolutions [53.37679435230207]
We propose DeepVol, a model based on Dilated Causal Convolutions that uses high-frequency data to forecast day-ahead volatility.
Our empirical results suggest that the proposed deep learning-based approach effectively learns global features from high-frequency data.
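For reference, here is a minimal dilated causal convolution stack of the kind DeepVol builds on: left-padding keeps each output dependent only on past inputs, and growing the dilation widens the receptive field. Hyperparameters are illustrative; this is not the DeepVol architecture.

```python
# Hedged sketch: a dilated causal Conv1d block with exponentially growing dilation.
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    def __init__(self, ch_in, ch_out, kernel=2, dilation=1):
        super().__init__()
        self.pad = (kernel - 1) * dilation    # left-pad only => causal
        self.conv = nn.Conv1d(ch_in, ch_out, kernel, dilation=dilation)
    def forward(self, x):                     # x: (N, C, T)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

stack = nn.Sequential(*[CausalConv1d(1 if d == 1 else 8, 8, dilation=d)
                        for d in (1, 2, 4, 8)])
y = stack(torch.randn(1, 1, 64))              # (1, 8, 64); output at t sees only <= t
```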
arXiv Detail & Related papers (2022-09-23T16:13:47Z)
- Semantic Perturbations with Normalizing Flows for Improved Generalization [62.998818375912506]
We show that perturbations in the latent space can be used to define fully unsupervised data augmentations.
We find that latent adversarial perturbations that adapt to the classifier throughout its training are the most effective.
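A hedged sketch of classifier-adaptive latent perturbation: encode with an invertible flow, take a gradient step on the classifier loss in latent space, and decode. `flow` and `classifier` are placeholders; no specific flow architecture is assumed.

```python
# Hedged sketch: adversarial perturbation in a normalizing flow's latent space.
import torch

def latent_adv_augment(x, y, flow, classifier, eps=0.1):
    z = flow.encode(x)                            # hypothetical invertible encode
    z = z.detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(classifier(flow.decode(z)), y)
    (grad,) = torch.autograd.grad(loss, z)
    z_adv = z + eps * grad / (grad.norm() + 1e-8)  # semantic step in latent space
    return flow.decode(z_adv).detach()             # augmented sample
```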
arXiv Detail & Related papers (2021-08-18T03:20:00Z)