The Bearable Lightness of Big Data: Towards Massive Public Datasets in
Scientific Machine Learning
- URL: http://arxiv.org/abs/2207.12546v1
- Date: Mon, 25 Jul 2022 21:44:53 GMT
- Title: The Bearable Lightness of Big Data: Towards Massive Public Datasets in
Scientific Machine Learning
- Authors: Wai Tong Chung and Ki Sung Jung and Jacqueline H. Chen and Matthias
Ihme
- Abstract summary: We show that lossy compression algorithms offer a realistic pathway for exposing high-fidelity scientific data to open-source data repositories.
In this paper, we outline, construct, and evaluate the requirements for establishing a big data framework.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In general, large datasets enable deep learning models to perform with good
accuracy and generalizability. However, massive high-fidelity simulation
datasets (from molecular chemistry, astrophysics, computational fluid dynamics
(CFD), etc.) can be challenging to curate due to dimensionality and storage
constraints. Lossy compression algorithms can help mitigate limitations from
storage, as long as the overall data fidelity is preserved. To illustrate this
point, we demonstrate that deep learning models, trained and tested on data
from a petascale CFD simulation, are robust to errors introduced during lossy
compression in a semantic segmentation problem. Our results demonstrate that
lossy compression algorithms offer a realistic pathway for exposing
high-fidelity scientific data to open-source data repositories for building
community datasets. In this paper, we outline, construct, and evaluate the
requirements for establishing a big data framework, demonstrated at
https://blastnet.github.io/, for scientific machine learning.
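To make the compression claim concrete, below is a minimal, hypothetical sketch of the compress-then-verify loop: a small synthetic 3D array stands in for a CFD snapshot, half-precision quantization plus zlib stands in for a production error-bounded compressor (e.g., zfp or SZ), and a relative L2 norm serves as the fidelity check. This illustrates the idea only; it is not the paper's actual pipeline.

```python
import zlib
import numpy as np

# Hypothetical stand-in for a petascale CFD snapshot: a small 3D scalar field.
rng = np.random.default_rng(0)
field = rng.standard_normal((64, 64, 64)).astype(np.float64)

# Lossy stage: quantize to float16; lossless stage: entropy-code with zlib.
lossy = field.astype(np.float16)
payload = zlib.compress(lossy.tobytes(), 9)

# Decompression: invert the lossless stage, then upcast back to float64.
restored = np.frombuffer(zlib.decompress(payload), dtype=np.float16)
restored = restored.reshape(field.shape).astype(np.float64)

# Fidelity check: relative L2 error should stay below a task-dependent budget.
rel_err = np.linalg.norm(field - restored) / np.linalg.norm(field)
ratio = field.nbytes / len(payload)
print(f"compression ratio ~{ratio:.1f}x, relative L2 error ~{rel_err:.2e}")
```

In the paper's setting, fidelity is ultimately judged downstream: segmentation models trained and tested on the compressed fields are compared against models using the original data.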
Related papers
- Compressing high-resolution data through latent representation encoding for downscaling large-scale AI weather forecast model [10.634513279883913]
We propose a variational autoencoder framework tailored for compressing high-resolution datasets.
Our framework successfully reduced the storage size of 3 years of HRCLDAS data from 8.61 TB to just 204 GB, while preserving essential information.
arXiv Detail & Related papers (2024-10-10T05:38:03Z)
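As a rough illustration of the variational-autoencoder compression idea in the entry above (the architecture, sizes, and loss weight here are invented for the sketch and are not the authors' model):

```python
import torch
import torch.nn as nn

class CompressionVAE(nn.Module):
    """Toy VAE mapping a flattened field to a small latent code (hypothetical
    architecture; the paper's model for HRCLDAS data is far larger)."""
    def __init__(self, n_in=4096, n_latent=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_in, 512), nn.ReLU())
        self.mu = nn.Linear(512, n_latent)
        self.logvar = nn.Linear(512, n_latent)
        self.dec = nn.Sequential(nn.Linear(n_latent, 512), nn.ReLU(),
                                 nn.Linear(512, n_in))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar, beta=1e-3):
    recon = torch.mean((x - x_hat) ** 2)                        # reconstruction
    kl = -0.5 * torch.mean(1 + logvar - mu**2 - logvar.exp())   # KL to N(0, I)
    return recon + beta * kl

model = CompressionVAE()
x = torch.randn(8, 4096)            # stand-in for flattened weather fields
x_hat, mu, logvar = model(x)
print(vae_loss(x, x_hat, mu, logvar).item())
```

The storage saving comes from archiving only the low-dimensional latent codes plus decoder weights; reconstruction is lossy, so a fidelity check like the one sketched earlier still applies.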
- NeurLZ: On Enhancing Lossy Compression Performance based on Error-Controlled Neural Learning for Scientific Data [35.36879818366783]
Large-scale scientific simulations generate massive datasets that pose challenges for storage and I/O.
We propose NeurLZ, a novel cross-field learning-based and error-controlled compression framework for scientific data.
arXiv Detail & Related papers (2024-09-09T16:48:09Z)
- Enabling High Data Throughput Reinforcement Learning on GPUs: A Domain Agnostic Framework for Data-Driven Scientific Research [90.91438597133211]
We introduce WarpSci, a framework designed to overcome crucial system bottlenecks in the application of reinforcement learning.
We eliminate the need for data transfer between the CPU and GPU, enabling the concurrent execution of thousands of simulations.
arXiv Detail & Related papers (2024-08-01T21:38:09Z)
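A toy sketch of the GPU-resident pattern the entry above describes, assuming PyTorch and placeholder dynamics (WarpSci itself is a full RL framework, not a three-line integrator): thousands of simulation states live on the GPU and step in lockstep, with no per-step CPU-GPU transfer.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
n_envs, dt = 4096, 0.01

# Toy dynamics: a batch of damped harmonic oscillators, one per "environment".
pos = torch.randn(n_envs, device=device)
vel = torch.zeros(n_envs, device=device)

for _ in range(1000):
    acc = -pos - 0.1 * vel       # spring force + damping, batched over envs
    vel = vel + dt * acc         # explicit Euler update for all envs at once
    pos = pos + dt * vel         # no .cpu()/.numpy() calls in the hot loop

print(pos.abs().mean().item())   # a single transfer, only after the loop
```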
- Generative Expansion of Small Datasets: An Expansive Graph Approach [13.053285552524052]
We introduce an Expansive Synthesis model generating large-scale, information-rich datasets from minimal samples.
An autoencoder with self-attention layers and optimal transport refines distributional consistency.
Results show comparable performance, demonstrating the model's potential to augment training data effectively.
arXiv Detail & Related papers (2024-06-25T02:59:02Z)
- Computationally and Memory-Efficient Robust Predictive Analytics Using Big Data [0.0]
This study navigates through the challenges of data uncertainties, storage limitations, and predictive data-driven modeling using big data.
We utilize Robust Principal Component Analysis (RPCA) for effective noise reduction and outlier elimination, and Optimal Sensor Placement (OSP) for efficient data compression and storage.
arXiv Detail & Related papers (2024-03-27T22:39:08Z)
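For reference, Robust PCA admits a compact textbook implementation via principal component pursuit with an inexact augmented Lagrangian; the sketch below is a generic implementation, not the study's code.

```python
import numpy as np

def shrink(X, tau):
    """Elementwise soft-thresholding (proximal operator of the l1 norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular value thresholding (proximal operator of the nuclear norm)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def rpca(M, n_iter=200):
    """Split M into low-rank L plus sparse S via principal component pursuit."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))          # standard PCP weight
    mu = m * n / (4.0 * np.abs(M).sum())    # common step-size heuristic
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)
    for _ in range(n_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)   # low-rank update
        S = shrink(M - L + Y / mu, lam / mu)  # sparse (outlier) update
        Y += mu * (M - L - S)               # dual ascent on the constraint
    return L, S

rng = np.random.default_rng(0)
low_rank = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 80))
sparse = np.where(rng.random((100, 80)) < 0.05, 10.0, 0.0)  # gross outliers
L, S = rpca(low_rank + sparse)
print(np.linalg.matrix_rank(L, tol=1e-6), np.count_nonzero(np.abs(S) > 1e-6))
```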
- Dataset Quantization [72.61936019738076]
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
arXiv Detail & Related papers (2023-08-21T07:24:29Z)
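Dataset quantization proper uses a more elaborate bin-and-sample procedure; as a generic stand-in showing how a large dataset can be compressed into a representative subset, here is a farthest-point (k-center) selection over hypothetical feature embeddings.

```python
import numpy as np

def greedy_k_center(features, k, seed=0):
    """Pick k samples whose features cover the dataset (farthest-point greedy).
    A generic coreset heuristic, shown only to illustrate subset selection."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    chosen = [int(rng.integers(n))]
    # distance of every point to its nearest already-chosen sample
    d = np.linalg.norm(features - features[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(d))             # farthest point from current subset
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(features - features[nxt], axis=1))
    return np.array(chosen)

feats = np.random.default_rng(1).standard_normal((1000, 32))  # e.g. embeddings
subset = greedy_k_center(feats, k=50)
print(subset.shape)  # (50,) indices of a compact, diverse training subset
```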
- Deep Generative Modeling-based Data Augmentation with Demonstration using the BFBT Benchmark Void Fraction Datasets [3.341975883864341]
This paper explores the applications of deep generative models (DGMs) that have been widely used for image data generation to scientific data augmentation.
Once trained, DGMs can be used to generate synthetic data that are similar to the training data and significantly expand the dataset size.
arXiv Detail & Related papers (2023-08-19T22:19:41Z)
- LargeST: A Benchmark Dataset for Large-Scale Traffic Forecasting [65.71129509623587]
Road traffic forecasting plays a critical role in smart city initiatives and has experienced significant advancements thanks to the power of deep learning.
However, the promising results achieved on current public datasets may not be applicable to practical scenarios.
We introduce the LargeST benchmark dataset, which includes a total of 8,600 sensors in California with a 5-year time coverage.
arXiv Detail & Related papers (2023-06-14T05:48:36Z)
- Minimizing the Accumulated Trajectory Error to Improve Dataset Distillation [151.70234052015948]
We propose a novel approach that encourages the optimization algorithm to seek a flat trajectory.
We show that weights trained on synthetic data are robust against accumulated-error perturbations, owing to the regularization towards a flat trajectory.
Our method, called Flat Trajectory Distillation (FTD), is shown to boost the performance of gradient-matching methods by up to 4.7%.
arXiv Detail & Related papers (2022-11-20T15:49:11Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- Synthetic Data: Opening the data floodgates to enable faster, more directed development of machine learning methods [96.92041573661407]
Many ground-breaking advancements in machine learning can be attributed to the availability of a large volume of rich data.
Many large-scale datasets are highly sensitive, such as healthcare data, and are not widely available to the machine learning community.
Generating synthetic data with privacy guarantees provides one such solution.
arXiv Detail & Related papers (2020-12-08T17:26:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.