WxC-Bench: A Novel Dataset for Weather and Climate Downstream Tasks
- URL: http://arxiv.org/abs/2412.02780v1
- Date: Tue, 03 Dec 2024 19:20:27 GMT
- Title: WxC-Bench: A Novel Dataset for Weather and Climate Downstream Tasks
- Authors: Rajat Shinde, Christopher E. Phillips, Kumar Ankur, Aman Gupta, Simon Pfreundschuh, Sujit Roy, Sheyenne Kirkland, Vishal Gaur, Amy Lin, Aditi Sheshadri, Udaysankar Nair, Manil Maskey, Rahul Ramachandran
- Abstract summary: High-quality machine learning (ML)-ready datasets play a foundational role in developing new artificial intelligence (AI) models.
Here we introduce WxC-Bench, a multi-modal dataset designed to support the development of generalizable AI models.
We provide a comprehensive description of the dataset and also present a technical validation for baseline analysis.
- Abstract: High-quality machine learning (ML)-ready datasets play a foundational role in developing new artificial intelligence (AI) models or fine-tuning existing models for scientific applications such as weather and climate analysis. Unfortunately, despite the growing development of new deep learning models for weather and climate, there is a scarcity of curated, pre-processed ML-ready datasets. Curating such high-quality datasets for developing new models is challenging, particularly because the modality of the input data varies significantly for different downstream tasks addressing different atmospheric scales (spatial and temporal). Here we introduce WxC-Bench (Weather and Climate Bench), a multi-modal dataset designed to support the development of generalizable AI models for downstream use cases in weather and climate research. WxC-Bench is designed as a dataset of datasets for developing ML models for a complex weather and climate system, framing selected downstream tasks as machine learning problems. WxC-Bench encompasses several atmospheric processes from the meso-$\beta$ scale (20-200 km) to the synoptic scale (2500 km), such as aviation turbulence, hurricane intensity and track monitoring, weather analog search, gravity wave parameterization, and natural language report generation. We provide a comprehensive description of the dataset and also present a technical validation for baseline analysis. The dataset and code to prepare the ML-ready data have been made publicly available on Hugging Face: https://huggingface.co/datasets/nasa-impact/WxC-Bench
Related papers
- Generalizing Weather Forecast to Fine-grained Temporal Scales via Physics-AI Hybrid Modeling [55.13352174687475]
This paper proposes a physics-AI hybrid model (i.e., WeatherGFT) which generalizes weather forecasts to finer-grained temporal scales beyond training dataset.
Specifically, we employ a carefully designed PDE kernel to simulate physical evolution on a small time scale.
We also introduce a lead time-aware training framework to promote the generalization of the model at different lead times.
arXiv Detail & Related papers (2024-05-22T16:21:02Z)
- ClimateSet: A Large-Scale Climate Model Dataset for Machine Learning [26.151056828513962]
Climate models have been key for assessing the impact of climate change and simulating future climate scenarios.
The machine learning (ML) community has taken an increased interest in supporting climate scientists' efforts on various tasks such as climate model emulation, downscaling, and prediction tasks.
Here, we introduce ClimateSet, a dataset containing the inputs and outputs of 36 climate models from the Input4MIPs and CMIP6 archives.
arXiv Detail & Related papers (2023-11-07T04:55:36Z)
- Pushing the Limits of Pre-training for Time Series Forecasting in the CloudOps Domain [54.67888148566323]
We introduce three large-scale time series forecasting datasets from the cloud operations domain.
We show it is a strong zero-shot baseline and benefits from further scaling, both in model and dataset size.
Accompanying these datasets and results is a suite of comprehensive benchmark results comparing classical and deep learning baselines to our pre-trained method.
arXiv Detail & Related papers (2023-10-08T08:09:51Z)
- Unleashing Realistic Air Quality Forecasting: Introducing the Ready-to-Use PurpleAirSF Dataset [4.190243190157989]
This paper introduces PurpleAirSF, a comprehensive and easily accessible dataset from the PurpleAir network.
We present a detailed account of the data collection and processing methods employed to build PurpleAirSF.
We conduct preliminary experiments using both classic and modern spatio-temporal forecasting models, thereby establishing a benchmark for future air quality forecasting tasks.
arXiv Detail & Related papers (2023-06-24T12:10:16Z)
- ClimaX: A foundation model for weather and climate [51.208269971019504]
ClimaX is a deep learning model for weather and climate science.
It can be pre-trained with a self-supervised learning objective on climate datasets.
It can be fine-tuned to address a breadth of climate and weather tasks.
arXiv Detail & Related papers (2023-01-24T23:19:01Z)
- Learning to Simulate Realistic LiDARs [66.7519667383175]
We introduce a pipeline for data-driven simulation of a realistic LiDAR sensor.
We show that our model can learn to encode realistic effects such as dropped points on transparent surfaces.
We use our technique to learn models of two distinct LiDAR sensors and use them to improve simulated LiDAR data accordingly.
arXiv Detail & Related papers (2022-09-22T13:12:54Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- SubseasonalClimateUSA: A Dataset for Subseasonal Forecasting and Benchmarking [20.442879707675115]
SubseasonalClimateUSA is a curated dataset for training and benchmarking subseasonal forecasting models in the United States.
We use this dataset to benchmark a diverse suite of models, including operational dynamical models, classical meteorological baselines, and ten state-of-the-art machine learning and deep learning-based methods from the literature.
arXiv Detail & Related papers (2021-09-21T18:42:10Z)
- WeatherBench: A benchmark dataset for data-driven weather forecasting [17.76377510880905]
We present a benchmark dataset for data-driven medium-range weather forecasting.
We provide data derived from the ERA5 archive that has been processed to facilitate the use in machine learning models.
We provide baseline scores from simple linear regression techniques, deep learning models, as well as purely physical forecasting models.
arXiv Detail & Related papers (2020-02-02T19:20:46Z)