On the choice of training data for machine learning of geostrophic
mesoscale turbulence
- URL: http://arxiv.org/abs/2307.00734v1
- Date: Mon, 3 Jul 2023 03:43:21 GMT
- Title: On the choice of training data for machine learning of geostrophic
mesoscale turbulence
- Authors: F. E. Yan, J. Mak, Y. Wang
- Abstract summary: 'Data' plays a central role in data-driven methods, but is not often the subject of focus in investigations of machine learning algorithms.
We consider the case of eddy-mean interaction in rotating stratified turbulence in the presence of lateral boundaries.
- Score: 0.34376560669160383
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 'Data' plays a central role in data-driven methods, but is not often the
subject of focus in investigations of machine learning algorithms as applied to
Earth System Modeling related problems. Here we consider the case of eddy-mean
interaction in rotating stratified turbulence in the presence of lateral
boundaries, a problem of relevance to ocean modeling, where the eddy fluxes
contain dynamically inert rotational components that are expected to
contaminate the learning process. An often utilized choice in the literature is
to learn from the divergence of the eddy fluxes. Here we provide theoretical
arguments and numerical evidence that learning from the eddy fluxes with the
rotational component appropriately filtered out results in models with
comparable or better skill, but substantially improved robustness. If we simply
want a data-driven model to have predictive skill, then the choice and/or
quality of the data may not be critical, but we argue it is highly desirable
and perhaps even necessary if we want to leverage data-driven methods to aid in
discovering unknown or hidden physical processes within the data itself.
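As background for why the rotational component can be filtered without altering the dynamics, the argument can be sketched with a standard two-dimensional Helmholtz decomposition (the notation here is illustrative and is not necessarily the exact decomposition or boundary treatment used in the paper, where the presence of lateral boundaries makes the split non-unique):

    \mathbf{F} = \nabla \phi + \hat{\mathbf{k}} \times \nabla \psi ,
    \qquad
    \nabla \cdot \mathbf{F} = \nabla^{2} \phi .

The rotational part \hat{\mathbf{k}} \times \nabla \psi is divergence-free, so it never enters the mean equations through \nabla \cdot \mathbf{F} and is dynamically inert. Learning from \nabla \cdot \mathbf{F} discards it implicitly, whereas learning from \mathbf{F} itself requires it to be filtered out explicitly, which is the choice the abstract argues yields comparable or better skill with improved robustness.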
Related papers
- Machine learning in wastewater treatment: insights from modelling a pilot denitrification reactor [0.0]
We use data from a pilot reactor at the Veas treatment facility in Norway to explore how machine learning can be used to optimize biological nitrate removal.
Rather than focusing solely on predictive accuracy, our approach prioritizes understanding the foundational requirements for effective data-driven modelling.
arXiv Detail & Related papers (2024-12-18T16:49:23Z)
- Capturing the Temporal Dependence of Training Data Influence [100.91355498124527]
We formalize the concept of trajectory-specific leave-one-out influence, which quantifies the impact of removing a data point during training.
We propose data value embedding, a novel technique enabling efficient approximation of trajectory-specific LOO.
As data value embedding captures training data ordering, it offers valuable insights into model training dynamics.
arXiv Detail & Related papers (2024-12-12T18:28:55Z)
- On Inductive Biases for Machine Learning in Data Constrained Settings [0.0]
This thesis explores a different answer to the problem of learning expressive models in data-constrained settings.
Instead of relying on big datasets to train neural networks, we replace some modules with known functions that reflect the structure of the data.
Our approach falls under the umbrella of "inductive biases", which can be defined as hypotheses about the data at hand that restrict the space of models to explore.
arXiv Detail & Related papers (2023-02-21T14:22:01Z)
- Data Models for Dataset Drift Controls in Machine Learning With Optical Images [8.818468649062932]
A primary failure mode is a performance drop due to differences between the training and deployment data.
Existing approaches do not account for explicit models of the primary object of interest: the data.
We demonstrate how such data models can be constructed for image data and used to control downstream machine learning model performance related to dataset drift.
arXiv Detail & Related papers (2022-11-04T16:50:10Z)
- Advancing Reacting Flow Simulations with Data-Driven Models [50.9598607067535]
Key to effective use of machine learning tools in multi-physics problems is to couple them to physical and computer models.
The present chapter reviews some of the open opportunities for the application of data-driven reduced-order modeling of combustion systems.
arXiv Detail & Related papers (2022-09-05T16:48:34Z)
- Deep Active Learning with Noise Stability [24.54974925491753]
Uncertainty estimation for unlabeled data is crucial to active learning.
We propose a novel algorithm that leverages noise stability to estimate data uncertainty.
Our method is generally applicable in various tasks, including computer vision, natural language processing, and structural data analysis.
arXiv Detail & Related papers (2022-05-26T13:21:01Z)
- Architectural Optimization and Feature Learning for High-Dimensional Time Series Datasets [0.7388859384645262]
We study the problem of predicting the presence of transient noise artifacts in a gravitational wave detector.
We introduce models that reduce the error rate by over 60% compared to the previous state of the art.
arXiv Detail & Related papers (2022-02-27T23:41:23Z)
- Deep invariant networks with differentiable augmentation layers [87.22033101185201]
Methods for learning data augmentation policies require held-out data and are based on bilevel optimization problems.
We show that our approach is easier and faster to train than modern automatic data augmentation techniques.
arXiv Detail & Related papers (2022-02-04T14:12:31Z)
- Using Data Assimilation to Train a Hybrid Forecast System that Combines Machine-Learning and Knowledge-Based Components [52.77024349608834]
We consider the problem of data-assisted forecasting of chaotic dynamical systems when the available data is noisy partial measurements.
We show that by using partial measurements of the state of the dynamical system, we can train a machine learning model to improve predictions made by an imperfect knowledge-based model.
arXiv Detail & Related papers (2021-02-15T19:56:48Z)
- Model-Based Deep Learning [155.063817656602]
Signal processing, communications, and control have traditionally relied on classical statistical modeling techniques.
Deep neural networks (DNNs) use generic architectures which learn to operate from data, and demonstrate excellent performance.
We are interested in hybrid techniques that combine principled mathematical models with data-driven systems to benefit from the advantages of both approaches.
arXiv Detail & Related papers (2020-12-15T16:29:49Z)
- Graph Embedding with Data Uncertainty [113.39838145450007]
Spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines.
Most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.