Data Augmentation Scheme for Raman Spectra with Highly Correlated
Annotations
- URL: http://arxiv.org/abs/2402.00851v1
- Date: Thu, 1 Feb 2024 18:46:28 GMT
- Title: Data Augmentation Scheme for Raman Spectra with Highly Correlated
Annotations
- Authors: Christoph Lange, Isabel Thiele, Lara Santolin, Sebastian L. Riedel,
Maxim Borisyak, Peter Neubauer and M. Nicolas Cruz Bournazou
- Abstract summary: We exploit the additive nature of spectra in order to generate additional data points from a given dataset that have statistically independent labels.
We show that training a CNN on these generated data points improves the performance on datasets where the annotations do not bear the same correlation as the dataset that was used for model training.
- Score: 0.23090185577016453
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: In biotechnology, Raman spectroscopy is rapidly gaining popularity as a
process analytical technology (PAT) that measures cell densities as well as
substrate and product concentrations. Because it records vibrational modes of
molecules, it provides that information non-invasively in a single spectrum. Typically,
partial least squares (PLS) is the model of choice to infer information about
variables of interest from the spectra. However, biological processes are known
for their complexity, and here convolutional neural networks (CNNs) present a
powerful alternative. They can handle non-Gaussian noise and account for beam
misalignment, pixel malfunctions or the presence of additional substances.
However, they require a lot of data during model training, and they pick up
non-linear dependencies in the process variables. In this work, we exploit the
additive nature of spectra in order to generate additional data points from a
given dataset that have statistically independent labels so that a network
trained on such data exhibits low correlations between the model predictions.
We show that training a CNN on these generated data points improves the
performance on datasets where the annotations do not bear the same correlation
as the dataset that was used for model training. This data augmentation
technique enables us to reuse spectra as training data for new contexts that
exhibit different correlations. The additional data allows for building a
better and more robust model. This is of interest in scenarios where large
amounts of historical data are available but are currently not used for model
training. We demonstrate the capabilities of the proposed method using
synthetic spectra of Ralstonia eutropha batch cultivations to monitor
substrate, biomass and polyhydroxyalkanoate (PHA) biopolymer concentrations
during the experiments.
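The abstract does not spell out the exact augmentation procedure, so the following is only a minimal sketch of the general idea, assuming spectra are strictly linear in the concentrations and free of noise or background. The function `augment_additive`, the Gaussian toy peaks, and the uniform weight distribution are all hypothetical choices made for illustration, not the authors' method. New samples are formed as random weighted sums of two existing spectra, and the same weights are applied to the labels.

```python
import numpy as np


def augment_additive(spectra, labels, n_new, rng):
    """Build new (spectrum, label) pairs as random weighted sums of two samples.

    Assumes spectra are additive: a weighted sum of two spectra corresponds
    to the same weighted sum of their concentration labels.
    """
    n = spectra.shape[0]
    i = rng.integers(0, n, size=n_new)          # first parent sample
    j = rng.integers(0, n, size=n_new)          # second parent sample
    a = rng.uniform(0.0, 1.0, size=(n_new, 1))  # independent weights that
    b = rng.uniform(0.0, 1.0, size=(n_new, 1))  # need not sum to one
    new_spectra = a * spectra[i] + b * spectra[j]
    new_labels = a * labels[i] + b * labels[j]
    return new_spectra, new_labels


# Toy data (hypothetical): two pure-component spectra with one Gaussian peak each.
rng = np.random.default_rng(0)
axis = np.linspace(0.0, 1.0, 200)
pure = np.stack([
    np.exp(-0.5 * ((axis - 0.3) / 0.03) ** 2),  # "substrate" peak
    np.exp(-0.5 * ((axis - 0.7) / 0.03) ** 2),  # "biomass" peak
])

# A batch-like trajectory: substrate falls exactly as biomass rises,
# so the two labels are perfectly anti-correlated.
t = np.linspace(0.0, 1.0, 50)
labels = np.column_stack([1.0 - t, t])
spectra = labels @ pure

aug_spectra, aug_labels = augment_additive(spectra, labels, 2000, rng)

corr_before = np.corrcoef(labels.T)[0, 1]
corr_after = np.corrcoef(aug_labels.T)[0, 1]
print(f"label correlation before: {corr_before:.3f}, after: {corr_after:.3f}")
```

Because the two weights are drawn independently, the augmented labels no longer lie on the original substrate-biomass trajectory, and their correlation is markedly weaker than in the source data. Real spectra also contain noise, baselines, and possibly non-additive effects, which this sketch ignores.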
Related papers
- Assessing Neural Network Representations During Training Using
Noise-Resilient Diffusion Spectral Entropy [55.014926694758195]
Entropy and mutual information in neural networks provide rich information on the learning process.
We leverage data geometry to access the underlying manifold and reliably compute these information-theoretic measures.
We show that they form noise-resistant measures of intrinsic dimensionality and relationship strength in high-dimensional simulated data.
arXiv Detail & Related papers (2023-12-04T01:32:42Z)
- Hodge-Aware Contrastive Learning [101.56637264703058]
Simplicial complexes prove effective in modeling data with multiway dependencies.
We develop a contrastive self-supervised learning approach for processing simplicial data.
arXiv Detail & Related papers (2023-09-14T00:40:07Z)
- Synthetic Augmentation with Large-scale Unconditional Pre-training [4.162192894410251]
We propose a synthetic augmentation method called HistoDiffusion to reduce the dependency on annotated data.
HistoDiffusion can be pre-trained on large-scale unlabeled datasets and later applied to a small-scale labeled dataset for augmented training.
We evaluate our proposed method by pre-training on three histopathology datasets and testing on a histopathology dataset of colorectal cancer (CRC) excluded from the pre-training datasets.
arXiv Detail & Related papers (2023-08-08T03:34:04Z)
- Fast and Functional Structured Data Generators Rooted in
Out-of-Equilibrium Physics [62.997667081978825]
We address the challenge of using energy-based models to produce high-quality, label-specific data in structured datasets.
Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing.
We use a novel training algorithm that exploits non-equilibrium effects.
arXiv Detail & Related papers (2023-07-13T15:08:44Z)
- On the Interplay of Subset Selection and Informed Graph Neural Networks [3.091456764812509]
This work focuses on predicting molecular atomization energies in the QM9 dataset.
We show how maximizing molecular diversity in the training set selection process increases the robustness of linear and nonlinear regression techniques.
We also check the reliability of the predictions made by the graph neural network with a model-agnostic explainer.
arXiv Detail & Related papers (2023-06-15T09:09:27Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- A Federated Learning-based Industrial Health Prognostics for
Heterogeneous Edge Devices using Matched Feature Extraction [16.337207503536384]
We propose a pioneering FL-based health prognostic model with a feature similarity-matched parameter aggregation algorithm.
We show that the proposed method yields accuracy improvements as high as 44.5% and 39.3% for state-of-health estimation and remaining useful life estimation.
arXiv Detail & Related papers (2023-05-13T07:20:31Z)
- Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
- Trustworthiness of Laser-Induced Breakdown Spectroscopy Predictions via
Simulation-based Synthetic Data Augmentation and Multitask Learning [4.633997895806144]
We consider quantitative analyses of spectral data using laser-induced breakdown spectroscopy.
We address the small size of training data available, and the validation of the predictions during inference on unknown data.
arXiv Detail & Related papers (2022-10-07T18:00:09Z)
- Cycle-StarNet: Bridging the gap between theory and data by leveraging
large datasets [0.0]
Current automated methods for analyzing spectra are either (a) data-driven, which requires prior knowledge of stellar parameters and elemental abundances, or (b) based on theoretical synthetic models that are susceptible to the gap between theory and practice.
We present a hybrid generative domain adaptation method that turns simulated stellar spectra into realistic spectra by applying unsupervised learning to large spectroscopic surveys.
arXiv Detail & Related papers (2020-07-06T23:06:58Z)
- A Systematic Approach to Featurization for Cancer Drug Sensitivity
Predictions with Deep Learning [49.86828302591469]
We train >35,000 neural network models, sweeping over common featurization techniques.
We found the RNA-seq to be highly redundant and informative even with subsets larger than 128 features.
arXiv Detail & Related papers (2020-04-30T20:42:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.