The Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning
- URL: http://arxiv.org/abs/2506.07619v1
- Date: Mon, 09 Jun 2025 10:34:14 GMT
- Title: The Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning
- Authors: Toby Boyne, Juan S. Campos, Becky D. Langdon, Jixiang Qing, Yilin Xie, Shiqiang Zhang, Calvin Tsay, Ruth Misener, Daniel W. Davies, Kim E. Jelfs, Sarah Boyall, Thomas M. Dixon, Linden Schrecker, Jose Pablo Folch,
- Abstract summary: We introduce a novel dataset for yield prediction, providing the first-ever transient flow dataset for machine learning benchmarking.<n>While previous datasets focus on discrete parameters, our experimental set-up allow us to sample a large number of continuous process conditions.<n>We focus on solvent selection, a task that is particularly difficult to model theoretically and therefore ripe for machine learning applications.
- Score: 4.864188241160383
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning has promised to change the landscape of laboratory chemistry, with impressive results in molecular property prediction and reaction retro-synthesis. However, chemical datasets are often inaccessible to the machine learning community as they tend to require cleaning, thorough understanding of the chemistry, or are simply not available. In this paper, we introduce a novel dataset for yield prediction, providing the first-ever transient flow dataset for machine learning benchmarking, covering over 1200 process conditions. While previous datasets focus on discrete parameters, our experimental set-up allow us to sample a large number of continuous process conditions, generating new challenges for machine learning models. We focus on solvent selection, a task that is particularly difficult to model theoretically and therefore ripe for machine learning applications. We showcase benchmarking for regression algorithms, transfer-learning approaches, feature engineering, and active learning, with important applications towards solvent replacement and sustainable manufacturing.
Related papers
- ChemActor: Enhancing Automated Extraction of Chemical Synthesis Actions with LLM-Generated Data [53.78763789036172]
We present ChemActor, a fully fine-tuned large language model (LLM) as a chemical executor to convert between unstructured experimental procedures and structured action sequences.<n>This framework integrates a data selection module that selects data based on distribution divergence, with a general-purpose LLM, to generate machine-executable actions from a single molecule input.<n>Experiments on reaction-to-description (R2D) and description-to-action (D2A) tasks demonstrate that ChemActor achieves state-of-the-art performance, outperforming the baseline model by 10%.
arXiv Detail & Related papers (2025-06-30T05:11:19Z) - CleanSurvival: Automated data preprocessing for time-to-event models using reinforcement learning [0.0]
Data preprocessing is a critical yet frequently neglected aspect of machine learning.<n>CleanSurvival is a reinforcement-learning-based solution for optimizing preprocessing pipelines.<n>It can handle continuous and categorical variables, using Q-learning to select which combination of data imputation, outlier detection and feature extraction techniques achieves optimal performance.
arXiv Detail & Related papers (2025-02-06T10:33:37Z) - Machine learning in wastewater treatment: insights from modelling a pilot denitrification reactor [0.0]
We use data from a pilot reactor at the Veas treatment facility in Norway to explore how machine learning can be used to optimize biological nitrate.<n>Rather than focusing solely on predictive accuracy, our approach prioritizes understanding the foundational requirements for effective data-driven modelling.
arXiv Detail & Related papers (2024-12-18T16:49:23Z) - Intelligent Chemical Purification Technique Based on Machine Learning [5.023197681500998]
We present an innovative of artificial intelligence with column chromatography, aiming to resolve inefficiencies and standardize data collection in chemical separation and purification domain.
By developing an automated platform for precise data acquisition and employing advanced machine learning algorithms, we constructed predictive models to forecast key separation parameters.
A novel metric, separation probability ($S_p$), quantifies the likelihood of effective compound separation, validated through experimental verification.
arXiv Detail & Related papers (2024-04-14T01:44:58Z) - Retrosynthesis prediction enhanced by in-silico reaction data
augmentation [66.5643280109899]
We present RetroWISE, a framework that employs a base model inferred from real paired data to perform in-silico reaction generation and augmentation.
On three benchmark datasets, RetroWISE achieves the best overall performance against state-of-the-art models.
arXiv Detail & Related papers (2024-01-31T07:40:37Z) - Advancing Reacting Flow Simulations with Data-Driven Models [50.9598607067535]
Key to effective use of machine learning tools in multi-physics problems is to couple them to physical and computer models.
The present chapter reviews some of the open opportunities for the application of data-driven reduced-order modeling of combustion systems.
arXiv Detail & Related papers (2022-09-05T16:48:34Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z) - Automated Machine Learning Techniques for Data Streams [91.3755431537592]
This paper surveys the state-of-the-art open-source AutoML tools, applies them to data collected from streams, and measures how their performance changes over time.
The results show that off-the-shelf AutoML tools can provide satisfactory results but in the presence of concept drift, detection or adaptation techniques have to be applied to maintain the predictive accuracy over time.
arXiv Detail & Related papers (2021-06-14T11:42:46Z) - Unassisted Noise Reduction of Chemical Reaction Data Sets [59.127921057012564]
We propose a machine learning-based, unassisted approach to remove chemically wrong entries from data sets.
Our results show an improved prediction quality for models trained on the cleaned and balanced data sets.
arXiv Detail & Related papers (2021-02-02T09:34:34Z) - Towards an Automatic Analysis of CHO-K1 Suspension Growth in
Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel Machine Learning architecture, which allows us to infuse a neural deep network with human-powered abstraction on the level of data.
Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z) - Predicting Chemical Properties using Self-Attention Multi-task Learning
based on SMILES Representation [0.0]
In this study, we explore the structural differences of the transformer-variant model and proposed a new self-attention based model.
The representation learning performance of the self-attention module was evaluated in a multi-task learning environment using imbalanced chemical datasets.
arXiv Detail & Related papers (2020-10-19T09:46:50Z) - Towards the Automation of a Chemical Sulphonation Process with Machine
Learning [0.0]
This paper presents the results of applying machine learning methods during a chemical sulphonation process.
We used data from process parameters to train different models including Random Forest, Neural Network and linear regression.
Our experiments showed that it is possible to predict those product quality values with good accuracy, thus, having the potential to reduce time.
arXiv Detail & Related papers (2020-09-25T10:56:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.