Related papers: Machine Learning Methods for Small Data and Upstream Bioprocessing Applications: A Comprehensive Review

URL: http://arxiv.org/abs/2506.12322v2
Date: Fri, 20 Jun 2025 12:36:26 GMT
Title: Machine Learning Methods for Small Data and Upstream Bioprocessing Applications: A Comprehensive Review
Authors: Johnny Peng, Thanh Tung Khuat, Katarzyna Musial, Bogdan Gabrys
Abstract summary: Data is crucial for machine learning (ML) applications, yet acquiring large datasets can be costly and time-consuming. This review explores ML methods designed to address the challenges posed by small data and classifies them into a taxonomy to guide practical applications. By analysing how these methods tackle small data challenges from different perspectives, this review provides actionable insights.
Score: 13.205760966688619
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Data is crucial for machine learning (ML) applications, yet acquiring large datasets can be costly and time-consuming, especially in complex, resource-intensive fields like biopharmaceuticals. A key process in this industry is upstream bioprocessing, where living cells are cultivated and optimised to produce therapeutic proteins and biologics. The intricate nature of these processes, combined with high resource demands, often limits data collection, resulting in smaller datasets. This comprehensive review explores ML methods designed to address the challenges posed by small data and classifies them into a taxonomy to guide practical applications. Furthermore, each method in the taxonomy was thoroughly analysed, with a detailed discussion of its core concepts and an evaluation of its effectiveness in tackling small data challenges, as demonstrated by application results in the upstream bioprocessing and other related domains. By analysing how these methods tackle small data challenges from different perspectives, this review provides actionable insights, identifies current research gaps, and offers guidance for leveraging ML in data-constrained environments.

Related papers

Improving the Generation and Evaluation of Synthetic Data for Downstream Medical Causal Inference [89.5628648718851]
Causal inference is essential for developing and evaluating medical interventions. Real-world medical datasets are often difficult to access due to regulatory barriers. We present STEAM: a novel method for generating Synthetic data for Treatment Effect Analysis in Medicine.
arXiv Detail & Related papers (2025-10-21T16:16:00Z)
Biological Sequence with Language Model Prompting: A Survey [14.270959261105968]
Large language models (LLMs) have emerged as powerful tools for addressing challenges across diverse domains. This paper systematically investigates the application of prompt-based methods with LLMs to biological sequences.
arXiv Detail & Related papers (2025-03-06T06:28:36Z)
Knowledge Hierarchy Guided Biological-Medical Dataset Distillation for Domain LLM Training [10.701353329227722]
We propose a framework that automates the distillation of high-quality textual training data from the extensive scientific literature. Our approach self-evaluates and generates questions that are more closely aligned with the biomedical domain. Our approach substantially improves question-answering tasks compared to pre-trained models from the life sciences domain.
arXiv Detail & Related papers (2025-01-25T07:20:44Z)
An Evaluation of Large Language Models in Bioinformatics Research [52.100233156012756]
We study the performance of large language models (LLMs) on a wide spectrum of crucial bioinformatics tasks. These tasks include the identification of potential coding regions, extraction of named entities for genes and proteins, detection of antimicrobial and anti-cancer peptides, molecular optimization, and resolution of educational bioinformatics problems. Our findings indicate that, given appropriate prompts, LLMs like GPT variants can successfully handle most of these tasks.
arXiv Detail & Related papers (2024-02-21T11:27:31Z)
Into the Single Cell Multiverse: an End-to-End Dataset for Procedural Knowledge Extraction in Biomedical Texts [2.2578044590557553]
FlaMBé is a collection of expert-curated datasets that capture procedural knowledge in biomedical texts. The dataset is inspired by the observation that one ubiquitous source of procedural knowledge that is described as unstructured text is within academic papers describing their methodology.
arXiv Detail & Related papers (2023-09-04T21:02:36Z)
A Comprehensive Survey of Dataset Distillation [73.15482472726555]
It has become challenging to handle the unlimited growth of data with limited computing power. Deep learning technology has developed unprecedentedly in the last decade. This paper provides a holistic understanding of dataset distillation from multiple aspects.
arXiv Detail & Related papers (2023-01-13T15:11:38Z)
Multi-fidelity Gaussian Process for Biomanufacturing Process Modeling with Small Data [1.4687789417816917]
We propose to use a statistical machine learning approach, multi-fidelity Gaussian process, for process modelling in biomanufacturing. We apply the multi-fidelity Gaussian process to solve two significant problems in biomanufacturing, bioreactor scale-up and knowledge transfer across cell lines, and demonstrate its efficacy on real-world datasets.
arXiv Detail & Related papers (2022-11-26T06:38:34Z)
Machine learning in bioprocess development: From promise to practice [58.720142291102135]
Data-driven methods like machine learning (ML) approaches have a high potential to rationally explore large design spaces. The aim of this review is to demonstrate how ML methods have been applied so far in bioprocess development.
arXiv Detail & Related papers (2022-10-04T13:48:59Z)
Deep neural networks approach to microbial colony detection -- a comparative analysis [52.77024349608834]
This study investigates the performance of three deep learning approaches for object detection on the AGAR dataset. The achieved results may serve as a benchmark for future experiments.
arXiv Detail & Related papers (2021-08-23T12:06:00Z)
Towards an Automatic Analysis of CHO-K1 Suspension Growth in Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel Machine Learning architecture, which allows us to infuse a neural deep network with human-powered abstraction on the level of data. Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z)
Machine Learning in Nano-Scale Biomedical Engineering [77.75587007080894]
We review the existing research regarding the use of machine learning in nano-scale biomedical engineering. The main challenges that can be formulated as ML problems are classified into three main categories. For each of the presented methodologies, special emphasis is given to its principles, applications, and limitations.
arXiv Detail & Related papers (2020-08-05T15:45:54Z)

This list is automatically generated from the titles and abstracts of the papers on this site.