Preventing dataset shift from breaking machine-learning biomarkers
- URL: http://arxiv.org/abs/2107.09947v1
- Date: Wed, 21 Jul 2021 08:54:23 GMT
- Title: Preventing dataset shift from breaking machine-learning biomarkers
- Authors: Jérôme Dockès, Gaël Varoquaux (PARIETAL), Jean-Baptiste Poline
- Abstract summary: A good biomarker is one that gives reliable detection of the corresponding condition.
Biomarkers are often extracted from a cohort that differs from the target population.
Such a mismatch, known as a dataset shift, can undermine the application of the biomarker to new individuals.
- Score: 0.6138671548064355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning brings the hope of finding new biomarkers extracted from
cohorts with rich biomedical measurements. A good biomarker is one that gives
reliable detection of the corresponding condition. However, biomarkers are
often extracted from a cohort that differs from the target population. Such a
mismatch, known as a dataset shift, can undermine the application of the
biomarker to new individuals. Dataset shifts are frequent in biomedical
research, e.g. because of recruitment biases. When a dataset shift occurs,
standard machine-learning techniques do not suffice to extract and validate
biomarkers. This article provides an overview of when and how dataset shifts
break machine-learning extracted biomarkers, as well as detection and
correction strategies.
Related papers
- Revolutionizing Biomarker Discovery: Leveraging Generative AI for Bio-Knowledge-Embedded Continuous Space Exploration [20.419747013569268]
We propose a new biomarker identification framework with two important modules: training data preparation and embedding-optimization-generation.
The first module uses a multi-agent system to automatically collect pairs of biomarker subsets and their corresponding prediction accuracy as training data.
The second module employs an encoder-evaluator-decoder learning paradigm to compress the knowledge of the collected data into a continuous space.
arXiv Detail & Related papers (2024-09-23T23:36:30Z) - BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments [112.25067497985447]
We introduce BioDiscoveryAgent, an agent that designs new experiments, reasons about their outcomes, and efficiently navigates the hypothesis space to reach desired solutions.
BioDiscoveryAgent can uniquely design new experiments without the need to train a machine learning model.
It achieves an average of 21% improvement in predicting relevant genetic perturbations across six datasets.
arXiv Detail & Related papers (2024-05-27T19:57:17Z) - Machine Learning Driven Biomarker Selection for Medical Diagnosis [1.10252115875756]
Recent advances in experimental methods have enabled researchers to collect data on thousands of analytes simultaneously.
This has led to correlational studies that associated molecular measurements with diseases such as Alzheimer's, Liver, and Gastric Cancer.
The use of thousands of biomarkers selected from the analytes is not practical for real-world medical diagnosis and is likely undesirable due to potentially formed spurious correlations.
arXiv Detail & Related papers (2024-05-16T01:30:47Z) - BioAug: Conditional Generation based Data Augmentation for Low-Resource
Biomedical NER [52.79573512427998]
We present BioAug, a novel data augmentation framework for low-resource BioNER.
BioAug is trained to solve a novel text reconstruction task based on selective masking and knowledge augmentation.
We demonstrate the effectiveness of BioAug on 5 benchmark BioNER datasets.
arXiv Detail & Related papers (2023-05-18T02:04:38Z) - Regression-based Deep-Learning predicts molecular biomarkers from
pathology slides [40.24757332810004]
We developed and evaluated a new self-supervised attention-based weakly supervised regression method that predicts continuous biomarkers directly from images.
Using regression significantly enhances the accuracy of biomarker prediction, while also improving the interpretability of the results over classification.
Our open-source regression approach offers a promising alternative for continuous biomarker analysis in computational pathology.
arXiv Detail & Related papers (2023-04-11T11:43:51Z) - Clinical Contrastive Learning for Biomarker Detection [15.510581400494207]
We exploit the relationship between clinical and biomarker data to improve performance for biomarker classification.
This is accomplished by leveraging the larger amount of clinical data as pseudo-labels for our data without biomarker labels.
Our method is shown to outperform state-of-the-art self-supervised methods by as much as 5% in terms of accuracy on individual biomarker detection.
arXiv Detail & Related papers (2022-11-09T18:29:56Z) - BioGPT: Generative Pre-trained Transformer for Biomedical Text
Generation and Mining [140.61707108174247]
We propose BioGPT, a domain-specific generative Transformer language model pre-trained on large scale biomedical literature.
We get 44.98%, 38.42% and 40.76% F1 score on BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks respectively, and 78.2% accuracy on PubMedQA.
arXiv Detail & Related papers (2022-10-19T07:17:39Z) - Label scarcity in biomedicine: Data-rich latent factor discovery
enhances phenotype prediction [102.23901690661916]
Low-dimensional embedding spaces can be derived from the UK Biobank population dataset to enhance data-scarce prediction of health indicators, lifestyle and demographic characteristics.
Performance gains from semi-supervised approaches will probably become an important ingredient for various medical data science applications.
arXiv Detail & Related papers (2021-10-12T16:25:50Z) - MIIDL: a Python package for microbial biomarkers identification powered
by interpretable deep learning [5.749346757892117]
We present MIIDL, a Python package for the identification of microbial biomarkers based on interpretable deep learning.
MIIDL innovatively applies convolutional neural networks, a variety of interpretability algorithms and plenty of pre-processing methods to provide a one-stop and robust pipeline for microbial biomarkers identification from high-dimensional and sparse data sets.
arXiv Detail & Related papers (2021-09-24T21:30:10Z) - G-MIND: An End-to-End Multimodal Imaging-Genetics Framework for
Biomarker Identification and Disease Classification [49.53651166356737]
We propose a novel deep neural network architecture to integrate imaging and genetics data, as guided by diagnosis, that provides interpretable biomarkers.
We have evaluated our model on a population study of schizophrenia that includes two functional MRI (fMRI) paradigms and Single Nucleotide Polymorphism (SNP) data.
arXiv Detail & Related papers (2021-01-27T19:28:04Z) - Automatic image-based identification and biomass estimation of
invertebrates [70.08255822611812]
Time-consuming sorting and identification of taxa pose strong limitations on how many insect samples can be processed.
We propose to replace the standard manual approach of human expert-based sorting and identification with an automatic image-based technology.
We use state-of-the-art ResNet-50 and InceptionV3 CNNs for the classification task.
arXiv Detail & Related papers (2020-02-05T21:38:57Z)