Inflation of test accuracy due to data leakage in deep learning-based
classification of OCT images
- URL: http://arxiv.org/abs/2202.12267v1
- Date: Mon, 21 Feb 2022 14:08:42 GMT
- Authors: Iulian Emil Tampu, Anders Eklund and Neda Haj-Hosseini
- Abstract summary: In this study, the effect of improper dataset splitting on model evaluation is demonstrated for two classification tasks.
Our results show that the classification accuracy is inflated by 3.9 to 26 percentage units for models tested on a dataset with improper splitting.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the application of deep learning on optical coherence tomography (OCT)
data, it is common to train classification networks using 2D images originating
from volumetric data. Given the micrometer resolution of OCT systems,
consecutive images are often very similar in both visible structures and noise.
Thus, an inappropriate data split can result in overlap between the training
and testing sets, with a large portion of the literature overlooking this
aspect. In this study, the effect of improper dataset splitting on model
evaluation is demonstrated for two classification tasks using two OCT
open-access datasets extensively used in the literature, Kermany's
ophthalmology dataset and AIIMS breast tissue dataset. Our results show that
the classification accuracy is inflated by 3.9 to 26 percentage units for
models tested on a dataset with improper splitting, highlighting the
considerable effect of dataset handling on model evaluation. This study intends
to raise awareness of the importance of dataset splitting for research on deep
learning using OCT data and volumetric data in general.
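The leakage described above can be avoided by splitting at the volume (subject) level rather than the slice level. A minimal stdlib-only sketch of such a split, with hypothetical function and variable names (not code from the paper):

```python
import random

def subject_wise_split(subject_ids, test_fraction=0.2, seed=0):
    """Assign whole subjects/volumes to train or test, so that
    near-identical consecutive 2D slices from the same OCT volume
    can never appear on both sides of the split."""
    subjects = sorted(set(subject_ids))
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, int(len(subjects) * test_fraction))
    test_subjects = set(subjects[:n_test])
    train_idx = [i for i, s in enumerate(subject_ids) if s not in test_subjects]
    test_idx = [i for i, s in enumerate(subject_ids) if s in test_subjects]
    return train_idx, test_idx
```

Shuffling and splitting the list of 2D slices directly, ignoring which volume each slice came from, is the improper split whose accuracy inflation the paper measures.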
Related papers
- Few-shot learning for COVID-19 Chest X-Ray Classification with Imbalanced Data: An Inter vs. Intra Domain Study
Medical image datasets are essential for training models used in computer-aided diagnosis, treatment planning, and medical research.
Some challenges are associated with these datasets, including variability in data distribution, data scarcity, and transfer learning issues when using models pre-trained from generic images.
We propose a methodology based on Siamese neural networks in which a series of techniques are integrated to mitigate the effects of data scarcity and distribution imbalance.
arXiv Detail & Related papers (2024-01-18T16:59:27Z)
- Defect Classification in Additive Manufacturing Using CNN-Based Vision Processing
This paper examines two scenarios: first, using convolutional neural networks (CNNs) to accurately classify defects in an image dataset from AM and second, applying active learning techniques to the developed classification model.
This allows a human-in-the-loop mechanism to be constructed, reducing the amount of data required for training and for generating training data.
arXiv Detail & Related papers (2023-07-14T14:36:58Z)
- Linking data separation, visual separation, and classifier performance using pseudo-labeling by contrastive learning
We argue that the performance of the final classifier depends on the data separation present in the latent space and visual separation present in the projection.
We demonstrate our results by the classification of five real-world challenging image datasets of human intestinal parasites with only 1% supervised samples.
arXiv Detail & Related papers (2023-02-06T10:01:38Z)
- Self-supervised Model Based on Masked Autoencoders Advance CT Scans Classification
This paper is inspired by the self-supervised learning algorithm MAE.
It uses an MAE model pre-trained on ImageNet to perform transfer learning on a CT scan dataset.
This method improves the generalization performance of the model and avoids the risk of overfitting on small datasets.
arXiv Detail & Related papers (2022-10-11T00:52:05Z)
- SD-LayerNet: Semi-supervised retinal layer segmentation in OCT using disentangled representation with anatomical priors
We introduce a semi-supervised paradigm into the retinal layer segmentation task.
In particular, a novel fully differentiable approach is used for converting surface position regression into a pixel-wise structured segmentation.
In parallel, we propose a set of anatomical priors to improve network training when a limited amount of labeled data is available.
arXiv Detail & Related papers (2022-07-01T14:30:59Z)
- Reducing Labelled Data Requirement for Pneumonia Segmentation using Image Augmentations
We investigate the effect of image augmentations on reducing the requirement of labelled data in semantic segmentation of chest X-rays for pneumonia detection.
We train fully convolutional network models on subsets of different sizes from the total training data.
We find that rotate and mixup are the best of the augmentations tested (rotate, mixup, translate, gamma and horizontal flip), reducing the labelled data requirement by 70%.
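As a rough illustration (not the authors' implementation), mixup blends pairs of images and their labels using a Beta-distributed mixing weight:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Convexly combine two samples and their (one-hot) labels."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)       # mixing weight in (0, 1)
    x = lam * x1 + (1.0 - lam) * x2    # blended image
    y = lam * y1 + (1.0 - lam) * y2    # soft label
    return x, y
```

The blended labels act as soft targets, which is why mixup can stretch a small labelled set further than purely geometric augmentations.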
arXiv Detail & Related papers (2021-02-25T10:11:30Z)
- The Deep Radial Basis Function Data Descriptor (D-RBFDD) Network: A One-Class Neural Network for Anomaly Detection
Anomaly detection is a challenging problem in machine learning.
The Radial Basis Function Data Descriptor (RBFDD) network is an effective solution for anomaly detection.
This paper investigates approaches to modifying the RBFDD network to transform it into a deep one-class classifier.
arXiv Detail & Related papers (2021-01-29T15:15:17Z)
- Fader Networks for domain adaptation on fMRI: ABIDE-II study
We use 3D convolutional autoencoders to build a domain-irrelevant latent space image representation and demonstrate that this method outperforms existing approaches on ABIDE data.
arXiv Detail & Related papers (2020-10-14T16:50:50Z)
- Deep Mining External Imperfect Data for Chest X-ray Disease Screening
We argue that incorporating an external CXR dataset leads to imperfect training data, which raises the challenges.
We formulate the multi-label disease classification problem as weighted independent binary tasks according to the categories.
Our framework simultaneously models and tackles the domain and label discrepancies, enabling superior knowledge mining ability.
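A hedged sketch of what "weighted independent binary tasks" could look like (the paper's actual weighting scheme may differ): each disease category contributes its own binary cross-entropy term, scaled by a per-class weight:

```python
import numpy as np

def weighted_multilabel_bce(probs, targets, class_weights, eps=1e-7):
    """Per-class binary cross-entropy, weighted and averaged.
    probs, targets: (batch, n_classes); class_weights: (n_classes,)."""
    probs = np.clip(probs, eps, 1.0 - eps)
    per_class = -(targets * np.log(probs)
                  + (1.0 - targets) * np.log(1.0 - probs))
    return float(np.mean(per_class * class_weights))
```

Treating the categories independently is what lets per-class weights absorb label discrepancies between the internal and external CXR datasets.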
arXiv Detail & Related papers (2020-06-06T06:48:40Z)
- Data Consistent CT Reconstruction from Insufficient Data with Learned Prior Images
We investigate the robustness of deep learning in CT image reconstruction by showing false negative and false positive lesion cases.
We propose a data consistent reconstruction (DCR) method to improve their image quality, which combines the advantages of compressed sensing and deep learning.
The efficacy of the proposed method is demonstrated in cone-beam CT with truncated data, limited-angle data and sparse-view data, respectively.
arXiv Detail & Related papers (2020-05-20T13:30:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.