The Pitfalls of Sample Selection: A Case Study on Lung Nodule
Classification
- URL: http://arxiv.org/abs/2108.05386v1
- Date: Wed, 11 Aug 2021 18:07:07 GMT
- Title: The Pitfalls of Sample Selection: A Case Study on Lung Nodule
Classification
- Authors: Vasileios Baltatzis, Kyriaki-Margarita Bintsi, Loic Le Folgoc, Octavio
E. Martinez Manzanera, Sam Ellis, Arjun Nair, Sujal Desai, Ben Glocker, Julia
A. Schnabel
- Abstract summary: In lung nodule classification, many works report results on the publicly available LIDC dataset. In theory, this should allow a direct comparison of the performance of proposed methods and assess the impact of individual contributions.
Analyzing seven recent works, we find that each employs a different data selection process, leading to widely varying total numbers of samples and ratios between benign and malignant cases.
We show that specific choices can have a severe impact on the data distribution, such that it may be possible to achieve superior performance on one sample distribution but not on another.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Using publicly available data to determine the performance of methodological
contributions is important as it facilitates reproducibility and allows
scrutiny of the published results. In lung nodule classification, for example,
many works report results on the publicly available LIDC dataset. In theory,
this should allow a direct comparison of the performance of proposed methods
and assess the impact of individual contributions. When analyzing seven recent
works, however, we find that each employs a different data selection process,
leading to widely varying total numbers of samples and ratios between benign
and malignant cases. As each subset will have different characteristics with
varying difficulty for classification, a direct comparison between the proposed
methods is thus not always possible, nor fair. We study the particular effect
of truthing when aggregating labels from multiple experts. We show that
specific choices can have a severe impact on the data distribution, such that it
may be possible to achieve superior performance on one sample distribution but not
on another. While we show that we can further improve on the state-of-the-art
on one sample selection, we also find that on a more challenging sample
selection, on the same database, the more advanced models underperform with
respect to very simple baseline methods, highlighting that the selected data
distribution may play an even more important role than the model architecture.
This raises concerns about the validity of claimed methodological
contributions. We believe the community should be aware of these pitfalls and
make recommendations on how these can be avoided in future work.
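As a concrete illustration of the truthing step discussed above: in LIDC, each nodule is rated for malignancy on a 1-5 scale by up to four radiologists, and a common truthing choice is to threshold the median rating, optionally excluding nodules with an indeterminate median of 3. The sketch below is illustrative only (the function name and defaults are assumptions, not the paper's exact protocol), but it shows how one such choice silently drops samples and shifts the class distribution:

```python
from statistics import median

def truth_label(ratings, exclude_indeterminate=True):
    """Aggregate per-expert malignancy ratings (1-5) into one label.

    Returns "benign", "malignant", or None (excluded) based on the median
    rating. Excluding the indeterminate median of 3 is one of the sample
    selection choices that shrinks the dataset and shifts its distribution.
    """
    m = median(ratings)
    if m < 3:
        return "benign"
    if m > 3:
        return "malignant"
    return None if exclude_indeterminate else "indeterminate"

# The same nodule is kept or dropped depending on the truthing rule:
print(truth_label([2, 3, 3, 4]))         # median 3.0 -> None (excluded)
print(truth_label([2, 3, 3, 4], False))  # -> "indeterminate" (kept)
print(truth_label([4, 5, 3, 4]))         # median 4.0 -> "malignant"
```

Two papers applying different rules here end up evaluating on different sample distributions, which is exactly the comparability problem the abstract describes.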
Related papers
- Detecting and Identifying Selection Structure in Sequential Data [53.24493902162797]
We argue that the selective inclusion of data points based on latent objectives is common in practical situations, such as music sequences.
We show that selection structure is identifiable without any parametric assumptions or interventional experiments.
We also propose a provably correct algorithm to detect and identify selection structures as well as other types of dependencies.
arXiv Detail & Related papers (2024-06-29T20:56:34Z)
- In Search of Insights, Not Magic Bullets: Towards Demystification of the Model Selection Dilemma in Heterogeneous Treatment Effect Estimation [92.51773744318119]
This paper empirically investigates the strengths and weaknesses of different model selection criteria.
We highlight that there is a complex interplay between selection strategies, candidate estimators and the data used for comparing them.
arXiv Detail & Related papers (2023-02-06T16:55:37Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Undersmoothing Causal Estimators with Generative Trees [0.0]
Inferring individualised treatment effects from observational data can unlock the potential for targeted interventions.
It is, however, hard to infer these effects from observational data.
In this paper, we explore a novel generative tree based approach that tackles model misspecification directly.
arXiv Detail & Related papers (2022-03-16T11:59:38Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
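The balancing strategy described above can be sketched minimally as random oversampling, i.e. duplicating minority-class examples until class counts match. This is an illustrative stand-alone sketch (the function name is an assumption), not the specific method of the paper:

```python
import random

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class examples at random until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for y, group in by_class.items():
        # Top the group up to the majority-class count with random duplicates.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for s in group + extra:
            out_samples.append(s)
            out_labels.append(y)
    return out_samples, out_labels

X, y = random_oversample(["a", "b", "c", "d", "e"], [0, 0, 0, 0, 1])
print(y.count(0), y.count(1))  # 4 4
```

Undersampling is the mirror image: shrink each class to the minority-class count instead of growing to the majority's.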
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- SelectAugment: Hierarchical Deterministic Sample Selection for Data Augmentation [72.58308581812149]
We propose an effective approach, dubbed SelectAugment, to select samples to be augmented in a deterministic and online manner.
Specifically, in each batch, we first determine the augmentation ratio, and then decide whether to augment each training sample under this ratio.
In this way, the negative effects of the randomness in selecting samples to augment can be effectively alleviated and the effectiveness of DA is improved.
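The two-step scheme above (fix a batch-level augmentation ratio, then decide per sample) can be sketched as follows. This is a simplified stand-in, assuming a per-sample loss as the deterministic score; SelectAugment's actual learned policies are more involved:

```python
def select_for_augmentation(batch_losses, ratio):
    """Deterministically pick which samples in a batch to augment.

    Given a batch-level augmentation ratio, augment the top-k samples
    ranked by per-sample loss (a stand-in score); a fixed rule replaces
    the usual random per-sample coin flip.
    """
    k = round(ratio * len(batch_losses))
    ranked = sorted(range(len(batch_losses)),
                    key=lambda i: batch_losses[i], reverse=True)
    return sorted(ranked[:k])

# With a 50% ratio, the two highest-loss samples out of four are selected:
print(select_for_augmentation([0.2, 1.5, 0.7, 0.1], 0.5))  # [1, 2]
```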
arXiv Detail & Related papers (2021-12-06T08:38:38Z)
- An Empirical Study on the Joint Impact of Feature Selection and Data Resampling on Imbalance Classification [4.506770920842088]
This study focuses on the synergy between feature selection and data resampling for imbalance classification.
We conduct a large amount of experiments on 52 publicly available datasets, using 9 feature selection methods, 6 resampling approaches for class imbalance learning, and 3 well-known classification algorithms.
arXiv Detail & Related papers (2021-09-01T06:01:51Z)
- Investigate the Essence of Long-Tailed Recognition from a Unified Perspective [11.080317683184363]
Deep recognition models often suffer from long-tailed data distributions due to heavily imbalanced sample numbers across categories.
In this work, we demonstrate that long-tailed recognition suffers from both sample number and category similarity.
arXiv Detail & Related papers (2021-07-08T11:08:40Z)
- Online Active Model Selection for Pre-trained Classifiers [72.84853880948894]
We design an online selective sampling approach that actively selects informative examples to label and outputs the best model with high probability at any round.
Our algorithm can be used for online prediction tasks for both adversarial and stochastic streams.
arXiv Detail & Related papers (2020-10-19T19:53:15Z)