The Pitfalls of Sample Selection: A Case Study on Lung Nodule
Classification
- URL: http://arxiv.org/abs/2108.05386v1
- Date: Wed, 11 Aug 2021 18:07:07 GMT
- Title: The Pitfalls of Sample Selection: A Case Study on Lung Nodule
Classification
- Authors: Vasileios Baltatzis, Kyriaki-Margarita Bintsi, Loic Le Folgoc, Octavio
E. Martinez Manzanera, Sam Ellis, Arjun Nair, Sujal Desai, Ben Glocker, Julia
A. Schnabel
- Abstract summary: In lung nodule classification, many works report results on the publicly available LIDC dataset. In theory, this should allow a direct comparison of the performance of proposed methods and assess the impact of individual contributions.
Analyzing seven recent works, we find that each employs a different data selection process, leading to widely varying total numbers of samples and ratios between benign and malignant cases.
We show that specific choices can have a severe impact on the data distribution, such that it may be possible to achieve superior performance on one sample distribution but not on another.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Using publicly available data to determine the performance of methodological
contributions is important as it facilitates reproducibility and allows
scrutiny of the published results. In lung nodule classification, for example,
many works report results on the publicly available LIDC dataset. In theory,
this should allow a direct comparison of the performance of proposed methods
and assess the impact of individual contributions. When analyzing seven recent
works, however, we find that each employs a different data selection process,
leading to widely varying total numbers of samples and ratios between benign
and malignant cases. As each subset will have different characteristics with
varying difficulty for classification, a direct comparison between the proposed
methods is thus not always possible, nor fair. We study the particular effect
of truthing when aggregating labels from multiple experts. We show that
specific choices can have a severe impact on the data distribution, such that it
may be possible to achieve superior performance on one sample distribution but not
on another. While we show that we can further improve on the state-of-the-art
on one sample selection, we also find that on a more challenging sample
selection, on the same database, the more advanced models underperform with
respect to very simple baseline methods, highlighting that the selected data
distribution may play an even more important role than the model architecture.
This raises concerns about the validity of claimed methodological
contributions. We believe the community should be aware of these pitfalls and
make recommendations on how these can be avoided in future work.
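As a concrete illustration of the truthing step discussed above: in LIDC, each nodule is rated for malignancy on a 1-5 scale by up to four radiologists, and a common truthing choice is to threshold the median rating, optionally excluding nodules with an indeterminate median of 3. The sketch below is illustrative only (the function name and defaults are assumptions, not the paper's exact protocol), but it shows how one such choice silently drops samples and shifts the class distribution:

```python
from statistics import median

def truth_label(ratings, exclude_indeterminate=True):
    """Aggregate per-expert malignancy ratings (1-5) into one label.

    Returns "benign", "malignant", or None (excluded) based on the median
    rating. Excluding the indeterminate median of 3 is one of the sample
    selection choices that shrinks the dataset and shifts its distribution.
    """
    m = median(ratings)
    if m < 3:
        return "benign"
    if m > 3:
        return "malignant"
    return None if exclude_indeterminate else "indeterminate"

# The same nodule is kept or dropped depending on the truthing rule:
print(truth_label([2, 3, 3, 4]))         # median 3.0 -> None (excluded)
print(truth_label([2, 3, 3, 4], False))  # -> "indeterminate" (kept)
print(truth_label([4, 5, 3, 4]))         # median 4.0 -> "malignant"
```

Two papers applying different rules here end up evaluating on different sample distributions, which is exactly the comparability problem the abstract describes.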
Related papers
- Detecting and Identifying Selection Structure in Sequential Data [53.24493902162797]
We argue that the selective inclusion of data points based on latent objectives is common in practical situations, such as music sequences.
We show that selection structure is identifiable without any parametric assumptions or interventional experiments.
We also propose a provably correct algorithm to detect and identify selection structures as well as other types of dependencies.
arXiv Detail & Related papers (2024-06-29T20:56:34Z)
- In Search of Insights, Not Magic Bullets: Towards Demystification of the Model Selection Dilemma in Heterogeneous Treatment Effect Estimation [92.51773744318119]
This paper empirically investigates the strengths and weaknesses of different model selection criteria.
We highlight that there is a complex interplay between selection strategies, candidate estimators and the data used for comparing them.
arXiv Detail & Related papers (2023-02-06T16:55:37Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Undersmoothing Causal Estimators with Generative Trees [0.0]
Inferring individualised treatment effects from observational data can unlock the potential for targeted interventions.
It is, however, hard to infer these effects from observational data.
In this paper, we explore a novel generative tree based approach that tackles model misspecification directly.
arXiv Detail & Related papers (2022-03-16T11:59:38Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
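The balancing strategy described above can be sketched minimally as random oversampling, i.e. duplicating minority-class examples until class counts match. This is an illustrative stand-alone sketch (the function name is an assumption), not the specific method of the paper:

```python
import random

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class examples at random until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for y, group in by_class.items():
        # Top the group up to the majority-class count with random duplicates.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for s in group + extra:
            out_samples.append(s)
            out_labels.append(y)
    return out_samples, out_labels

X, y = random_oversample(["a", "b", "c", "d", "e"], [0, 0, 0, 0, 1])
print(y.count(0), y.count(1))  # 4 4
```

Undersampling is the mirror image: shrink each class to the minority-class count instead of growing to the majority's.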
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- SelectAugment: Hierarchical Deterministic Sample Selection for Data Augmentation [72.58308581812149]
We propose an effective approach, dubbed SelectAugment, to select samples to be augmented in a deterministic and online manner.
Specifically, in each batch, we first determine the augmentation ratio, and then decide whether to augment each training sample under this ratio.
In this way, the negative effects of the randomness in selecting samples to augment can be effectively alleviated and the effectiveness of DA is improved.
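The two-step scheme above (fix a batch-level augmentation ratio, then decide per sample) can be sketched as follows. This is a simplified stand-in, assuming a per-sample loss as the deterministic score; SelectAugment's actual learned policies are more involved:

```python
def select_for_augmentation(batch_losses, ratio):
    """Deterministically pick which samples in a batch to augment.

    Given a batch-level augmentation ratio, augment the top-k samples
    ranked by per-sample loss (a stand-in score); a fixed rule replaces
    the usual random per-sample coin flip.
    """
    k = round(ratio * len(batch_losses))
    ranked = sorted(range(len(batch_losses)),
                    key=lambda i: batch_losses[i], reverse=True)
    return sorted(ranked[:k])

# With a 50% ratio, the two highest-loss samples out of four are selected:
print(select_for_augmentation([0.2, 1.5, 0.7, 0.1], 0.5))  # [1, 2]
```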
arXiv Detail & Related papers (2021-12-06T08:38:38Z)
- An Empirical Study on the Joint Impact of Feature Selection and Data Resampling on Imbalance Classification [4.506770920842088]
This study focuses on the synergy between feature selection and data resampling for imbalance classification.
We conduct a large amount of experiments on 52 publicly available datasets, using 9 feature selection methods, 6 resampling approaches for class imbalance learning, and 3 well-known classification algorithms.
arXiv Detail & Related papers (2021-09-01T06:01:51Z)
- Investigate the Essence of Long-Tailed Recognition from a Unified Perspective [11.080317683184363]
Deep recognition models often suffer from long-tailed data distributions due to heavily imbalanced sample numbers across categories.
In this work, we demonstrate that long-tailed recognition suffers from both sample number and category similarity.
arXiv Detail & Related papers (2021-07-08T11:08:40Z)
- Online Active Model Selection for Pre-trained Classifiers [72.84853880948894]
We design an online selective sampling approach that actively selects informative examples to label and outputs the best model with high probability at any round.
Our algorithm can be used for online prediction tasks for both adversarial and stochastic streams.
arXiv Detail & Related papers (2020-10-19T19:53:15Z)