When are Deep Networks really better than Random Forests at small sample sizes?
- URL: http://arxiv.org/abs/2108.13637v1
- Date: Tue, 31 Aug 2021 06:33:17 GMT
- Title: When are Deep Networks really better than Random Forests at small sample sizes?
- Authors: Haoyin Xu, Michael Ainsworth, Yu-Chung Peng, Madi Kusmanov, Sambit Panda, Joshua T. Vogelstein
- Abstract summary: Random forests (RF) and deep networks (DN) are two of the most popular machine learning methods in the current scientific literature.
We wish to further explore and establish the conditions and domains in which each approach excels.
Our focus is on datasets with at most 10,000 samples, which represent a large fraction of scientific and biomedical datasets.
- Score: 2.5556070792288934
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Random forests (RF) and deep networks (DN) are two of the most popular
machine learning methods in the current scientific literature and yield
differing levels of performance on different data modalities. We wish to
further explore and establish the conditions and domains in which each approach
excels, particularly in the context of sample size and feature dimension. To
address these issues, we tested the performance of these approaches across
tabular, image, and audio settings using varying model parameters and
architectures. Our focus is on datasets with at most 10,000 samples, which
represent a large fraction of scientific and biomedical datasets. In general,
we found RF to excel at tabular and structured data (image and audio) with
small sample sizes, whereas DN performed better on structured data with larger
sample sizes. Although we plan to continue updating this technical report in
the coming months, we believe the current preliminary results may be of
interest to others.
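The abstract's comparison is easy to reproduce in miniature. The sketch below is a rough stand-in, not the authors' benchmark code: it pits a random forest against a small multilayer-perceptron deep network on a synthetic tabular task at several training-set sizes up to the paper's 10,000-sample cap. The dataset, architectures, and sample-size grid are all illustrative assumptions.

```python
# Minimal sketch, not the paper's benchmark: RF vs. a small DN on tabular
# data at increasing sample sizes. All settings here are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for a small scientific tabular dataset.
X, y = make_classification(n_samples=12_000, n_features=40, n_informative=20,
                           n_classes=5, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=2_000,
                                                  random_state=0)

for n in [100, 500, 1_000, 5_000, 10_000]:
    rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
    dn = MLPClassifier(hidden_layer_sizes=(128, 128), max_iter=500,
                       random_state=0)
    rf.fit(X_pool[:n], y_pool[:n])
    dn.fit(X_pool[:n], y_pool[:n])
    print(f"n={n:>6}  RF acc={rf.score(X_test, y_test):.3f}  "
          f"DN acc={dn.score(X_test, y_test):.3f}")
```

On toy runs like this one, the reported pattern (RF ahead at the smallest sample sizes, with the gap narrowing as n grows) can be checked directly, though exact numbers depend on the task.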
Related papers
- Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the number of seed samples available for data augmentation is very small.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z)
- Data Augmentations in Deep Weight Spaces [89.45272760013928]
We introduce a novel augmentation scheme based on the Mixup method; standard Mixup is sketched after this list.
We evaluate the performance of these techniques on existing benchmarks as well as new benchmarks we generate.
arXiv Detail & Related papers (2023-11-15T10:43:13Z)
- Weight Predictor Network with Feature Selection for Small Sample Tabular Biomedical Data [7.923088041693465]
We propose Weight Predictor Network with Feature Selection for learning neural networks from high-dimensional and small sample data.
We evaluate on nine real-world biomedical datasets and demonstrate that WPFS outperforms other standard as well as more recent methods.
arXiv Detail & Related papers (2022-11-28T18:17:10Z)
- ScoreMix: A Scalable Augmentation Strategy for Training GANs with Limited Data [93.06336507035486]
Generative Adversarial Networks (GANs) typically suffer from overfitting when limited training data is available.
We present ScoreMix, a novel and scalable data augmentation approach for various image synthesis tasks.
arXiv Detail & Related papers (2022-10-27T02:55:15Z)
- A Data-Centric AI Paradigm Based on Application-Driven Fine-grained Dataset Design [2.2223262422197907]
We propose a novel paradigm for fine-grained design of datasets, driven by industrial applications.
We flexibly select positive and negative sample sets according to the essential features of the data and application requirements.
Compared with traditional data design methods, our method achieves better results and effectively reduces false alarms.
arXiv Detail & Related papers (2022-09-20T03:56:53Z)
- On the data requirements of probing [20.965328323152608]
We present a novel method to estimate the required number of data samples for probing datasets.
Our framework helps to systematically construct probing datasets to diagnose neural NLP models.
arXiv Detail & Related papers (2022-02-25T16:27:06Z)
- Multi-Domain Joint Training for Person Re-Identification [51.73921349603597]
Deep learning-based person Re-IDentification (ReID) often requires a large amount of training data to achieve good performance.
It appears that collecting more training data from diverse environments tends to improve ReID performance.
We propose an approach called Domain-Camera-Sample Dynamic network (DCSD) whose parameters can be adaptive to various factors.
arXiv Detail & Related papers (2022-01-06T09:20:59Z)
- Solving Mixed Integer Programs Using Neural Networks [57.683491412480635]
This paper applies learning to the two key sub-tasks of a MIP solver, generating a high-quality joint variable assignment, and bounding the gap in objective value between that assignment and an optimal one.
Our approach constructs two corresponding neural network-based components, Neural Diving and Neural Branching, to use in a base MIP solver such as SCIP.
We evaluate our approach on six diverse real-world datasets, including two Google production datasets and MIPLIB, by training separate neural networks on each.
arXiv Detail & Related papers (2020-12-23T09:33:11Z)
- Convolution Neural Networks for Semantic Segmentation: Application to Small Datasets of Biomedical Images [0.0]
This thesis studies how segmentation results produced by convolutional neural networks (CNNs) differ from one another when applied to small biomedical datasets.
Both working datasets come from the biomedical area of research.
arXiv Detail & Related papers (2020-11-01T19:09:12Z)
- A Close Look at Deep Learning with Small Data [0.0]
We show that model complexity is a critical factor when only a few samples per class are available.
We also show that even standard data augmentation can boost recognition performance by large margins.
arXiv Detail & Related papers (2020-03-28T17:11:29Z)
- NWPU-Crowd: A Large-Scale Benchmark for Crowd Counting and Localization [101.13851473792334]
We construct a large-scale congested crowd counting and localization dataset, NWPU-Crowd, consisting of 5,109 images with a total of 2,133,375 heads annotated with points and boxes.
Compared with other real-world datasets, it contains various illumination scenes and has the largest density range (0~20,033).
We describe the data characteristics, evaluate the performance of some mainstream state-of-the-art (SOTA) methods, and analyze the new problems that arise on the new data.
arXiv Detail & Related papers (2020-01-10T09:26:04Z)
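For readers unfamiliar with the base technique named in the "Data Augmentations in Deep Weight Spaces" entry above, standard Mixup trains on convex combinations of example pairs and their labels. The sketch below shows plain Mixup on a batch; it is not the paper's weight-space variant, and the shapes and Beta parameter are illustrative assumptions.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    """Plain Mixup (Zhang et al., 2018), not the paper's weight-space variant:
    blend each example and its one-hot label with a randomly chosen partner,
    using a mixing weight lam ~ Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))  # random partner for each example
    return (lam * x + (1.0 - lam) * x[perm],
            lam * y_onehot + (1.0 - lam) * y_onehot[perm])

# Illustrative usage: a batch of 32 examples, 40 features, 5 classes.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 40))
y = np.eye(5)[rng.integers(0, 5, size=32)]
x_mix, y_mix = mixup_batch(x, y, rng=rng)
```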
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.