Sub-Setting Algorithm for Training Data Selection in Pattern Recognition
- URL: http://arxiv.org/abs/2110.06527v1
- Date: Wed, 13 Oct 2021 06:42:41 GMT
- Title: Sub-Setting Algorithm for Training Data Selection in Pattern Recognition
- Authors: Gaurav Arwade and Sigurdur Olafsson
- Abstract summary: This paper proposes a training data selection algorithm that identifies multiple subsets with simple structures.
A sub-setting algorithm identifies multiple subsets with simple local patterns by identifying similar instances in the neighborhood of an instance.
Our bottom-up sub-setting algorithm performed on average 15% better than the top-down decision tree learned on the entire dataset.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern pattern recognition tasks use complex algorithms that take advantage
of large datasets to make more accurate predictions than traditional algorithms,
such as decision trees or k-nearest neighbors, which are better suited to describing
simple structures. While increased accuracy is often crucial, lower complexity also
has value. This paper proposes a training data selection algorithm that identifies
multiple subsets with simple structures. A learning algorithm trained on such a
subset can classify an instance belonging to the subset with better accuracy
than the traditional learning algorithms. In other words, while existing
pattern recognition algorithms attempt to learn a global mapping function to
represent the entire dataset, we argue that an ensemble of simple local
patterns may better describe the data. Hence the sub-setting algorithm
identifies multiple subsets with simple local patterns by identifying similar
instances in the neighborhood of an instance. This motivation is similar to that
of gradient boosted trees but focuses on the explainability that boosted trees
lack. The proposed algorithm thus balances
accuracy and explainable machine learning by identifying a limited number of
subsets with simple structures. We applied the proposed algorithm to the
international stroke dataset to predict the probability of survival. Our
bottom-up sub-setting algorithm performed on average 15% better than the
top-down decision tree learned on the entire dataset. The decision trees learned
on the identified subsets use some features left unused by the whole-dataset
decision tree, and each subset represents a distinct population of the data.
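The abstract describes the algorithm only at a high level: grow subsets bottom-up by collecting similar instances around a seed, then fit a simple model per subset. A minimal pure-Python sketch of that idea (the function names, the greedy seed choice, and the fixed `radius` similarity threshold are illustrative assumptions, not the paper's actual procedure):

```python
import math

def euclidean(a, b):
    # plain Euclidean distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def subset_data(X, radius):
    """Greedy bottom-up sub-setting: each still-unassigned instance seeds
    a subset that absorbs every unassigned instance within `radius` of the
    seed. A toy stand-in for the paper's neighborhood-based algorithm."""
    unassigned = set(range(len(X)))
    subsets = []
    while unassigned:
        seed = min(unassigned)  # deterministic seed choice for illustration
        members = [i for i in unassigned
                   if euclidean(X[seed], X[i]) <= radius]
        for i in members:
            unassigned.discard(i)
        subsets.append(members)
    return subsets

X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9)]
print(subset_data(X, radius=1.0))  # → [[0, 1], [2, 3]]
```

A simple learner (e.g. a shallow decision tree) would then be trained on each returned subset, keeping every local model small and explainable.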
Related papers
- OOD-Chameleon: Is Algorithm Selection for OOD Generalization Learnable? [18.801143204410913]
We formalize the task of algorithm selection for OOD generalization and investigate whether it could be approached by learning.
We propose a solution, dubbed OOD-Chameleon that treats the task as a supervised classification over candidate algorithms.
We train the model to predict the relative performance of algorithms given a dataset's characteristics.
arXiv Detail & Related papers (2024-10-03T17:52:42Z) - Classification Tree-based Active Learning: A Wrapper Approach [4.706932040794696]
This paper proposes a wrapper active learning method for classification, organizing the sampling process into a tree structure.
A classification tree constructed on an initial set of labeled samples is considered to decompose the space into low-entropy regions.
This adaptation proves to be a significant enhancement over existing active learning methods.
arXiv Detail & Related papers (2024-04-15T17:27:00Z) - Topological Quality of Subsets via Persistence Matching Diagrams [0.196629787330046]
We measure the quality of a subset concerning the dataset it represents using topological data analysis techniques.
In particular, this approach enables us to explain why the chosen subset is likely to result in poor performance of a supervised learning model.
arXiv Detail & Related papers (2023-06-04T17:08:41Z) - Hierarchical clustering with dot products recovers hidden tree structure [53.68551192799585]
In this paper we offer a new perspective on the well-established agglomerative clustering algorithm, focusing on the recovery of hierarchical structure.
We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance.
We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model.
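The recommended variant is concrete enough to sketch: merge, at each step, the pair of clusters with the highest average pairwise dot product. A toy pure-Python illustration (the `agglomerate` helper, its stopping rule, and its first-maximum tie-breaking are my assumptions, not the paper's implementation):

```python
def average_dot(A, B):
    # mean pairwise dot product between two clusters of vectors
    return sum(sum(x * y for x, y in zip(a, b))
               for a in A for b in B) / (len(A) * len(B))

def agglomerate(points, n_clusters):
    """Agglomerative clustering that merges the pair of clusters with the
    *maximum average dot product*, rather than minimum distance or
    within-cluster variance."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = max(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: average_dot(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
print(agglomerate(pts, 2))
# → [[(1.0, 0.0), (0.9, 0.1)], [(0.0, 1.0), (0.1, 0.9)]]
```

High dot product favors merging vectors that point in the same direction, which is what lets the resulting dendrogram track generative hierarchical structure in the paper's model.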
arXiv Detail & Related papers (2023-05-24T11:05:12Z) - Discrete Tree Flows via Tree-Structured Permutations [5.929956715430168]
Discrete flow-based models cannot be straightforwardly optimized with conventional deep learning methods because gradients of discrete functions are undefined or zero.
Our approach seeks to reduce computational burden and remove the need for pseudo-gradients by developing a discrete flow based on decision trees.
arXiv Detail & Related papers (2022-07-04T23:11:04Z) - Towards Diverse Evaluation of Class Incremental Learning: A Representation Learning Perspective [67.45111837188685]
Class incremental learning (CIL) algorithms aim to continually learn new object classes from incrementally arriving data.
We experimentally analyze neural network models trained by CIL algorithms using various evaluation protocols in representation learning.
arXiv Detail & Related papers (2022-06-16T11:44:11Z) - Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
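For context, the exact leverage scores that such estimators approximate have a short definition: the i-th score is the squared norm of the i-th row of an orthonormal basis for the column space. A small NumPy sketch of this exact baseline (the `tol` rank cutoff is an illustrative choice; the paper's randomized rank-revealing approach is not reproduced here):

```python
import numpy as np

def leverage_scores(A, tol=1e-10):
    """Exact statistical leverage scores of A (any rank): squared row
    norms of the left singular vectors spanning col(A)."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    r = int((s > tol * s[0]).sum())  # numerical rank
    return (U[:, :r] ** 2).sum(axis=1)

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
scores = leverage_scores(A)
print(scores)  # each score is 2/3; they sum to rank(A) = 2
```

Each score lies in [0, 1] and the scores always sum to the rank, which is why computing them exactly costs a full factorization and why randomized approximations are attractive for large matrices.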
arXiv Detail & Related papers (2021-05-23T19:21:55Z) - Towards Optimally Efficient Tree Search with Deep Learning [76.64632985696237]
This paper investigates the classical integer least-squares problem, which estimates integer signals from linear models.
The problem is NP-hard and often arises in diverse applications such as signal processing, bioinformatics, communications and machine learning.
We propose a general hyper-accelerated tree search (HATS) algorithm that employs a deep neural network to estimate the optimal heuristic for the underlying simplified memory-bounded A* algorithm.
arXiv Detail & Related papers (2021-01-07T08:00:02Z) - Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature
Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method by combining reconstruction error with $l_{2,p}$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
arXiv Detail & Related papers (2020-12-29T04:08:38Z) - MurTree: Optimal Classification Trees via Dynamic Programming and Search [61.817059565926336]
We present a novel algorithm for learning optimal classification trees based on dynamic programming and search.
Our approach uses only a fraction of the time required by the state-of-the-art and can handle datasets with tens of thousands of instances.
arXiv Detail & Related papers (2020-07-24T17:06:55Z) - Data Structures & Algorithms for Exact Inference in Hierarchical
Clustering [41.24805506595378]
We present novel dynamic-programming algorithms for exact inference in hierarchical clustering based on a novel trellis data structure.
Our algorithms scale in time and space proportionally to the powerset of $N$ elements, which is super-exponentially more efficient than explicitly considering each of the $(2N-3)!!$ possible hierarchies.
arXiv Detail & Related papers (2020-02-26T17:43:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.