Binary Split Categorical feature with Mean Absolute Error Criteria in CART
- URL: http://arxiv.org/abs/2511.08470v1
- Date: Wed, 12 Nov 2025 02:00:25 GMT
- Title: Binary Split Categorical feature with Mean Absolute Error Criteria in CART
- Authors: Peng Yu, Yike Chen, Chao Xu, Albert Bifet, Jesse Read
- Abstract summary: Handling categorical features under the Mean Absolute Error (MAE) criterion has traditionally relied on various numerical encoding methods. We present a novel and efficient splitting algorithm that addresses the challenges of handling categorical features with the MAE criterion.
- Score: 18.476195198589462
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the context of the Classification and Regression Trees (CART) algorithm, the efficient splitting of categorical features using standard criteria like GINI and Entropy is well-established. However, using the Mean Absolute Error (MAE) criterion for categorical features has traditionally relied on various numerical encoding methods. This paper demonstrates that unsupervised numerical encoding methods are not viable for the MAE criterion. Furthermore, we present a novel and efficient splitting algorithm that addresses the challenges of handling categorical features with the MAE criterion. Our findings underscore the limitations of existing approaches and offer a promising solution to enhance the handling of categorical data in CART algorithms.
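The abstract does not spell out the paper's algorithm, so as context here is a minimal sketch of the naive baseline it improves on: exhaustively scoring every binary partition of the category set under the MAE impurity (sum of absolute deviations from the branch median). All names are illustrative, and the exhaustive search is exponential in the number of categories, which is precisely the cost an efficient algorithm must avoid.

```python
from itertools import combinations
import statistics

def mae(values):
    """MAE impurity: sum of absolute deviations from the median (unnormalized)."""
    m = statistics.median(values)
    return sum(abs(v - m) for v in values)

def best_mae_split(x, y):
    """Brute-force search over all binary partitions of the categories in x.

    Returns (cost, left_categories). Exponential in the number of
    categories; shown only as the baseline an efficient method replaces.
    """
    cats = sorted(set(x))
    by_cat = {c: [yi for xi, yi in zip(x, y) if xi == c] for c in cats}
    best = (float("inf"), None)
    # Fix the first category on the left branch so each partition
    # is enumerated exactly once.
    rest = cats[1:]
    for r in range(len(rest) + 1):
        for subset in combinations(rest, r):
            left = {cats[0], *subset}
            right = set(cats) - left
            if not right:
                continue
            left_y = [v for c in left for v in by_cat[c]]
            right_y = [v for c in right for v in by_cat[c]]
            cost = mae(left_y) + mae(right_y)
            if cost < best[0]:
                best = (cost, left)
    return best
```

On a toy sample with categories "a", "b", "c" and targets clustered so that "c" is the outlying group, the search correctly isolates it on one side of the split.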
Related papers
- Principled Algorithms for Optimizing Generalized Metrics in Binary Classification [53.604375124674796]
We introduce principled algorithms for optimizing generalized metrics, supported by $H$-consistency and finite-sample generalization bounds. Our approach reformulates metric optimization as a generalized cost-sensitive learning problem. We develop new algorithms, METRO, with strong theoretical performance guarantees.
arXiv Detail & Related papers (2025-12-29T01:33:42Z) - Adaptive Retrieval and Scalable Indexing for k-NN Search with Cross-Encoders [77.84801537608651]
Cross-encoder (CE) models which compute similarity by jointly encoding a query-item pair perform better than embedding-based models (dual-encoders) at estimating query-item relevance.
We propose a sparse-matrix factorization based method that efficiently computes latent query and item embeddings to approximate CE scores and performs k-NN search with the approximate CE similarity.
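As a toy illustration of the idea (not the paper's implementation), the sketch below factorizes a matrix of scores from a hypothetical low-rank scorer standing in for an expensive cross-encoder, recovers item embeddings, embeds a new query from a handful of exact score calls, and ranks items by the approximated similarity. The dense rank-4 scorer and all names are assumptions; the actual method uses sparse-matrix factorization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a cross-encoder: a low-rank scoring function
# that is "expensive" to call, so we want to call it only a few times.
n_items, n_train_queries, rank = 50, 30, 4
item_latent = rng.normal(size=(n_items, rank))

def ce_score(q_latent, item_idx):
    return float(q_latent @ item_latent[item_idx])

# Step 1: factorize a matrix of observed scores to obtain item embeddings.
train_q = rng.normal(size=(n_train_queries, rank))
S = train_q @ item_latent.T                    # observed query-item scores
U, s, Vt = np.linalg.svd(S, full_matrices=False)
item_emb = Vt[:rank].T * s[:rank]              # one embedding per item

# Step 2: embed a new query from a few exact scores on "anchor" items,
# then approximate all remaining scores and rank items by them.
query = rng.normal(size=rank)
anchors = rng.choice(n_items, size=8, replace=False)
exact = np.array([ce_score(query, i) for i in anchors])
q_emb, *_ = np.linalg.lstsq(item_emb[anchors], exact, rcond=None)
approx_scores = item_emb @ q_emb
top_k = np.argsort(approx_scores)[::-1][:5]
```

Because the toy scorer is exactly rank 4, the factorization recovers the scores exactly; with a real cross-encoder the approximation error depends on the chosen rank.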
arXiv Detail & Related papers (2024-05-06T17:14:34Z) - Synergistic eigenanalysis of covariance and Hessian matrices for enhanced binary classification [72.77513633290056]
We present a novel approach that combines the eigenanalysis of a covariance matrix evaluated on a training set with a Hessian matrix evaluated on a deep learning model.
Our method captures intricate patterns and relationships, enhancing classification performance.
arXiv Detail & Related papers (2024-02-14T16:10:42Z) - Enumerating the k-fold configurations in multi-class classification problems [0.0]
The reproducibility crisis in artificial intelligence partly results from the irreproducibility of reported k-fold cross-validation-based performance scores.
Recently, we introduced numerical techniques to test the consistency of claimed performance scores and experimental setups.
In a crucial use case, the method relies on the enumeration of all k-fold configurations, for which we proposed an algorithm in the binary classification case.
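The abstract does not give the enumeration algorithm, so here is a sketch of one plausible formalization for the binary case: enumerating the distinct ways p positive and n negative samples can be distributed over k folds of near-equal size, up to reordering of equal-size folds. The recursion and its tie-breaking rule are illustrative assumptions, not the paper's procedure.

```python
def kfold_configurations(p, n, k):
    """Enumerate per-fold (positives, negatives) count vectors, up to fold order.

    A configuration fixes how the p positive and n negative samples are
    spread over k folds whose sizes differ by at most one; this is what
    determines every attainable fold-level performance score.
    """
    total = p + n
    base, extra = divmod(total, k)
    fold_sizes = [base + 1] * extra + [base] * (k - extra)
    configs = []

    def rec(i, p_left, partial):
        if i == len(fold_sizes):
            if p_left == 0:
                configs.append(tuple(partial))
            return
        size = fold_sizes[i]
        lo = max(0, p_left - sum(fold_sizes[i + 1:]))
        hi = min(size, p_left)
        for pos in range(lo, hi + 1):
            # Enforce non-increasing positive counts among equal-size folds
            # so each configuration is produced exactly once.
            if partial and fold_sizes[i - 1] == size and pos > partial[-1][0]:
                continue
            rec(i + 1, p_left - pos, partial + [(pos, size - pos)])

    rec(0, p, [])
    return configs
```

For example, 2 positives and 2 negatives over 2 folds admit exactly two configurations: both folds mixed, or positives and negatives fully separated.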
arXiv Detail & Related papers (2024-01-24T22:40:00Z) - Hierarchical confusion matrix for classification performance evaluation [0.0]
We develop the concept of a hierarchical confusion matrix and prove its applicability to all types of hierarchical classification problems.
We use measures based on the novel confusion matrix to evaluate models within a benchmark for three real world hierarchical classification applications.
The results outline the reasonability of this approach and its usefulness to evaluate hierarchical classification problems.
arXiv Detail & Related papers (2023-06-15T19:31:59Z) - Polar Encoding: A Simple Baseline Approach for Classification with Missing Values [1.7205106391379026]
Polar encoding is a representation of $[0,1]$-valued attributes with missing values.
It does not require imputation, ensures that missing values are equidistant from non-missing values, and lets decision tree algorithms choose how to split missing values.
We show that, in terms of the resulting classification performance, polar encoding performs better than the state-of-the-art strategies "multiple imputation by chained equations" and "multiple imputation with denoising autoencoders".
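The stated properties admit a simple geometric realization, sketched below under the assumption (not confirmed by the abstract) that non-missing values are mapped onto the unit quarter-circle and missing values to the origin, which makes every missing value equidistant, at distance 1, from every non-missing one.

```python
import math

def polar_encode(v):
    """Encode a [0,1]-valued attribute, possibly missing, as two features.

    A non-missing value v maps to a point on the unit quarter-circle; a
    missing value (None) maps to the origin, which lies at Euclidean
    distance exactly 1 from every encoded non-missing value. This is an
    illustrative construction matching the properties described in the
    abstract, not necessarily the paper's exact definition.
    """
    if v is None:
        return (0.0, 0.0)
    return (math.cos(v * math.pi / 2), math.sin(v * math.pi / 2))
```

A decision tree can then split on either coordinate, deciding for itself which branch the missing-value origin falls into.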
arXiv Detail & Related papers (2022-10-04T20:56:24Z) - Unbiased Subdata Selection for Fair Classification: A Unified Framework and Scalable Algorithms [0.8376091455761261]
We show that many classification models within this framework can be recast as mixed-integer convex programs.
We then show that, when the classification outcomes are fixed, the resulting problem, termed "unbiased subdata selection," is strongly solvable.
This motivates us to develop an iterative refining strategy (IRS) to solve the classification instances.
arXiv Detail & Related papers (2020-12-22T21:09:38Z) - Classification with Rejection Based on Cost-sensitive Classification [83.50402803131412]
We propose a novel method of classification with rejection based on ensemble learning.
Experimental results demonstrate the usefulness of our proposed approach in clean, noisy, and positive-unlabeled classification.
arXiv Detail & Related papers (2020-10-22T14:05:05Z) - Coherent Hierarchical Multi-Label Classification Networks [56.41950277906307]
C-HMCNN(h) is a novel approach for HMC problems, which exploits hierarchy information in order to produce predictions coherent with the constraint and improve performance.
We conduct an extensive experimental analysis showing the superior performance of C-HMCNN(h) when compared to state-of-the-art models.
arXiv Detail & Related papers (2020-10-20T09:37:02Z) - Self-Weighted Robust LDA for Multiclass Classification with Edge Classes [111.5515086563592]
A novel self-weighted robust LDA with l21-norm based between-class distance criterion, called SWRLDA, is proposed for multi-class classification.
The proposed SWRLDA is easy to implement, and converges fast in practice.
arXiv Detail & Related papers (2020-09-24T12:32:55Z) - High-Dimensional Quadratic Discriminant Analysis under Spiked Covariance Model [101.74172837046382]
We propose a novel quadratic classification technique, the parameters of which are chosen such that the Fisher discriminant ratio is maximized.
Numerical simulations show that the proposed classifier not only outperforms the classical R-QDA for both synthetic and real data but also requires lower computational complexity.
arXiv Detail & Related papers (2020-06-25T12:00:26Z) - Clustering and Classification with Non-Existence Attributes: A Sentenced Discrepancy Measure Based Technique [0.0]
Clustering approaches cannot be applied directly to such data unless it is pre-processed by techniques such as imputation or marginalization. We overcome this drawback by utilizing a Sentenced Discrepancy Measure, which we refer to as the Attribute Weighted Penalty based Discrepancy (AWPD). This measure allows our method to be applied directly to datasets containing Non-Existence attributes and provides a way to detect unstructured Non-Existence attributes with the best accuracy rate and minimum cost.
arXiv Detail & Related papers (2020-02-24T17:56:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.