Modeling Text with Decision Forests using Categorical-Set Splits
- URL: http://arxiv.org/abs/2009.09991v3
- Date: Fri, 5 Feb 2021 11:44:02 GMT
- Title: Modeling Text with Decision Forests using Categorical-Set Splits
- Authors: Mathieu Guillame-Bert, Sebastian Bruch, Petr Mitrichev, Petr Mikheev, Jan Pfeifer
- Abstract summary: In axis-aligned decision forests, the "decision" to route an input example is the result of the evaluation of a condition on a single dimension in the feature space.
We define a condition that is specific to categorical-set features and present an algorithm to learn it.
Our algorithm is efficient during training and the resulting conditions are fast to evaluate with our extension of the QuickScorer inference algorithm.
- Score: 2.434796198711328
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Decision forest algorithms typically model data by learning a binary tree
structure recursively where every node splits the feature space into two
sub-regions, sending examples into the left or right branch as a result. In
axis-aligned decision forests, the "decision" to route an input example is the
result of the evaluation of a condition on a single dimension in the feature
space. Such conditions are learned using efficient, often greedy algorithms
that optimize a local loss function. For example, a node's condition may be a
threshold function applied to a numerical feature, and its parameter may be
learned by sweeping over the set of values available at that node and choosing
a threshold that maximizes some measure of purity. Crucially, whether an
algorithm exists to learn and evaluate conditions for a feature type determines
whether a decision forest algorithm can model that feature type at all. For
example, decision forests today cannot consume textual features directly --
such features must be transformed to summary statistics instead. In this work,
we set out to bridge that gap. We define a condition that is specific to
categorical-set features -- defined as an unordered set of categorical
variables -- and present an algorithm to learn it, thereby equipping decision
forests with the ability to directly model text, albeit without preserving
sequential order. Our algorithm is efficient during training and the resulting
conditions are fast to evaluate with our extension of the QuickScorer inference
algorithm. Experiments on benchmark text classification datasets demonstrate
the utility and effectiveness of our proposal.
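The abstract above describes learning a numerical split by sweeping over the feature values available at a node and keeping the threshold that maximizes a measure of purity. The following is a minimal Python sketch of that idea, assuming Gini impurity as the purity measure; the function names and the midpoint candidate thresholds are illustrative choices, not details taken from the paper.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_numerical_threshold(values, labels):
    """Sweep midpoints between consecutive sorted feature values and return
    the threshold (and its Gini gain) for the split `value <= threshold`."""
    parent = gini(labels)
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best_threshold, best_gain = None, 0.0
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # identical feature values cannot be separated by a threshold
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2.0
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain

# Example: best_numerical_threshold([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"])
# returns (2.5, 0.5), the threshold that perfectly separates the two classes.
```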
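The paper's contribution is a condition defined directly over categorical-set features (unordered sets of tokens). As an illustration only, the sketch below assumes one natural form of such a condition, routing an example to the positive branch when its token set intersects a learned mask of items, and grows the mask greedily using the same Gini gain as above. The names `evaluate_condition` and `learn_mask` and the greedy loop are assumptions made for exposition; they are not the authors' training algorithm or their QuickScorer extension.

```python
from collections import Counter

def gini(labels):
    """Gini impurity (same helper as in the threshold sketch above)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def evaluate_condition(token_set, mask):
    """Positive branch iff the example's categorical set intersects the mask."""
    return not mask.isdisjoint(token_set)

def learn_mask(examples, labels, vocabulary):
    """Greedily grow a mask of items that maximizes the Gini gain of the split
    'token set intersects mask'. `examples` is a sequence of sets of tokens."""
    parent = gini(labels)
    n = len(labels)

    def split_gain(mask):
        left = [y for x, y in zip(examples, labels) if not mask.isdisjoint(x)]
        right = [y for x, y in zip(examples, labels) if mask.isdisjoint(x)]
        return parent - (len(left) * gini(left) + len(right) * gini(right)) / n

    mask, best_gain = set(), 0.0
    while True:
        best_item, best_item_gain = None, best_gain
        for item in vocabulary:
            if item in mask:
                continue
            g = split_gain(mask | {item})
            if g > best_item_gain:
                best_item, best_item_gain = item, g
        if best_item is None:
            break  # no single additional item improves the split
        mask.add(best_item)
        best_gain = best_item_gain
    return mask, best_gain

# Example: short texts as bags of tokens (order is discarded, as in the paper).
examples = [{"good", "movie"}, {"great", "film"}, {"bad", "movie"}, {"awful", "film"}]
labels = ["pos", "pos", "neg", "neg"]
vocabulary = {t for x in examples for t in x}
mask, gain = learn_mask(examples, labels, vocabulary)
# mask might be {"good", "great"}; evaluate_condition({"good", "plot"}, mask) -> True
```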
Related papers
- Utilising Explainable Techniques for Quality Prediction in a Complex Textiles Manufacturing Use Case [0.0]
This paper develops an approach to classify instances of product failure in a complex textiles manufacturing dataset using explainable techniques.
In investigating the trade-off between accuracy and explainability, three different tree-based classification algorithms were evaluated.
arXiv Detail & Related papers (2024-07-26T06:50:17Z)
- Optimized Feature Generation for Tabular Data via LLMs with Decision Tree Reasoning [53.241569810013836]
We propose a new framework based on large language models (LLMs) and decision tree reasoning (OCTree).
Our key idea is to leverage LLMs' reasoning capabilities to find good feature generation rules without manually specifying the search space.
Our empirical results demonstrate that this simple framework consistently enhances the performance of various prediction models.
arXiv Detail & Related papers (2024-06-12T08:31:34Z)
- Simplification of Forest Classifiers and Regressors [1.8275108630751844]
We study the problem of sharing as many branching conditions of a given forest classifier or regressor as possible.
We propose an algorithm for the original problem that uses, as a subroutine, an algorithm solving this problem efficiently.
The effectiveness of our method is demonstrated through comprehensive experiments using 21 datasets.
arXiv Detail & Related papers (2022-12-14T08:49:02Z)
- Efficient Non-Parametric Optimizer Search for Diverse Tasks [93.64739408827604]
We present the first efficient, scalable, and general framework that can directly search on the tasks of interest.
Inspired by the innate tree structure of the underlying math expressions, we re-arrange the spaces into a super-tree.
We adopt an adaptation of the Monte Carlo method to tree search, equipped with rejection sampling and equivalent-form detection.
arXiv Detail & Related papers (2022-09-27T17:51:31Z)
- Explaining random forest prediction through diverse rulesets [0.0]
Local Tree eXtractor (LTreeX) is able to explain the forest prediction for a given test instance with a few diverse rules.
We show that our proposed approach substantially outperforms other explainable methods in terms of predictive performance.
arXiv Detail & Related papers (2022-03-29T12:54:57Z)
- Nonparametric Feature Selection by Random Forests and Deep Neural Networks [4.232614032390374]
We propose a nonparametric feature selection algorithm that incorporates random forests and deep neural networks.
Although the algorithm is proposed using standard random forests, it can be widely adapted to other machine learning algorithms.
arXiv Detail & Related papers (2022-01-18T08:45:33Z)
- Making CNNs Interpretable by Building Dynamic Sequential Decision Forests with Top-down Hierarchy Learning [62.82046926149371]
We propose a generic model transfer scheme to make Convolutional Neural Networks (CNNs) interpretable.
We achieve this by building a differentiable decision forest on top of CNNs.
We name the transferred model deep Dynamic Sequential Decision Forest (dDSDF).
arXiv Detail & Related papers (2021-06-05T07:41:18Z)
- E2E-FS: An End-to-End Feature Selection Method for Neural Networks [0.3222802562733786]
We present a novel feature selection algorithm, called End-to-End Feature Selection (E2E-FS).
Our algorithm, similar to the lasso approach, is solved with gradient descent techniques.
Despite its hard restrictions, experimental results show that this algorithm can be used with any learning model.
arXiv Detail & Related papers (2020-12-14T16:19:25Z)
- Stochastic Optimization Forests [60.523606291705214]
We show how to train forest decision policies by growing trees that choose splits to directly optimize the downstream decision quality, rather than splitting to improve prediction accuracy as in the standard random forest algorithm.
We show that our approximate splitting criteria can reduce running time hundredfold, while achieving performance close to forest algorithms that exactly re-optimize for every candidate split.
arXiv Detail & Related papers (2020-08-17T16:56:06Z)
- MurTree: Optimal Classification Trees via Dynamic Programming and Search [61.817059565926336]
We present a novel algorithm for learning optimal classification trees based on dynamic programming and search.
Our approach uses only a fraction of the time required by the state-of-the-art and can handle datasets with tens of thousands of instances.
arXiv Detail & Related papers (2020-07-24T17:06:55Z)
- Optimal Clustering from Noisy Binary Feedback [75.17453757892152]
We study the problem of clustering a set of items from binary user feedback.
We devise an algorithm with a minimal cluster recovery error rate.
For adaptive selection, we develop an algorithm inspired by the derivation of the information-theoretical error lower bounds.
arXiv Detail & Related papers (2019-10-14T09:18:26Z)