Controlling the False Split Rate in Tree-Based Aggregation
- URL: http://arxiv.org/abs/2108.05350v1
- Date: Wed, 11 Aug 2021 17:59:22 GMT
- Title: Controlling the False Split Rate in Tree-Based Aggregation
- Authors: Simeng Shao, Jacob Bien, Adel Javanmard
- Abstract summary: We propose a hypothesis testing algorithm for tree-based aggregation.
We focus on two main examples of tree-based aggregation, one which involves aggregating means and the other which involves aggregating regression coefficients.
- Score: 11.226095593522691
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In many domains, data measurements can naturally be associated with the
leaves of a tree, expressing the relationships among these measurements. For
example, companies belong to industries, which in turn belong to ever coarser
divisions such as sectors; microbes are commonly arranged in a taxonomic
hierarchy from species to kingdoms; street blocks belong to neighborhoods,
which in turn belong to larger-scale regions. The problem of tree-based
aggregation that we consider in this paper asks which of these tree-defined
subgroups of leaves should really be treated as a single entity and which of
these entities should be distinguished from each other.
We introduce the "false split rate", an error measure that describes the
degree to which subgroups have been split when they should not have been. We
then propose a multiple hypothesis testing algorithm for tree-based
aggregation, which we prove controls this error measure. We focus on two main
examples of tree-based aggregation, one which involves aggregating means and
the other which involves aggregating regression coefficients. We apply this
methodology to aggregate stocks based on their volatility and to aggregate
neighborhoods of New York City based on taxi fares.
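To make the "false split rate" idea concrete, here is a rough illustrative sketch (not the paper's exact definition or algorithm): given a true grouping of leaves and an estimated grouping, count the leaf pairs that were separated by the estimate even though they truly belong together, as a fraction of all separated pairs. All names and the toy stock example are assumptions for illustration.

```python
# Illustrative analogue of a "false split rate": among all leaf pairs
# that the estimated partition separates, what fraction truly belong
# to the same group? (A sketch, not the paper's formal definition.)
from itertools import combinations

def false_split_rate(true_groups, est_groups):
    """true_groups / est_groups: dicts mapping leaf -> group label."""
    leaves = list(true_groups)
    split_pairs = 0   # pairs separated by the estimated partition
    false_splits = 0  # separated pairs that truly belong together
    for a, b in combinations(leaves, 2):
        if est_groups[a] != est_groups[b]:
            split_pairs += 1
            if true_groups[a] == true_groups[b]:
                false_splits += 1
    return false_splits / max(split_pairs, 1)

# Hypothetical example: two tech stocks that truly behave alike,
# plus one energy stock; the estimate wrongly splits the tech pair.
true_g = {"AAPL": "tech", "MSFT": "tech", "XOM": "energy"}
est_g  = {"AAPL": "g1",   "MSFT": "g2",   "XOM": "g3"}
print(false_split_rate(true_g, est_g))  # -> 0.3333... (1 of 3 splits is false)
```

The paper's procedure controls this kind of error measure via multiple hypothesis testing along the tree, rather than computing it against a known truth as this sketch does.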
Related papers
- When does Subagging Work? [0.0]
We study the effectiveness of subagging, or subsample aggregating, on regression trees.
We formalize that (i) the bias depends on the diameter of the cells, so trees with few splits tend to be biased.
We compare the performance of subagging to that of trees across different numbers of splits.
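Subagging itself is simple to sketch: fit trees on subsamples drawn without replacement (unlike bagging's bootstrap) and average their predictions. The sketch below uses depth-1 "stumps" in place of full regression trees to stay self-contained; all parameter values are illustrative assumptions.

```python
# Minimal sketch of subagging (subsample aggregating) for regression,
# using decision stumps as the base learners. Illustrative only.
import numpy as np

def fit_stump(x, y):
    """Fit the best single-split regression stump on a 1-D feature."""
    best_sse, best = np.inf, None
    for t in np.unique(x):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (t, left.mean(), right.mean())
    return best

def subagg_predict(x, y, x_new, n_estimators=50, frac=0.5, seed=0):
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_estimators):
        # subsample WITHOUT replacement (this is what distinguishes
        # subagging from bootstrap-based bagging)
        idx = rng.choice(len(x), size=int(frac * len(x)), replace=False)
        t, left_mean, right_mean = fit_stump(x[idx], y[idx])
        preds.append(np.where(x_new <= t, left_mean, right_mean))
    return np.mean(preds, axis=0)  # aggregate by averaging

x = np.arange(10.0)
y = (x >= 5).astype(float)  # toy step function
pred = subagg_predict(x, y, np.array([1.0, 8.0]))
print(pred)  # close to [0, 1]
```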
arXiv Detail & Related papers (2024-04-02T10:44:55Z)
- Why do Random Forests Work? Understanding Tree Ensembles as Self-Regularizing Adaptive Smoothers [68.76846801719095]
We argue that the current high-level dichotomy into bias- and variance-reduction prevalent in statistics is insufficient to understand tree ensembles.
We show that forests can improve upon trees by three distinct mechanisms that are usually implicitly entangled.
arXiv Detail & Related papers (2024-02-02T15:36:43Z)
- Effective and Efficient Federated Tree Learning on Hybrid Data [80.31870543351918]
We propose HybridTree, a novel federated learning approach that enables federated tree learning on hybrid data.
We observe the existence of consistent split rules in trees and show that the knowledge of parties can be incorporated into the lower layers of a tree.
Our experiments demonstrate that HybridTree can achieve comparable accuracy to the centralized setting with low computational and communication overhead.
arXiv Detail & Related papers (2023-10-18T10:28:29Z)
- Distribution and volume based scoring for Isolation Forests [0.0]
We make two contributions to the Isolation Forest method for anomaly and outlier detection.
The first is an information-theoretically motivated generalisation of the score function that is used to aggregate the scores across random tree estimators.
The second is an alternative scoring function at the level of the individual tree estimator, in which we replace the depth-based scoring of the Isolation Forest with one based on the hyper-volumes associated with an isolation tree's leaf nodes.
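The volume-based idea can be sketched in a few lines: a leaf's bounding box defines a hyper-volume, and a tighter (smaller-volume) box suggests a more isolated, more anomalous region. This is an illustrative sketch of the general idea under that assumption, not the authors' scoring function.

```python
# Illustrative volume-based leaf score: smaller leaf volume -> more
# isolated point -> higher anomaly score (here, negative log-volume).
# A sketch of the general idea, not the paper's exact formula.
import math

def leaf_volume_score(leaf_bounds):
    """leaf_bounds: list of (low, high) intervals, one per feature,
    describing the bounding box of the leaf a point falls into."""
    log_volume = sum(math.log(hi - lo) for lo, hi in leaf_bounds)
    return -log_volume

# A unit-width box halved along the second axis: volume 0.5,
# so the score is -log(0.5) = log(2).
print(leaf_volume_score([(0.0, 1.0), (0.0, 0.5)]))
```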
arXiv Detail & Related papers (2023-09-20T16:27:10Z)
- Hierarchical clustering with dot products recovers hidden tree structure [53.68551192799585]
In this paper we offer a new perspective on the well established agglomerative clustering algorithm, focusing on recovery of hierarchical structure.
We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance.
We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model.
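The recommended variant is easy to state in code: at each step, merge the pair of clusters with the largest average pairwise dot product between their points. The brute-force sketch below assumes a small dataset and is illustrative only, not the authors' implementation.

```python
# Agglomerative clustering where clusters are merged by MAXIMUM average
# dot product (rather than minimum distance). Brute-force sketch.
import numpy as np

def agglomerate_by_dot(X):
    """Return the sequence of (cluster_a, cluster_b) merges."""
    clusters = {i: [i] for i in range(len(X))}
    merges = []
    while len(clusters) > 1:
        best_score, best_pair = -np.inf, None
        keys = list(clusters)
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                a, b = keys[i], keys[j]
                # average dot product over all cross-cluster point pairs
                score = np.mean(X[clusters[a]] @ X[clusters[b]].T)
                if score > best_score:
                    best_score, best_pair = score, (a, b)
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
        merges.append((a, b))
    return merges

# Two tight groups of directions: {0, 1} and {2, 3}.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
merges = agglomerate_by_dot(X)
print(merges)  # the two within-group merges happen first
```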
arXiv Detail & Related papers (2023-05-24T11:05:12Z)
- HiPerformer: Hierarchically Permutation-Equivariant Transformer for Time Series Forecasting [56.95572957863576]
We propose a hierarchically permutation-equivariant model that considers both the relationship among components in the same group and the relationship among groups.
The experiments conducted on real-world data demonstrate that the proposed method outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2023-05-14T05:11:52Z)
- Factor-augmented tree ensembles [0.0]
This manuscript proposes to extend the information set of time-series regression trees with latent stationary factors extracted via state-space methods.
This allows the trees to handle predictors that exhibit measurement error, non-stationary trends, seasonality, and/or irregularities such as missing observations.
Empirically, ensembles of these factor-augmented trees provide a reliable approach for macro-finance problems.
arXiv Detail & Related papers (2021-11-27T22:44:54Z)
- Exemplars can Reciprocate Principal Components [0.0]
Category Trees is a clustering method that creates tree structures that branch on category type rather than on feature.
The theory is demonstrated using the Portugal Forest Fires dataset as a case study.
arXiv Detail & Related papers (2021-03-22T12:46:29Z)
- Trees-Based Models for Correlated Data [8.629912408966147]
We show the problems that arise when implementing standard trees-based regression models, which ignore the correlation structure.
Our new approach explicitly takes the correlation structure into account in the splitting criterion.
The superiority of our new approach over trees-based models that do not account for the correlation is supported by simulation experiments and real data analyses.
arXiv Detail & Related papers (2021-02-16T12:30:48Z)
- Forest R-CNN: Large-Vocabulary Long-Tailed Object Detection and Instance Segmentation [75.93960390191262]
We exploit prior knowledge of the relations among object categories to cluster fine-grained classes into coarser parent classes.
We propose a simple yet effective resampling method, NMS Resampling, to re-balance the data distribution.
Our method, termed as Forest R-CNN, can serve as a plug-and-play module being applied to most object recognition models.
arXiv Detail & Related papers (2020-08-13T03:52:37Z)
- Pairwise Supervision Can Provably Elicit a Decision Boundary [84.58020117487898]
Similarity learning is the problem of eliciting useful representations by predicting the relationship between a pair of patterns.
We show that similarity learning is capable of solving binary classification by directly eliciting a decision boundary.
arXiv Detail & Related papers (2020-06-11T05:35:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and accepts no responsibility for any consequences of its use.