Controlling the False Split Rate in Tree-Based Aggregation
- URL: http://arxiv.org/abs/2108.05350v1
- Date: Wed, 11 Aug 2021 17:59:22 GMT
- Title: Controlling the False Split Rate in Tree-Based Aggregation
- Authors: Simeng Shao, Jacob Bien, Adel Javanmard
- Abstract summary: We propose a hypothesis testing algorithm for tree-based aggregation.
We focus on two main examples of tree-based aggregation, one which involves aggregating means and the other which involves aggregating regression coefficients.
- Score: 11.226095593522691
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In many domains, data measurements can naturally be associated with the
leaves of a tree, expressing the relationships among these measurements. For
example, companies belong to industries, which in turn belong to ever coarser
divisions such as sectors; microbes are commonly arranged in a taxonomic
hierarchy from species to kingdoms; street blocks belong to neighborhoods,
which in turn belong to larger-scale regions. The problem of tree-based
aggregation that we consider in this paper asks which of these tree-defined
subgroups of leaves should really be treated as a single entity and which of
these entities should be distinguished from each other.
We introduce the "false split rate", an error measure that describes the
degree to which subgroups have been split when they should not have been. We
then propose a multiple hypothesis testing algorithm for tree-based
aggregation, which we prove controls this error measure. We focus on two main
examples of tree-based aggregation, one which involves aggregating means and
the other which involves aggregating regression coefficients. We apply this
methodology to aggregate stocks based on their volatility and to aggregate
neighborhoods of New York City based on taxi fares.
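To make the "false split rate" idea concrete, here is a rough illustrative sketch (not the paper's exact definition or algorithm): given a true grouping of leaves and an estimated grouping, count the leaf pairs that were separated by the estimate even though they truly belong together, as a fraction of all separated pairs. All names and the toy stock example are assumptions for illustration.

```python
# Illustrative analogue of a "false split rate": among all leaf pairs
# that the estimated partition separates, what fraction truly belong
# to the same group? (A sketch, not the paper's formal definition.)
from itertools import combinations

def false_split_rate(true_groups, est_groups):
    """true_groups / est_groups: dicts mapping leaf -> group label."""
    leaves = list(true_groups)
    split_pairs = 0   # pairs separated by the estimated partition
    false_splits = 0  # separated pairs that truly belong together
    for a, b in combinations(leaves, 2):
        if est_groups[a] != est_groups[b]:
            split_pairs += 1
            if true_groups[a] == true_groups[b]:
                false_splits += 1
    return false_splits / max(split_pairs, 1)

# Hypothetical example: two tech stocks that truly behave alike,
# plus one energy stock; the estimate wrongly splits the tech pair.
true_g = {"AAPL": "tech", "MSFT": "tech", "XOM": "energy"}
est_g  = {"AAPL": "g1",   "MSFT": "g2",   "XOM": "g3"}
print(false_split_rate(true_g, est_g))  # -> 0.3333... (1 of 3 splits is false)
```

The paper's procedure controls this kind of error measure via multiple hypothesis testing along the tree, rather than computing it against a known truth as this sketch does.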
Related papers
- When does Subagging Work? [0.0]
We study the effectiveness of subagging, or subsample aggregating, on regression trees.
We formalize that (i) the bias depends on the diameter of the cells, so trees with few splits tend to be biased.
We compare the performance of subagging to that of trees across different numbers of splits.
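Subagging itself is simple to sketch: fit trees on subsamples drawn without replacement (unlike bagging's bootstrap) and average their predictions. The sketch below uses depth-1 "stumps" in place of full regression trees to stay self-contained; all parameter values are illustrative assumptions.

```python
# Minimal sketch of subagging (subsample aggregating) for regression,
# using decision stumps as the base learners. Illustrative only.
import numpy as np

def fit_stump(x, y):
    """Fit the best single-split regression stump on a 1-D feature."""
    best_sse, best = np.inf, None
    for t in np.unique(x):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (t, left.mean(), right.mean())
    return best

def subagg_predict(x, y, x_new, n_estimators=50, frac=0.5, seed=0):
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_estimators):
        # subsample WITHOUT replacement (this is what distinguishes
        # subagging from bootstrap-based bagging)
        idx = rng.choice(len(x), size=int(frac * len(x)), replace=False)
        t, left_mean, right_mean = fit_stump(x[idx], y[idx])
        preds.append(np.where(x_new <= t, left_mean, right_mean))
    return np.mean(preds, axis=0)  # aggregate by averaging

x = np.arange(10.0)
y = (x >= 5).astype(float)  # toy step function
pred = subagg_predict(x, y, np.array([1.0, 8.0]))
print(pred)  # close to [0, 1]
```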
arXiv Detail & Related papers (2024-04-02T10:44:55Z)
- Why do Random Forests Work? Understanding Tree Ensembles as Self-Regularizing Adaptive Smoothers [68.76846801719095]
We argue that the current high-level dichotomy into bias- and variance-reduction prevalent in statistics is insufficient to understand tree ensembles.
We show that forests can improve upon trees by three distinct mechanisms that are usually implicitly entangled.
arXiv Detail & Related papers (2024-02-02T15:36:43Z)
- Effective and Efficient Federated Tree Learning on Hybrid Data [80.31870543351918]
We propose HybridTree, a novel federated learning approach that enables federated tree learning on hybrid data.
We observe the existence of consistent split rules in trees and show that the knowledge of parties can be incorporated into the lower layers of a tree.
Our experiments demonstrate that HybridTree can achieve comparable accuracy to the centralized setting with low computational and communication overhead.
arXiv Detail & Related papers (2023-10-18T10:28:29Z)
- Distribution and volume based scoring for Isolation Forests [0.0]
We make two contributions to the Isolation Forest method for anomaly and outlier detection.
The first is an information-theoretically motivated generalisation of the score function that is used to aggregate the scores across random tree estimators.
The second is an alternative scoring function at the level of the individual tree estimator, in which we replace the depth-based scoring of the Isolation Forest with one based on the hyper-volumes associated with an isolation tree's leaf nodes.
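The volume-based idea can be sketched in a few lines: a leaf's bounding box defines a hyper-volume, and a tighter (smaller-volume) box suggests a more isolated, more anomalous region. This is an illustrative sketch of the general idea under that assumption, not the authors' scoring function.

```python
# Illustrative volume-based leaf score: smaller leaf volume -> more
# isolated point -> higher anomaly score (here, negative log-volume).
# A sketch of the general idea, not the paper's exact formula.
import math

def leaf_volume_score(leaf_bounds):
    """leaf_bounds: list of (low, high) intervals, one per feature,
    describing the bounding box of the leaf a point falls into."""
    log_volume = sum(math.log(hi - lo) for lo, hi in leaf_bounds)
    return -log_volume

# A unit-width box halved along the second axis: volume 0.5,
# so the score is -log(0.5) = log(2).
print(leaf_volume_score([(0.0, 1.0), (0.0, 0.5)]))
```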
arXiv Detail & Related papers (2023-09-20T16:27:10Z)
- Hierarchical clustering with dot products recovers hidden tree structure [53.68551192799585]
In this paper we offer a new perspective on the well established agglomerative clustering algorithm, focusing on recovery of hierarchical structure.
We recommend a simple variant of the standard algorithm, in which clusters are merged by maximum average dot product and not, for example, by minimum distance or within-cluster variance.
We demonstrate that the tree output by this algorithm provides a bona fide estimate of generative hierarchical structure in data, under a generic probabilistic graphical model.
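The recommended variant is easy to state in code: at each step, merge the pair of clusters with the largest average pairwise dot product between their points. The brute-force sketch below assumes a small dataset and is illustrative only, not the authors' implementation.

```python
# Agglomerative clustering where clusters are merged by MAXIMUM average
# dot product (rather than minimum distance). Brute-force sketch.
import numpy as np

def agglomerate_by_dot(X):
    """Return the sequence of (cluster_a, cluster_b) merges."""
    clusters = {i: [i] for i in range(len(X))}
    merges = []
    while len(clusters) > 1:
        best_score, best_pair = -np.inf, None
        keys = list(clusters)
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                a, b = keys[i], keys[j]
                # average dot product over all cross-cluster point pairs
                score = np.mean(X[clusters[a]] @ X[clusters[b]].T)
                if score > best_score:
                    best_score, best_pair = score, (a, b)
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
        merges.append((a, b))
    return merges

# Two tight groups of directions: {0, 1} and {2, 3}.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
merges = agglomerate_by_dot(X)
print(merges)  # the two within-group merges happen first
```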
arXiv Detail & Related papers (2023-05-24T11:05:12Z)
- HiPerformer: Hierarchically Permutation-Equivariant Transformer for Time Series Forecasting [56.95572957863576]
We propose a hierarchically permutation-equivariant model that considers both the relationship among components in the same group and the relationship among groups.
The experiments conducted on real-world data demonstrate that the proposed method outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2023-05-14T05:11:52Z)
- Factor-augmented tree ensembles [0.0]
This manuscript proposes to extend the information set of time-series regression trees with latent stationary factors extracted via state-space methods.
This allows the trees to handle predictors that exhibit measurement error, non-stationary trends, seasonality, and/or irregularities such as missing observations.
Empirically, ensembles of these factor-augmented trees provide a reliable approach for macro-finance problems.
arXiv Detail & Related papers (2021-11-27T22:44:54Z)
- Exemplars can Reciprocate Principal Components [0.0]
Category Trees is a clustering method that creates tree structures that branch on category type rather than on feature.
The theory is demonstrated using the Portugal Forest Fires dataset as a case study.
arXiv Detail & Related papers (2021-03-22T12:46:29Z)
- Trees-Based Models for Correlated Data [8.629912408966147]
We show the problems that arise when implementing standard trees-based regression models, which ignore the correlation structure.
Our new approach explicitly takes the correlation structure into account in the splitting criterion.
The superiority of our new approach over trees-based models that do not account for the correlation is supported by simulation experiments and real data analyses.
arXiv Detail & Related papers (2021-02-16T12:30:48Z)
- Forest R-CNN: Large-Vocabulary Long-Tailed Object Detection and Instance Segmentation [75.93960390191262]
We exploit prior knowledge of the relations among object categories to cluster fine-grained classes into coarser parent classes.
We propose a simple yet effective resampling method, NMS Resampling, to re-balance the data distribution.
Our method, termed as Forest R-CNN, can serve as a plug-and-play module being applied to most object recognition models.
arXiv Detail & Related papers (2020-08-13T03:52:37Z)
- Pairwise Supervision Can Provably Elicit a Decision Boundary [84.58020117487898]
Similarity learning is the problem of eliciting useful representations by predicting the relationship between a pair of patterns.
We show that similarity learning is capable of solving binary classification by directly eliciting a decision boundary.
arXiv Detail & Related papers (2020-06-11T05:35:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and accepts no responsibility for any consequences of its use.