Related papers: Optimizing High-Dimensional Oblique Splits

URL: http://arxiv.org/abs/2503.14381v1
Date: Tue, 18 Mar 2025 16:14:38 GMT
Title: Optimizing High-Dimensional Oblique Splits
Authors: Chien-Ming Chi
Abstract summary: This paper explores optimizing high-dimensional $s$-sparse oblique splits from $\{(\vec{w}, \vec{w}^{\top}\boldsymbol{X}_{i}) : i\in \{1,\dots, n\}, \vec{w} \in \mathbb{R}^p, \| \vec{w} \|_{2} = 1, \| \vec{w} \|_{0} \leq s \}$ for growing oblique trees.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Orthogonal-split trees perform well, but evidence suggests oblique splits can enhance their performance. This paper explores optimizing high-dimensional $s$-sparse oblique splits from $\{(\vec{w}, \vec{w}^{\top}\boldsymbol{X}_{i}) : i\in \{1,\dots, n\}, \vec{w} \in \mathbb{R}^p, \| \vec{w} \|_{2} = 1, \| \vec{w} \|_{0} \leq s \}$ for growing oblique trees, where $ s $ is a user-defined sparsity parameter. We establish a connection between SID convergence and $s_0$-sparse oblique splits with $s_0\ge 1$, showing that the SID function class expands as $s_0$ increases, enabling the capture of more complex data-generating functions such as the $s_0$-dimensional XOR function. Thus, $s_0$ represents the unknown potential complexity of the underlying data-generating function. Learning these complex functions requires an $s$-sparse oblique tree with $s \geq s_0$ and greater computational resources. This highlights a trade-off between statistical accuracy, governed by the SID function class size depending on $s_0$, and computational cost. In contrast, previous studies have explored the problem of SID convergence using orthogonal splits with $ s_0 = s = 1 $, where runtime was less critical. Additionally, we introduce a practical framework for oblique trees that integrates optimized oblique splits alongside orthogonal splits into random forests. The proposed approach is assessed through simulations and real-data experiments, comparing its performance against various oblique tree models.
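To make the split set in the abstract concrete, the following Python sketch draws candidate directions $\vec{w}$ with $\| \vec{w} \|_{2} = 1$ and $\| \vec{w} \|_{0} \leq s$, projects the data onto each one, and keeps the direction and threshold with the lowest sum of squared errors. Plain random search is used here only as a stand-in; it is not the optimization procedure proposed in the paper. All function names and parameters (`sample_sparse_direction`, `best_threshold`, `random_search_oblique_split`, `n_candidates`) are hypothetical and chosen for illustration.

```python
import numpy as np


def sample_sparse_direction(p, s, rng):
    """Draw a random w in R^p with ||w||_2 = 1 and ||w||_0 <= s."""
    w = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)  # s active coordinates
    w[support] = rng.standard_normal(s)
    return w / np.linalg.norm(w)


def best_threshold(z, y):
    """Return (sse, cut) for the best threshold on the projected values z."""
    order = np.argsort(z)
    z_sorted, y_sorted = z[order], y[order]
    n = len(y)
    csum = np.cumsum(y_sorted)
    csum_sq = np.cumsum(y_sorted ** 2)
    best_sse, best_cut = np.inf, None
    for k in range(1, n):  # k = number of samples in the left child
        if z_sorted[k] == z_sorted[k - 1]:
            continue  # cannot cut between identical projected values
        left = csum_sq[k - 1] - csum[k - 1] ** 2 / k
        right = (csum_sq[-1] - csum_sq[k - 1]) - (csum[-1] - csum[k - 1]) ** 2 / (n - k)
        if left + right < best_sse:
            best_sse = left + right
            best_cut = 0.5 * (z_sorted[k - 1] + z_sorted[k])
    return best_sse, best_cut


def random_search_oblique_split(X, y, s, n_candidates, rng):
    """Random search over s-sparse unit directions (illustrative stand-in only)."""
    best = (np.inf, None, None)
    for _ in range(n_candidates):
        w = sample_sparse_direction(X.shape[1], s, rng)
        sse, cut = best_threshold(X @ w, y)
        if sse < best[0]:
            best = (sse, w, cut)
    return best  # (sse, w, cut)


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # boundary is oblique in (x0, x1)
sse_orth, _, _ = random_search_oblique_split(X, y, s=1, n_candidates=50, rng=rng)
sse_obl, w_obl, _ = random_search_oblique_split(X, y, s=2, n_candidates=300, rng=rng)
print(sse_orth, sse_obl, np.count_nonzero(w_obl))
```

With $s = 1$ every candidate is an orthogonal (axis-aligned) split, so the same routine covers both cases. Raising $s$ enlarges the candidate set and the search cost, which loosely mirrors the trade-off between statistical accuracy and computation that the abstract describes.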

Related papers

Fast unsupervised ground metric learning with tree-Wasserstein distance [14.235762519615175]
Unsupervised ground metric learning approaches have been introduced. One promising option employs Wasserstein singular vectors (WSVs), which emerge when computing optimal transport distances between features and samples simultaneously. We propose to augment the WSV method by embedding samples and features on trees, on which we compute the tree-Wasserstein distance (TWD).
arXiv Detail & Related papers (2024-11-11T23:21:01Z)
Memory-Constrained Algorithms for Convex Optimization via Recursive Cutting-Planes [23.94542304111204]
First class of algorithms that provides a positive trade-off between gradient descent and cutting-plane methods in any regime with $\epsilon \leq 1/\sqrt{d}$. In the regime $\epsilon \leq d^{-\Omega(d)}$, our algorithm with $p=d$ achieves the information-theoretic optimal memory usage and improves the oracle-complexity of gradient descent.
arXiv Detail & Related papers (2023-06-16T17:00:51Z)
Simplifying and Understanding State Space Models with Diagonal Linear RNNs [56.33053691749856]
This work disposes of the discretization step, and proposes a model based on vanilla Diagonal Linear RNNs. We empirically show that, despite being conceptually much simpler, $\mathrm{DLR}$ is as performant as previously-proposed SSMs. We also characterize the expressivity of SSMs and attention-based models via a suite of $13$ synthetic sequence-to-sequence tasks.
arXiv Detail & Related papers (2022-12-01T18:53:06Z)
Generalization Properties of Decision Trees on Real-valued and Categorical Features [2.370481325034443]
We revisit binary decision trees from the perspective of partitions of the data. We consider three types of features: real-valued, categorical ordinal and categorical nominal, with different split rules for each.
arXiv Detail & Related papers (2022-10-18T21:50:24Z)
Algorithms for Discrepancy, Matchings, and Approximations: Fast, Simple, and Practical [1.2183405753834562]
Given a finite set system $(X,\mathcal{S})$, the discrepancy of a two-coloring $\chi: X \to \{-1,1\}$ is defined as $\max_{S \in \mathcal{S}} |\chi(S)|$. We propose a randomized algorithm which, for any $d>0$ and $(X,\mathcal{S})$ with dual shatter function $\pi^*(k)=O(k^d)$, returns a coloring with expected discrepancy
arXiv Detail & Related papers (2022-09-02T15:59:09Z)
Optimal Gradient Sliding and its Application to Distributed Optimization Under Similarity [121.83085611327654]
We study structured convex optimization problems with additive objective $r:=p + q$, where $r$ is $\mu$-strongly convex. We propose a method to solve such problems with respect to master-to-agents communication and local calls. The proposed method is much sharper than the $\mathcal{O}(\sqrt{L_q/\mu})$ method.
arXiv Detail & Related papers (2022-05-30T14:28:02Z)
Empirical complexity of comparator-based nearest neighbor descent [0.0]
A Java parallel streams implementation of the $K$-nearest neighbor descent algorithm is presented. Experiments with the Kullback-Leibler divergence Comparator support the prediction that the number of rounds of $K$-nearest neighbor updates need not exceed twice the diameter.
arXiv Detail & Related papers (2022-01-30T21:37:53Z)
Fast Graph Sampling for Short Video Summarization using Gershgorin Disc Alignment [52.577757919003844]
We study the problem of efficiently summarizing a short video into several paragraphs, leveraging recent progress in fast graph sampling. Experimental results show that our algorithm achieves video summarization comparable to state-of-the-art methods, at a substantially reduced complexity.
arXiv Detail & Related papers (2021-10-21T18:43:00Z)
Universal guarantees for decision tree induction via a higher-order splitting criterion [16.832966312395126]
Our algorithm achieves provable guarantees for all target functions $f : \{-1,1\}^n \to \{-1,1\}$ with respect to the uniform distribution. The crux of our extension is a new splitting criterion that takes into account the correlations between $f$ and small subsets of its attributes. Our algorithm satisfies the following guarantee: for all target functions $f : \{-1,1\}^n \to \{-1,1\}$, sizes $s \in \mathbb{N}$, and error parameters $\epsilon$, it constructs a decision
arXiv Detail & Related papers (2020-10-16T21:20:45Z)
An Algorithm for Learning Smaller Representations of Models With Scarce Data [0.0]
We present a greedy algorithm for solving binary classification problems in situations where the dataset is too small or not fully representative. It relies on a trained model with loose accuracy constraints, an iterative hyperparameter pruning procedure, and a function used to generate new data.
arXiv Detail & Related papers (2020-10-15T19:17:51Z)
Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We show that HSDMPG can attain an $\mathcal{O}\big(1/sttn\big)$ which is at the order of excess error on a learning model. For loss factors, we prove that HSDMPG can attain an $\mathcal{O}\big(1/sttn\big)$ which is at the order of excess error on a learning model.
arXiv Detail & Related papers (2020-09-18T02:18:44Z)
A Randomized Algorithm to Reduce the Support of Discrete Measures [79.55586575988292]
Given a discrete probability measure supported on $N$ atoms and a set of $n$ real-valued functions, there exists a probability measure that is supported on a subset of $n+1$ of the original $N$ atoms. We give a simple geometric characterization of barycenters via negative cones and derive a randomized algorithm that computes this new measure by "greedy geometric sampling". We then study its properties, and benchmark it on synthetic and real-world data to show that it can be very beneficial in the $N \gg n$ regime.
arXiv Detail & Related papers (2020-06-02T16:38:36Z)
On the Modularity of Hypernetworks [103.1147622394852]
We show that for a structured target function, the overall number of trainable parameters in a hypernetwork is smaller by orders of magnitude than the number of trainable parameters of a standard neural network and an embedding method.
arXiv Detail & Related papers (2020-02-23T22:51:52Z)
Agnostic Q-learning with Function Approximation in Deterministic Systems: Tight Bounds on Approximation Error and Sample Complexity [94.37110094442136]
We study the problem of agnostic $Q$-learning with function approximation in deterministic systems. We show that if $\delta = O\left(\rho/\sqrt{\dim_E}\right)$, then one can find the optimal policy using $O\left(\dim_E\right)$.
arXiv Detail & Related papers (2020-02-17T18:41:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.