A Scalable MIP-based Method for Learning Optimal Multivariate Decision
Trees
- URL: http://arxiv.org/abs/2011.03375v1
- Date: Fri, 6 Nov 2020 14:17:41 GMT
- Title: A Scalable MIP-based Method for Learning Optimal Multivariate Decision
Trees
- Authors: Haoran Zhu, Pavankumar Murali, Dzung T. Phan, Lam M. Nguyen, Jayant R.
Kalagnanam
- Abstract summary: We propose a novel MIP formulation, based on a 1-norm support vector machine model, to train a multivariate ODT for classification problems.
We provide cutting plane techniques that tighten the linear relaxation of the MIP formulation, in order to improve run times to reach optimality.
We demonstrate that our formulation outperforms its counterparts in the literature by an average of about 10% in terms of mean out-of-sample testing accuracy.
- Score: 17.152864798265455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several recent publications report advances in training optimal decision
trees (ODT) using mixed-integer programs (MIP), due to algorithmic advances in
integer programming and a growing interest in addressing the inherent
suboptimality of heuristic approaches such as CART. In this paper, we propose a
novel MIP formulation, based on a 1-norm support vector machine model, to train
a multivariate ODT for classification problems. We provide cutting plane
techniques that tighten the linear relaxation of the MIP formulation, in order
to improve run times to reach optimality. Using 36 data-sets from the
University of California Irvine Machine Learning Repository, we demonstrate
that our formulation outperforms its counterparts in the literature by an
average of about 10% in terms of mean out-of-sample testing accuracy across the
data-sets. We provide a scalable framework to train multivariate ODT on large
data-sets by introducing a novel linear programming (LP) based data selection
method to choose a subset of the data for training. Our method is able to
routinely handle large data-sets with more than 7,000 sample points and
outperform heuristic methods and other MIP-based techniques. We present
results on data-sets containing up to 245,000 samples. Existing MIP-based
methods do not scale well on training data-sets beyond 5,500 samples.
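As a rough illustration of the 1-norm support vector machine building block mentioned in the abstract (not the authors' full MIP formulation, which also requires binary variables to route samples through the tree), the sketch below solves the LP form of a 1-norm SVM with scipy.optimize.linprog. The helper name l1_svm, the penalty C, and the toy data are illustrative assumptions.

import numpy as np
from scipy.optimize import linprog

def l1_svm(X, y, C=1.0):
    # Sketch only: min ||w||_1 + C*sum(xi)  s.t.  y_i*(w.x_i + b) >= 1 - xi_i, xi >= 0.
    # Splitting w = w_plus - w_minus makes the 1-norm objective linear.
    n, d = X.shape
    # Variable vector: [w_plus (d), w_minus (d), b (1), xi (n)]
    c = np.concatenate([np.ones(d), np.ones(d), [0.0], C * np.ones(n)])
    # Margin constraints rewritten as A_ub @ z <= b_ub:
    #   -y_i * (x_i.(w_plus - w_minus) + b) - xi_i <= -1
    A_ub = np.hstack([-y[:, None] * X, y[:, None] * X, -y[:, None], -np.eye(n)])
    b_ub = -np.ones(n)
    bounds = [(0, None)] * (2 * d) + [(None, None)] + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    z = res.x
    return z[:d] - z[d:2 * d], z[2 * d]  # (w, b)

# Toy usage on a separable 2-D problem with labels in {-1, +1}; in the paper's
# setting one such hyperplane would define the multivariate split at a tree node.
X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = l1_svm(X, y)
print("split hyperplane:", w, "intercept:", b)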
Related papers
- PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection [28.442470930703337]
PRISM is a training-free approach for efficient multimodal data selection.
It uses Pearson correlation analysis to quantify the intrinsic visual encoding properties of MLLMs.
It reduces the overall time required for visual instruction tuning and data selection to just 30% of conventional methods.
arXiv Detail & Related papers (2025-02-17T18:43:41Z) - Optimizing Pretraining Data Mixtures with LLM-Estimated Utility [52.08428597962423]
Large Language Models improve with increasing amounts of high-quality training data.
We find token-counts outperform manual and learned mixes, indicating that simple approaches for dataset size and diversity are surprisingly effective.
We propose two complementary approaches: UtiliMax, which extends token-based heuristics by incorporating utility estimates from reduced-scale ablations, achieving up to a 10.6x speedup over manual baselines; and Model Estimated Data Utility (MEDU), which leverages LLMs to estimate data utility from small samples, matching ablation-based performance while reducing computational requirements by ~200x.
arXiv Detail & Related papers (2025-01-20T21:10:22Z) - Computation-Aware Gaussian Processes: Model Selection And Linear-Time Inference [55.150117654242706]
We show that model selection for computation-aware GPs trained on 1.8 million data points can be done within a few hours on a single GPU.
As a result of this work, Gaussian processes can be trained on large-scale datasets without significantly compromising their ability to quantify uncertainty.
arXiv Detail & Related papers (2024-11-01T21:11:48Z) - Multi-Agent Collaborative Data Selection for Efficient LLM Pretraining [40.21546440726592]
We propose a novel multi-agent collaborative data selection mechanism for large language model (LLM) pretraining.
In this framework, each data selection method serves as an independent agent, and an agent console is designed to dynamically integrate the information from all agents.
arXiv Detail & Related papers (2024-10-10T16:45:28Z) - Align Your Steps: Optimizing Sampling Schedules in Diffusion Models [63.927438959502226]
Diffusion models (DMs) have established themselves as the state-of-the-art generative modeling approach in the visual domain and beyond.
A crucial drawback of DMs is their slow sampling speed, relying on many sequential function evaluations through large neural networks.
We propose a general and principled approach to optimizing the sampling schedules of DMs for high-quality outputs.
arXiv Detail & Related papers (2024-04-22T18:18:41Z) - How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z) - Minimally Supervised Learning using Topological Projections in
Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data, and then a minimal number of available labeled data points are assigned to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z) - Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z) - Deep Learning with Multiple Data Set: A Weighted Goal Programming
Approach [2.7393821783237184]
Large-scale data analysis is growing at an exponential rate as data proliferates in our societies.
Deep Learning models require plenty of resources, and distributed training is needed.
This paper presents a Multicriteria approach for distributed learning.
arXiv Detail & Related papers (2021-11-27T07:10:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.