Wide Boosting
- URL: http://arxiv.org/abs/2007.09855v4
- Date: Sun, 6 Nov 2022 03:15:10 GMT
- Title: Wide Boosting
- Authors: Michael T. Horrell
- Abstract summary: This paper presents a simple adjustment to Gradient Boosting motivated in part by artificial neural networks.
We call our method Wide Boosting (WB) and show that WB outperforms GB on multi-dimensional output tasks.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gradient Boosting (GB) is a popular methodology used to solve prediction
problems by minimizing a differentiable loss function, $L$. GB performs very
well on tabular machine learning (ML) problems; however, as a pure ML solver it
lacks the ability to fit models with probabilistic but correlated
multi-dimensional outputs, for example, multiple correlated Bernoulli outputs.
GB also does not form intermediate abstract data embeddings, one property of
Deep Learning that gives greater flexibility and performance on other types of
problems. This paper presents a simple adjustment to GB motivated in part by
artificial neural networks. Specifically, our adjustment inserts a matrix
multiplication between the output of a GB model and the loss, $L$. This allows
the output of a GB model to have increased dimension prior to being fed into
the loss and is thus ``wider'' than standard GB implementations. We call our
method Wide Boosting (WB) and show that WB outperforms GB on multi-dimensional
output tasks and that the embeddings generated by WB are more useful in
downstream prediction tasks than GB output predictions alone.
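To make the adjustment concrete, below is a minimal sketch of the idea as described in the abstract: the booster produces a "wide" raw output, a matrix maps that output into the dimension expected by the loss (so the composed loss has the form $L(y, F\,B)$ in this sketch's notation), and gradients flow back through the matrix by the chain rule so that ordinary GB tree-fitting can proceed on the wide output. The squared-error loss and all names here (`wide_dim`, `B`, `loss_and_grad`) are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the Wide Boosting adjustment described above (illustrative
# assumptions only; this is not the paper's code).  A GB library would fit
# trees to the gradient returned here; we only show how the inserted matrix
# B widens the booster's output and how gradients pass back through it.
import numpy as np

rng = np.random.default_rng(0)

n, out_dim, wide_dim = 100, 3, 8           # samples, target dim, widened dim
y = rng.normal(size=(n, out_dim))          # multi-dimensional targets
F = np.zeros((n, wide_dim))                # "wide" raw output of the booster
B = rng.normal(size=(wide_dim, out_dim))   # matrix inserted before the loss

def loss_and_grad(F, y, B):
    """Squared-error loss (an illustrative choice) on the projected
    predictions F @ B, and the gradient with respect to the wide output F,
    which is what each boosting round would fit its trees to."""
    pred = F @ B                           # map wide output into loss space
    resid = pred - y
    loss = 0.5 * np.mean(np.sum(resid ** 2, axis=1))
    grad_pred = resid / n                  # dL/dpred
    grad_F = grad_pred @ B.T               # chain rule back to the wide output
    return loss, grad_F

loss, grad_F = loss_and_grad(F, y, B)
print(f"loss: {loss:.4f}, gradient w.r.t. wide output: {grad_F.shape}")
```

In this sketch the wide output `F` plays the role of the intermediate embedding the abstract says can be reused in downstream prediction tasks, while `B` is the matrix multiplication inserted between the GB model and the loss.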
Related papers
- BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation [54.28841287750586]
Large language models (LLMs) have demonstrated outstanding performance on various tasks, such as text summarization and text question answering, but their size makes them costly to deploy.
Existing solutions such as SparseGPT and Wanda attempt to alleviate this issue through weight pruning.
This paper introduces a novel LLM pruning technique dubbed blockwise parameter-efficient sparsity allocation (BESA), which applies a blockwise reconstruction loss.
arXiv Detail & Related papers (2024-02-18T12:44:15Z) - HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z) - Zero-Space Cost Fault Tolerance for Transformer-based Language Models on
ReRAM [27.354689865791638]
Resistive Random Access Memory (ReRAM) has emerged as a promising platform for deep neural networks (DNNs).
Hardware failures, such as stuck-at-fault defects, can result in significant prediction errors during model inference.
We propose a fault protection mechanism that incurs zero space cost.
arXiv Detail & Related papers (2024-01-22T02:50:38Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Actually Sparse Variational Gaussian Processes [20.71289963037696]
We propose a new class of inter-domain variational GPs constructed by projecting a GP onto a set of compactly supported B-spline basis functions.
This allows us to very efficiently model fast-varying spatial phenomena with tens of thousands of inducing variables.
arXiv Detail & Related papers (2023-04-11T09:38:58Z) - Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models [57.933500846742234]
Recent work recognizes that structured outliers are the critical bottleneck for quantization performance.
We propose an outlier suppression framework including two components: Gamma Migration and Token-Wise Clipping.
This framework effectively suppresses the outliers and can be used in a plug-and-play mode.
arXiv Detail & Related papers (2022-09-27T12:05:59Z) - Automatic Mixed-Precision Quantization Search of BERT [62.65905462141319]
Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks.
These models usually contain millions of parameters, which prevents them from practical deployment on resource-constrained devices.
We propose an automatic mixed-precision quantization framework designed for BERT that can simultaneously conduct quantization and pruning at a subgroup-wise level.
arXiv Detail & Related papers (2021-12-30T06:32:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.