Scaling Up Diffusion and Flow-based XGBoost Models
- URL: http://arxiv.org/abs/2408.16046v1
- Date: Wed, 28 Aug 2024 18:00:00 GMT
- Title: Scaling Up Diffusion and Flow-based XGBoost Models
- Authors: Jesse C. Cresswell, Taewoo Kim
- Abstract summary: We investigate a recent proposal to use XGBoost as the function approximator in diffusion and flow-matching models.
With a better implementation, it can be scaled to datasets 370x larger than previously used.
We present results on large-scale scientific datasets as part of the Fast Calorimeter Simulation Challenge.
- Score: 5.944645679491607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Novel machine learning methods for tabular data generation are often developed on small datasets which do not match the scale required for scientific applications. We investigate a recent proposal to use XGBoost as the function approximator in diffusion and flow-matching models on tabular data, which proved to be extremely memory intensive, even on tiny datasets. In this work, we conduct a critical analysis of the existing implementation from an engineering perspective, and show that these limitations are not fundamental to the method; with better implementation it can be scaled to datasets 370x larger than previously used. Our efficient implementation also unlocks scaling models to much larger sizes which we show directly leads to improved performance on benchmark tasks. We also propose algorithmic improvements that can further benefit resource usage and model performance, including multi-output trees which are well-suited to generative modeling. Finally, we present results on large-scale scientific datasets derived from experimental particle physics as part of the Fast Calorimeter Simulation Challenge. Code is available at https://github.com/layer6ai-labs/calo-forest.
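The core idea lends itself to a short sketch. Below is a minimal, illustrative example of conditional flow matching with an XGBoost regressor on tabular data, using the multi-output trees highlighted in the abstract. This is not the authors' calo-forest implementation: the synthetic data, the number of discrete time levels, and all hyperparameters are assumptions chosen only for illustration.
```python
# Minimal conceptual sketch of conditional flow matching with XGBoost on tabular
# data. NOT the authors' calo-forest implementation: the synthetic data, number of
# discrete time levels, and hyperparameters are illustrative assumptions only.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
x1 = rng.normal(size=(10_000, 8))      # placeholder "real" tabular samples
n_steps = 10                           # discrete time levels; one regressor per level

models = []
for t in np.linspace(0.0, 1.0, n_steps, endpoint=False):
    x0 = rng.normal(size=x1.shape)     # Gaussian noise samples
    xt = (1.0 - t) * x0 + t * x1       # linear interpolation between noise and data
    velocity = x1 - x0                 # conditional flow-matching regression target
    # Multi-output trees (XGBoost >= 2.0) predict all feature velocities jointly,
    # rather than fitting one separate model per output dimension.
    reg = xgb.XGBRegressor(
        tree_method="hist",
        multi_strategy="multi_output_tree",
        n_estimators=100,
        max_depth=6,
    )
    reg.fit(xt, velocity)
    models.append(reg)

# Sampling: integrate the learned velocity field from noise to data with Euler steps.
x = rng.normal(size=(1_000, x1.shape[1]))
dt = 1.0 / n_steps
for reg in models:
    x = x + dt * reg.predict(x)
```
The sketch only illustrates how gradient-boosted trees can stand in for the velocity approximator; the engineering and memory-efficiency improvements that the paper reports for scaling to much larger datasets are in the linked calo-forest repository.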
Related papers
- Exploiting Local Features and Range Images for Small Data Real-Time Point Cloud Semantic Segmentation [4.02235104503587]
In this paper, we harness information from the three-dimensional representation to capture local features effectively.
A GPU-based KDTree allows for rapid building, querying, and enhancing projection with straightforward operations.
We show that a reduced version of our model not only demonstrates strong competitiveness against full-scale state-of-the-art models but also operates in real-time.
arXiv Detail & Related papers (2024-10-14T13:49:05Z) - Generative Expansion of Small Datasets: An Expansive Graph Approach [13.053285552524052]
We introduce an Expansive Synthesis model generating large-scale, information-rich datasets from minimal samples.
An autoencoder with self-attention layers and optimal transport refines distributional consistency.
Results show comparable performance, demonstrating the model's potential to augment training data effectively.
arXiv Detail & Related papers (2024-06-25T02:59:02Z) - Generative Active Learning for Long-tailed Instance Segmentation [55.66158205855948]
We propose BSGAL, a new algorithm that estimates the contribution of generated data based on cache gradient.
Experiments show that BSGAL outperforms the baseline approach and effectively improves the performance of long-tailed segmentation.
arXiv Detail & Related papers (2024-06-04T15:57:43Z) - Pushing the Limits of Pre-training for Time Series Forecasting in the CloudOps Domain [54.67888148566323]
We introduce three large-scale time series forecasting datasets from the cloud operations domain.
We show it is a strong zero-shot baseline and benefits from further scaling, both in model and dataset size.
Accompanying these datasets and results is a suite of comprehensive benchmark results comparing classical and deep learning baselines to our pre-trained method.
arXiv Detail & Related papers (2023-10-08T08:09:51Z) - The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z) - Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - X-TIME: An in-memory engine for accelerating machine learning on tabular data with CAMs [19.086291506702413]
Modern tree-based Machine Learning models shine in extracting relevant information from structured data.
In this work, we focus on an overall analog-digital architecture implementing a novel increased precision analog CAM.
Results evaluated on a single chip at 16nm technology show 119x lower latency at 9740x higher throughput compared with a state-of-the-art GPU.
arXiv Detail & Related papers (2023-04-03T18:20:31Z) - A Framework for Large Scale Synthetic Graph Dataset Generation [2.248608623448951]
This work proposes a scalable synthetic graph generation tool that scales datasets up to production-size graphs.
The tool learns a series of parametric models from proprietary datasets that can be released to researchers.
We demonstrate the generalizability of the framework across a series of datasets.
arXiv Detail & Related papers (2022-10-04T22:41:33Z) - Condensing Graphs via One-Step Gradient Matching [50.07587238142548]
We propose a one-step gradient matching scheme, which performs gradient matching for only a single step without training the network weights.
Our theoretical analysis shows this strategy can generate synthetic graphs that lead to lower classification loss on real graphs.
In particular, we are able to reduce the dataset size by 90% while approximating up to 98% of the original performance.
arXiv Detail & Related papers (2022-06-15T18:20:01Z) - Top-KAST: Top-K Always Sparse Training [50.05611544535801]
We propose Top-KAST, a method that preserves constant sparsity throughout training.
We show that it performs comparably to or better than previous works when training models on the established ImageNet benchmark.
In addition to our ImageNet results, we also demonstrate our approach in the domain of language modeling.
arXiv Detail & Related papers (2021-06-07T11:13:05Z) - Deep Structure Learning using Feature Extraction in Trained Projection Space [0.0]
We introduce a network architecture using a self-adjusting, data-dependent version of the Radon transform (a linear data projection, also known as x-ray projection) to enable feature extraction via convolutions in lower-dimensional space.
The resulting framework, named PiNet, can be trained end-to-end and shows promising performance on volumetric segmentation tasks.
arXiv Detail & Related papers (2020-09-01T12:16:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all listed content) and is not responsible for any consequences of its use.