RFX: High-Performance Random Forests with GPU Acceleration and QLORA Compression
- URL: http://arxiv.org/abs/2511.19493v1
- Date: Sun, 23 Nov 2025 12:00:33 GMT
- Title: RFX: High-Performance Random Forests with GPU Acceleration and QLORA Compression
- Authors: Chris Kuchar,
- Abstract summary: RFX (Random Forests X), where X stands for compression or quantization, presents a production-ready implementation of Breiman and Cutler's Random Forest classification methodology in Python.<n>RFX v1.0 provides complete classification: out-of-bag error estimation, overall and local importance measures, proximity matrices with QLORA compression, case-wise analysis, and interactive visualization (rfviz)<n> Regression, unsupervised learning, CLIQUE importance, and RF-GAP proximity are planned for v2.0.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: RFX (Random Forests X), where X stands for compression or quantization, presents a production-ready implementation of Breiman and Cutler's Random Forest classification methodology in Python. RFX v1.0 provides complete classification: out-of-bag error estimation, overall and local importance measures, proximity matrices with QLORA compression, case-wise analysis, and interactive visualization (rfviz)--all with CPU and GPU acceleration. Regression, unsupervised learning, CLIQUE importance, and RF-GAP proximity are planned for v2.0. This work introduces four solutions addressing the proximity matrix memory bottleneck limiting Random Forest analysis to ~60,000 samples: (1) QLORA (Quantized Low-Rank Adaptation) compression for GPU proximity matrices, reducing memory from 80GB to 6.4MB for 100k samples (12,500x compression with INT8 quantization) while maintaining 99% geometric structure preservation, (2) CPU TriBlock proximity--combining upper-triangle storage with block-sparse thresholding--achieving 2.7x memory reduction with lossless quality, (3) SM-aware GPU batch sizing achieving 95% GPU utilization, and (4) GPU-accelerated 3D MDS visualization computing embeddings directly from low-rank factors using power iteration. Validation across four implementation modes (GPU/CPU x case-wise/non-case-wise) demonstrates correct implementation. GPU achieves 1.4x speedup over CPU for overall importance with 500+ trees. Proximity computation scales from 1,000 to 200,000+ samples (requiring GPU QLORA), with CPU TriBlock filling the gap for medium-scale datasets (10K-50K samples). RFX v1.0 eliminates the proximity memory bottleneck, enabling proximity-based Random Forest analysis on datasets orders of magnitude larger than previously feasible. Open-source production-ready classification following Breiman and Cutler's original methodology.
Related papers
- GPU-Accelerated Algorithms for Graph Vector Search: Taxonomy, Empirical Study, and Research Directions [54.570944939061555]
We present a comprehensive study of GPU-accelerated graph-based vector search algorithms.<n>We establish a detailed taxonomy of GPU optimization strategies and clarify the mapping between algorithmic tasks and hardware execution units.<n>Our findings offer clear guidelines for designing scalable and robust GPU-powered approximate nearest neighbor search systems.
arXiv Detail & Related papers (2026-02-10T16:18:04Z) - Karhunen-Loève Expansion-Based Residual Anomaly Map for Resource-Efficient Glioma MRI Segmentation [0.0]
Deep learning is currently the state-of-the-art for brain tumor segmentation.<n>State-of-the-art models use thousands of training cases and vast computational power, where performance drops sharply when either is limited.<n>This study implements a feature extraction step on downsampled, z-score normalized MRI volumes.<n>It achieves post-processed Dice scores of 0.929 (WT), 0.856 (TC), and 0.821 (ET), with HD95 distances of 2.93, 6.78, and 10.35 voxels.
arXiv Detail & Related papers (2026-01-16T23:48:40Z) - GPU-Accelerated ANNS: Quantized for Speed, Built for Change [1.8419317899207142]
Current Approximate nearest neighbor search (ANNS) systems face three key limitations.<n>Current systems lack efficient quantization techniques that reduce data movement without introducing costly random memory accesses.<n>We present Jasper, a GPU-accelerated ANNS system with both high query throughput and upability.
arXiv Detail & Related papers (2026-01-11T19:51:54Z) - FuseSampleAgg: Fused Neighbor Sampling and Aggregation for Mini-batch GNNs [51.56484100374058]
FuseSampleAgg fuses neighbor and sampling mean aggregation into a single pass for GraphSAGE.<n>Operator is deterministic, integrates with standard PyTorchs, and ships with scripts that reproduce all tables and figures from CSV logs.
arXiv Detail & Related papers (2025-11-17T17:57:18Z) - Efficient Low Rank Attention for Long-Context Inference in Large Language Models [41.24530756499533]
Low Rank Query and Key attention (LRQK) is a framework that decomposes the full-precision query and key matrices into compact rank-(r) factors during the prefill stage.<n>By selecting only the top-(k) tokens and a small fixed set of recent tokens, LRQK employs a mixed GPU- CPU cache with a hit-and-miss mechanism that transfers only missing full-precision KV pairs.
arXiv Detail & Related papers (2025-10-25T11:43:27Z) - xMem: A CPU-Based Approach for Accurate Estimation of GPU Memory in Deep Learning Training Workloads [2.2991119948183525]
estimation of how much GPU memory a job will require is fundamental to enabling advanced scheduling and GPU sharing.<n>We propose xMem, a novel framework that leverages CPU-only dynamic analysis to accurately estimate peak GPU memory requirements.<n>The analysis of 5209 runs, which includes ANOVA and Monte Carlo results, highlights xMem's benefits.
arXiv Detail & Related papers (2025-10-23T23:16:27Z) - 70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float [52.079202872069835]
Large-scale AI models, such as Large Language Models (LLMs) and Diffusion Models (DMs) have grown rapidly in size.<n>We introduce Dynamic-Length Float (DFloat11), a compression framework that reduces LLM and DM size by 30% while preserving outputs that are bit-for-bit identical to the original model.
arXiv Detail & Related papers (2025-04-15T22:38:38Z) - HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading [79.38548165722229]
HEADINFER offloads the KV cache to CPU RAM while avoiding the need to fully store the KV cache for any transformer layer on the GPU.<n>We demonstrate HEADINFER maintains computational efficiency while significantly reducing memory footprint.
arXiv Detail & Related papers (2025-02-18T06:26:05Z) - KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation [7.204881999658682]
Key-Value cache is used to store intermediate activations for large language models.<n>The memory required for the KV cache grows rapidly, often exceeding the capacity of GPU memory.<n>Existing methods attempt to address these issues by overlapping GPU computation with I/O or employing CPU-GPU heterogeneous execution.<n>We introduce KVPR, an efficient I/O-aware LLM inference method where the CPU first transfers a partial set of activations.<n> KVPR achieves up to 35.8% lower latency and 46.2% higher throughput during decoding compared to state-of-the-art approaches.
arXiv Detail & Related papers (2024-11-26T04:03:14Z) - Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrary small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
arXiv Detail & Related papers (2024-10-22T17:59:30Z) - vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z) - LLMem: Estimating GPU Memory Usage for Fine-Tuning Pre-Trained LLMs [4.536118764799076]
Fine-tuning pre-trained large language models with limited hardware presents challenges due to GPU memory constraints.
We introduce LLMem, a solution that estimates the GPU memory consumption when applying distributed fine-tuning methods.
We show that LLMem accurately estimates peak GPU memory usage on a single GPU, with error rates of up to 1.6%.
arXiv Detail & Related papers (2024-04-16T22:11:35Z) - HEAT: A Highly Efficient and Affordable Training System for
Collaborative Filtering Based Recommendation on CPUs [11.007606356081435]
Collaborative filtering (CF) has been proven to be one of the most effective techniques for recommendation.
There is no work that optimized SimpleX on multi-core CPUs, leading to limited performance.
We propose an efficient CF training system (called HEAT) that fully enables the multi-level caching and multi-threading capabilities of modern CPUs.
arXiv Detail & Related papers (2023-04-14T18:07:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.