From Theory to Throughput: CUDA-Optimized APML for Large-Batch 3D Learning
- URL: http://arxiv.org/abs/2512.19743v1
- Date: Wed, 17 Dec 2025 23:18:51 GMT
- Title: From Theory to Throughput: CUDA-Optimized APML for Large-Batch 3D Learning
- Authors: Sasan Sharifipour, Constantino Álvarez Casado, Manuel Lage Cañellas, Miguel Bordallo López
- Abstract summary: Chamfer Distance is efficient but permits many-to-one correspondences, while Earth Mover Distance better reflects one-to-one transport at high computational cost. CUDA-APML is a sparse GPU implementation that thresholds negligible assignments and runs adaptive softmax, bidirectional symmetrization, and Sinkhorn normalization directly in COO form.
- Score: 8.063701386493289
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Loss functions are fundamental to learning accurate 3D point cloud models, yet common choices trade geometric fidelity for computational cost. Chamfer Distance is efficient but permits many-to-one correspondences, while Earth Mover Distance better reflects one-to-one transport at high computational cost. APML approximates transport with differentiable Sinkhorn iterations and an analytically derived temperature, but its dense formulation scales quadratically in memory. We present CUDA-APML, a sparse GPU implementation that thresholds negligible assignments and runs adaptive softmax, bidirectional symmetrization, and Sinkhorn normalization directly in COO form. This yields near-linear memory scaling and preserves gradients on the stored support, while pairwise distance evaluation remains quadratic in the current implementation. On ShapeNet and MM-Fi, CUDA-APML matches dense APML within a small tolerance while reducing peak GPU memory by 99.9%. Code available at: https://github.com/Multimodal-Sensing-Lab/apml
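The sparse pipeline described in the abstract (threshold negligible assignments, then normalize only the stored entries) can be illustrated with a minimal pure-Python sketch. This is not the authors' CUDA code: the function name `sparse_sinkhorn_coo`, the fixed `temperature`, the `threshold` value, and the iteration count are all assumptions made for illustration. The paper derives the temperature analytically and also applies bidirectional symmetrization, both of which are omitted here. As in the abstract, pairwise distances are still evaluated for every pair; only the stored support is sparse.

```python
import math


def sparse_sinkhorn_coo(x, y, temperature=0.05, threshold=1e-3, iters=20):
    """Illustrative sketch: softmax assignments kept in COO form, then Sinkhorn.

    Returns parallel lists (rows, cols, vals, dists) describing the stored support.
    All parameter defaults are hypothetical, not values from the paper.
    """
    rows, cols, vals, dists = [], [], [], []
    for i, p in enumerate(x):
        # Distance evaluation remains quadratic: every (i, j) pair is visited.
        d = [math.dist(p, q) for q in y]
        d_min = min(d)
        # Numerically stable softmax over negative scaled distances.
        w = [math.exp(-(dj - d_min) / temperature) for dj in d]
        total = sum(w)
        for j, wj in enumerate(w):
            prob = wj / total
            if prob >= threshold:  # drop negligible assignments
                rows.append(i)
                cols.append(j)
                vals.append(prob)
                dists.append(d[j])

    # Sinkhorn normalization restricted to the stored entries.
    for _ in range(iters):
        col_sum = [0.0] * len(y)
        for j, v in zip(cols, vals):
            col_sum[j] += v
        vals = [v / col_sum[j] for j, v in zip(cols, vals)]
        row_sum = [0.0] * len(x)
        for i, v in zip(rows, vals):
            row_sum[i] += v
        vals = [v / row_sum[i] for i, v in zip(rows, vals)]
    return rows, cols, vals, dists


# Toy point sets: each target is a slightly shifted copy of one source point.
x = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
y = [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0), (0.0, 1.1, 0.0)]
rows, cols, vals, dists = sparse_sinkhorn_coo(x, y)
# Transport-style loss: expected distance under the sparse plan.
loss = sum(v * d for v, d in zip(vals, dists))
```

On this toy input only the three near-diagonal pairs survive the threshold, so the plan stores 3 entries instead of 9. In general the memory used by the plan grows with the number of retained entries rather than with the full pairwise matrix.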
Related papers
- Space Filling Curves is All You Need: Communication-Avoiding Matrix Multiplication Made Simple [42.09057806159106]
General matrix multiplication is the cornerstone of Deep Learning and HPC workloads. Modern platforms with matrix multiplication accelerators exhibit high FLOP/Byte machine balance. In this work we revisit space filling curves (SFC) to alleviate the problem of this cumbersome tuning. We obtain platform-oblivious and shape-oblivious matrix-multiplication schemes that exhibit an inherently high degree of data locality.
arXiv Detail & Related papers (2026-01-22T19:56:16Z) - APML: Adaptive Probabilistic Matching Loss for Robust 3D Point Cloud Reconstruction [16.82777427285544]
Training deep learning models for point cloud prediction tasks depends critically on loss functions that measure discrepancies between predicted and ground-truth point sets. We propose Adaptive Probabilistic Matching Loss (APML), a fully differentiable approximation of one-to-one matching. We analytically compute the temperature to guarantee a minimum probability, eliminating manual tuning.
arXiv Detail & Related papers (2025-09-09T19:31:06Z) - Pseudo Depth Meets Gaussian: A Feed-forward RGB SLAM Baseline [64.42938561167402]
We propose an online 3D reconstruction method using 3D Gaussian-based SLAM, combined with a feed-forward recurrent prediction module. This approach replaces slow test-time optimization with fast network inference, significantly improving tracking speed. Our method achieves performance on par with the state-of-the-art SplaTAM, while reducing tracking time by more than 90%.
arXiv Detail & Related papers (2025-08-06T16:16:58Z) - FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models [49.397861654088636]
We propose a two-step procedure to approximate SVD/QR-based gradient projections into lower-dimensional spaces. We show that our strategy achieves faster runtime and reduced memory usage by up to 25% across different model sizes.
arXiv Detail & Related papers (2025-05-23T14:37:00Z) - Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrary small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
arXiv Detail & Related papers (2024-10-22T17:59:30Z) - MobiZO: Enabling Efficient LLM Fine-Tuning at the Edge via Inference Engines [28.18421624702502]
We introduce MobiZO, a resource-efficient fine-tuning framework for Large Language Models (LLMs) specifically designed for edge devices. Experiments demonstrate that MobiZO achieves substantial runtime speedups and memory savings while improving fine-tuning accuracy.
arXiv Detail & Related papers (2024-09-23T20:14:09Z) - Thinking Forward: Memory-Efficient Federated Finetuning of Language Models [21.438831528354513]
Finetuning large language models (LLMs) in federated learning settings requires excessive memory for resource-constrained devices.
In this paper, we introduce Spry, an FL algorithm that splits trainable weights of an LLM among participating clients.
Spry achieves a low memory footprint, high accuracy, and fast convergence.
arXiv Detail & Related papers (2024-05-24T13:37:48Z) - Scaling Sparse Fine-Tuning to Large Language Models [67.59697720719672]
Large Language Models (LLMs) are difficult to fully fine-tune due to their sheer number of parameters.
We propose SpIEL, a novel sparse finetuning method which maintains an array of parameter indices and the deltas of these parameters relative to their pretrained values.
We show that SpIEL is superior to popular parameter-efficient fine-tuning methods like LoRA in terms of performance and comparable in terms of run time.
arXiv Detail & Related papers (2024-01-29T18:43:49Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Scalable Optimal Transport in High Dimensions for Graph Distances, Embedding Alignment, and More [7.484063729015126]
We propose two effective log-linear time approximations of the cost matrix for optimal transport.
These approximations enable general log-linear time algorithms for entropy-regularized OT that perform well even in complex, high-dimensional spaces.
For graph distance regression we propose the graph transport network (GTN), which combines graph neural networks (GNNs) with enhanced Sinkhorn.
arXiv Detail & Related papers (2021-07-14T17:40:08Z) - Fast and Scalable Optimal Transport for Brain Tractograms [4.610968512889579]
We present a new multiscale algorithm for solving regularized Optimal Transport problems on a linear memory footprint.
We show the effectiveness of this approach on brain tractograms modeled either as bundles of fibers or as track density maps.
arXiv Detail & Related papers (2021-07-05T13:28:41Z) - FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often have a large number of parameters and incur heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.